Sunday Data/Statistics Link Roundup (10/14/12)

  1. A fascinating article about the debate on whether to regulate sugary beverages. One of the protagonists is David Allison, a statistical geneticist, among other things. It is interesting to see the interplay of statistical analysis and public policy, and it is yet another example of how statistics and data will drive some of the most important policy decisions going forward.
  2. A related article is this one on the way risk is reported in the media. It is becoming more and more clear that to be an educated member of society now means that you absolutely have to have a basic understanding of the concepts of statistics. Both leaders and the general public share responsibility for the danger of misinterpreting risk or misleading with it.
  3. A press release from the Census Bureau about how the choice of college major can have a major impact on career earnings. More data breaking the results down by employment characteristics and major are here and here. These data update some of the data we have talked about before in calculating expected salaries by major. (via Scott Z.)
  4. An interesting article about Recorded Future that describes how they are using social media and other data to try to predict events that will happen. I think this isn’t an entirely crazy idea, but the thing that always strikes me about these sorts of projects is how hard it is to measure success. It is highly unlikely you will ever exactly predict a future event, so how do you define how close you were? For instance, if you predicted an uprising in Egypt, but missed by a month, is that a good or a bad prediction?
  5. Seriously guys, this is getting embarrassing. An article appears in the New England Journal of Medicine “finding” an association between chocolate consumption and Nobel prize winners. This is, of course, a horrible statistical analysis, and unless it was published as a joke, it was irresponsible of the NEJM to publish it. I’ll bet any student in Stat 101 could find the huge flaws in this analysis. If the editors of the major scientific journals want to continue publishing statistical papers, they should get serious about statistical editing.

Prediction contest

I have been seeing this paper all over Twitter/the blogosphere. It’s a sexy idea: can you predict how “high-impact” a scientist will be in the future? It is also a pretty flawed data analysis, so this week’s prediction contest is to identify why the statistics in this paper are so flawed. On my first read I noticed about 5 major flaws.

Editor’s note: I posted the criticisms and the authors responded here: http://disq.us/8bmrhl


Prediction: the Lasso vs. just using the top 10 predictors

One incredibly popular tool for the analysis of high-dimensional data is the lasso. The lasso is commonly used in cases where you have many more predictors than independent samples (the n ≪ p problem). It is also often used in the context of prediction.

Suppose you have an outcome Y and several predictors X1,…,XM. The lasso fits the model:

Y = B0 + B1 X1 + B2 X2 + … + BM XM + E

subject to a constraint on the sum of the absolute values of the B coefficients. The result is that (1) some of the coefficients get set to zero, and those variables drop out of the model, and (2) other coefficients are “shrunk” toward zero. Dropping some variables is good because there are a lot of potentially unimportant variables. Shrinking coefficients may be good, since the big coefficients might be just the ones that were really big by random chance (this is related to Andrew Gelman’s type M errors).
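
Written out, the constrained fit is usually stated as below. The sample index i, the sample size n, and the bound t are notation introduced here for illustration; they do not appear in the post.

```latex
\hat{B} = \arg\min_{B_0, \ldots, B_M}
  \sum_{i=1}^{n} \Big( Y_i - B_0 - \sum_{j=1}^{M} B_j X_{ij} \Big)^2
\quad \text{subject to} \quad \sum_{j=1}^{M} |B_j| \le t
```

Smaller values of t force more coefficients to exactly zero, while a large enough t leaves the least-squares fit unconstrained.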

I work in genomics, where n ≪ p problems come up all the time. Whenever I use the lasso or when I read papers where the lasso is used for prediction, I always think: “How does this compare to just using the top 10 most significant predictors?” I have asked this out loud enough that some people around here started calling it the “Leekasso” to poke fun at me. So I’m going to call it that in a thinly veiled attempt to avoid Stigler’s law of eponymy (actually Rafa points out that using this name is a perfect example of this law, since this feature selection approach has been proposed before at least once).

Here is how the Leekasso works. You fit each of the models:

Y = B0 + BkXk + E

take the 10 variables with the smallest p-values from testing the Bk coefficients, then fit a linear model with just those 10 variables. You never use 9 or 11; the Leekasso is always 10.
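
The procedure above can be sketched in a few lines. This is a minimal illustration in Python with NumPy, not the R code used for the analysis; the function names `leekasso_fit` and `leekasso_predict` are made up for this sketch. It ranks predictors by absolute marginal correlation with Y, which gives the same ordering as ranking by p-value because the t-statistic for Bk in a one-predictor model is a monotone function of that correlation. It assumes no predictor column is constant.

```python
import numpy as np

def leekasso_fit(X, y, k=10):
    """Keep the k predictors most associated with y, then fit OLS on them.

    X is an (n_samples, n_predictors) array and y an (n_samples,) array.
    Returns the selected column indices and the fitted coefficients
    (intercept first).
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Marginal correlation of every column with y. Sorting by |r| is
    # equivalent to sorting the single-predictor p-values from smallest up.
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    top = np.argsort(-np.abs(r))[:k]
    # Ordinary least squares on an intercept plus the k selected columns.
    design = np.column_stack([np.ones(n), X[:, top]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return top, coef

def leekasso_predict(X, top, coef):
    """Predict the outcome for new rows using the selected columns only."""
    return coef[0] + X[:, top] @ coef[1:]
```

Calling `leekasso_fit(X_train, y_train)` and then `leekasso_predict(X_test, top, coef)` gives test-set predictions; `k` is exposed as a parameter only for convenience, since the Leekasso as defined always uses 10.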

For fun I did an experiment to compare the accuracy of the Leekasso and the Lasso.

Here is the setup:

  • For each study, I simulated 500 variables on 100 samples, with every value drawn from N(0,1)
  • I created an outcome that was 0 for the first 50 samples, 1 for the last 50
  • I set a certain number of variables (between 5 and 50) to be associated with the outcome using the model Xi = b0i + b1iY + e (this is an important choice; more on it later in the post)
  • I tried different signal strengths for the truly predictive features
  • I generated two data sets (training and test) from the exact same model for each scenario
  • I fit the Lasso using the lars package, choosing the shrinkage parameter as the value that minimized the cross-validation MSE in the training set
  • I fit the Leekasso and the Lasso on the training sets and evaluated accuracy on the test sets. 
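
The data-generating part of this setup can be sketched as follows. This is an illustration in Python with NumPy, not the R code linked below. Several details are assumptions made for the sketch: the function name `simulate_study`, b0i = 0 for every variable, a single shared b1i (`signal`) for all real features, placing the real features in the first columns, and the seed.

```python
import numpy as np

def simulate_study(rng, n_samples=100, n_vars=500, n_signal=10, signal=1.0):
    """Simulate one data set from Xi = b0i + b1i*Y + e.

    Assumptions for this sketch: b0i = 0, b1i = `signal` for the first
    `n_signal` variables and 0 for the rest, and e ~ N(0,1).
    """
    # Outcome: 0 for the first half of the samples, 1 for the second half.
    y = np.repeat([0.0, 1.0], n_samples // 2)
    # Start from pure N(0,1) noise for every variable.
    X = rng.normal(size=(n_samples, n_vars))
    # Shift the truly associated variables by b1i * Y.
    X[:, :n_signal] += signal * y[:, None]
    return X, y

rng = np.random.default_rng(2012)
# Training and test sets come from the exact same model.
X_train, y_train = simulate_study(rng)
X_test, y_test = simulate_study(rng)
```

Varying `n_signal` between 5 and 50 and trying several values of `signal` would reproduce the grid of scenarios described in the list.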

The R code for this analysis is available here and the resulting data is here.

The results show that for all configurations, using the top 10 has a higher out-of-sample prediction accuracy than the lasso. A larger version of the plot is here.

Interestingly, this is true even when there are fewer than 10 real features in the data or when there are many more than 10 real features (remember, the Leekasso always picks 10).

Some thoughts on this analysis:

  1. This is only test-set prediction accuracy; it says nothing about selecting the “right” features for prediction.
  2. The Leekasso took about 0.03 seconds to fit and test per data set compared to about 5.61 seconds for the Lasso.
  3. The data generating model is the model underlying the top 10, so it isn’t surprising that the top 10 has higher performance. Note that I simulated from the model Xi = b0i + b1iY + e, which is the model commonly assumed in differential expression analysis (genomics) or voxel-wise analysis (fMRI). Alternatively, I could have simulated from the model Y = B0 + B1 X1 + B2 X2 + … + BM XM + E, where most of the coefficients are zero. In this case, the Lasso would outperform the top 10 (data not shown). This is a key, and possibly obvious, issue raised by this simulation. When doing prediction, differences in the true “causal” model matter a lot. So if we believe the “top 10 model” holds in many high-dimensional settings, then it may be the case that regularization approaches don’t work well for prediction, and vice versa.
  4. I think what may be happening is that the Lasso is overshrinking the parameter estimates; in other words, you accept too much bias in exchange for the reduction in variance. Alan Dabney and John Storey have a really nice paper discussing shrinkage in the context of genomic prediction that I think is related.
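
One way to see where shrinkage bias comes from is the textbook special case of orthonormal predictors (X'X = I) with the lasso written in penalized form, (1/2)·RSS + lam·Σ|Bj|. There the lasso estimate is just the least-squares estimate soft-thresholded at lam. The sketch below is Python with NumPy; the function name, `lam`, and the example coefficients are illustrative assumptions, not values from the simulation.

```python
import numpy as np

def soft_threshold(b, lam):
    """Lasso solution for orthonormal predictors: shrink each least-squares
    coefficient toward zero by lam, and set it to zero if |b| <= lam."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

# Hypothetical least-squares estimates: two strong signals, two weak ones.
ols = np.array([2.0, 0.5, -1.5, 0.1])
lam = 0.6

shrunk = soft_threshold(ols, lam)
print(shrunk)
```

Every coefficient that survives is pulled toward zero by the full lam, however strong its signal, which is the kind of bias the overshrinking explanation may be pointing to. The simulated predictors here are not orthonormal, so this is only an intuition, not a description of what the fitted Lasso did.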