Simply Statistics


Statistics project ideas for students

Here are a few ideas that might make for interesting student projects at all levels (from high-school to graduate school). I’d welcome ideas/suggestions/additions to the list as well. All of these ideas depend on free or scraped data, which means that anyone can work on them. I’ve given a ballpark difficulty for each project to give people some idea.

Happy data crunching!

Data Collection/Synthesis

  1. Creating a webpage that explains conceptual statistical issues like randomization, margin of error, overfitting, cross-validation, concepts in data visualization, sampling. The webpage should not use any math at all and should explain the concepts so a general audience could understand. Bonus points if you make short 30-second animated YouTube clips that explain the concepts. (Difficulty: Lowish; Effort: Highish)
  2. Building an aggregator for statistics papers across disciplines that can be the central resource for statisticians. Journals ranging from PLoS Genetics to Neuroimage now routinely publish statistical papers. But there is no one central resource that aggregates all the statistics papers published across disciplines. Such a resource would be hugely useful to statisticians. You could build it using blogging software like WordPress so that articles could be tagged and readers could add the resource to their RSS reader. (Difficulty: Lowish; Effort: Mediumish)

Data Analyses

  1. Scrape the LivingSocial/Groupon sites for the daily deals and develop a prediction of how successful the deal will be based on location/price/type of deal. You could use either the RCurl R package or the XML R package to scrape the data. (Difficulty: Mediumish; Effort: Mediumish)
  2. You could use the data from your city (here are a few cities with open data) to: (a) identify the best and worst neighborhoods to live in based on different metrics like how many parks are within walking distance, crime statistics, etc. (b) identify concrete measures your city could take to improve different quality of life metrics like those described above - say where should the city put a park, or (c) see if you can predict when/where crimes will occur (like these guys did). (Difficulty: Mediumish; Effort: Highish)
  3. Download data on state of the union speeches from here and use the tm package in R to analyze the patterns of word use over time (Difficulty: Lowish; Effort: Lowish)
  4. Use this data set from Donors Choose to determine the characteristics that make the funding of projects more likely. You could send your results to the Donors Choose folks to help them improve the funding rate for their projects. (Difficulty: Mediumish; Effort: Mediumish)
  5. Which basketball player would you want on your team? Here is a really simple analysis done by Rafa. But it doesn’t take into account things like defense. If you want to take on this project, you should take a look at this Dennis Rodman analysis which is the gold standard. (Difficulty: Mediumish; Effort: Highish)

Data visualization

  1. Creating an R package that wraps the svgAnnotation package. This package can be used to create dynamic graphics in R, but is still a bit too flexible for most people to use. Writing some wrapper functions that simplify the interface would be potentially high impact. Maybe something like svgPlot() to create simple, dynamic graphics with only a few options. (Difficulty: Mediumish; Effort: Mediumish)
  2. The same as project 1 but for D3.js. The impact could potentially be a bit higher, since the graphics are a bit more professional, but the level of difficulty and effort would also both be higher. (Difficulty: Highish; Effort: Highish)

Graham & Dodd's Security Analysis: Moneyball for...Money

The last time I posted something about finance I got schooled by people who actually know stuff. So let me just say that I don’t claim to be an expert in this area, but I do have an interest in it and try to keep up the best I can.

One book I picked up a little while ago was Security Analysis by Benjamin Graham and David Dodd. This is the “bible of value investing” and so I mostly wanted to see what all the hubbub was about. In my mind, the hubbub is well-deserved. Given that it was originally written in 1934, the book has stood the test of time (the book has been updated a number of times since then). It’s quite readable and, I guess, still relevant to modern-day investing. In the 6th edition the out-of-date stuff has been relegated to an appendix. It also contains little essays (of varying quality) by modern-day value investing heroes like Seth Klarman and Glenn Greenberg. It’s a heavy book though and I’m wishing I’d got it on the Kindle.

It occurred to me that with all the interest in data and analytics today, Security Analysis reads a lot like the Moneyball of investing. The two books make the same general point: find things that are underpriced/underappreciated and buy them when no one’s looking. Then profit!

One of the basic points made early on is that roughly speaking, you can’t judge a security by its cover. You need to look at the data. How novel! For example, at the time bonds were considered safe because they were bonds, while stocks (equity) were considered risky because they were stocks. There are technical reasons why this is true, but a careful look at the data might reveal that the bonds of one company are risky while the stock is safe, depending on the price at which they are trading. The question to ask for either type of security is what’s the chance of losing money? In order to answer that question you need to estimate the intrinsic value of the company. For that, you need data.

The functions of security analysis may be described under three headings: descriptive, selective, and critical. In its more obvious form, descriptive analysis consists of marshalling the important facts relating to the issue [security] and presenting them in a coherent, readily intelligible manner…. A more penetrating type of description seeks to reveal the strong and weak points in the position of an issue, compare its exhibit with that of others of similar character, and appraise the factors which are likely to influence its future performance. Analysis of this kind is applicable to almost every corporate issue, and it may be regarded as an adjunct not only to investment but also to intelligent speculation in that it provides an organized factual basis for the application of judgment.

Back in Graham & Dodd’s day it must have been quite a bit harder to get the data. Many financial reports that are routinely published today by public companies were not available back then. Today, we are awash in easily accessible financial data and, one might argue as a result of that, there are fewer opportunities to make money. 


Prediction: the Lasso vs. just using the top 10 predictors

One incredibly popular tool for the analysis of high-dimensional data is the lasso. The lasso is commonly used in cases when you have many more predictors than independent samples (the n ≪ p problem). It is also often used in the context of prediction.

Suppose you have an outcome Y and several predictors X1, …, XM. The lasso fits the model:

Y = B0 + B1 X1 + B2 X2 + … + BM XM + E

subject to a constraint on the sum of the absolute value of the B coefficients. The result is that: (1) some of the coefficients get set to zero, and those variables drop out of the model, (2) other coefficients are “shrunk” toward zero. Dropping some variables is good because there are a lot of potentially unimportant variables. Shrinking coefficients may be good, since the big coefficients might be just the ones that were really big by random chance (this is related to Andrew Gelman’s type M errors). 
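For readers who prefer notation, the constrained fit described above can be written out as follows. This is the standard textbook form of the lasso rather than anything specific to this post; the sample size n, the constraint bound t, and the penalty weight λ are symbols introduced here for illustration.

```latex
\[
\hat{B} \;=\; \arg\min_{B_0, B_1, \ldots, B_M}
  \sum_{i=1}^{n} \Bigl( Y_i - B_0 - \sum_{j=1}^{M} B_j X_{ij} \Bigr)^2
\quad \text{subject to} \quad \sum_{j=1}^{M} |B_j| \le t
\]

% Equivalent penalized form: each bound t corresponds to some penalty weight \lambda \ge 0.
\[
\hat{B} \;=\; \arg\min_{B_0, B_1, \ldots, B_M}
  \sum_{i=1}^{n} \Bigl( Y_i - B_0 - \sum_{j=1}^{M} B_j X_{ij} \Bigr)^2
  \;+\; \lambda \sum_{j=1}^{M} |B_j|
\]
```

A smaller t (or a larger λ) sets more coefficients to exactly zero and shrinks the survivors harder. The intercept B0 is left out of the constraint.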

I work in genomics, where n ≪ p problems come up all the time. Whenever I use the lasso or when I read papers where the lasso is used for prediction, I always think: “How does this compare to just using the top 10 most significant predictors?” I have asked this out loud enough that some people around here started calling it the “Leekasso” to poke fun at me. So I’m going to call it that in a thinly veiled attempt to avoid Stigler’s law of eponymy (actually Rafa points out that using this name is a perfect example of this law, since this feature selection approach has been proposed before at least once).

Here is how the Leekasso works. You fit each of the models:

Y = B0 + BkXk + E

take the 10 variables with the smallest p-values from testing the Bk coefficients, then fit a linear model with just those 10 variables. You never use 9 or 11; the Leekasso is always 10.
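The recipe above is short enough to sketch in code. What follows is a minimal standard-library Python sketch, not the R code used for the experiment. All function names and the toy data are made up for illustration. It ranks variables by the absolute t-statistic of Bk instead of computing p-values; because every one-variable model has the same degrees of freedom, the two rankings are the same.

```python
import math
import random


def abs_t_statistic(x, y):
    """|t| for the slope Bk in the one-variable model Y = B0 + Bk*Xk + E."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return abs(slope) / std_err


def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    k = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * k
    for r in reversed(range(k)):
        tail = sum(m[r][c] * beta[c] for c in range(r + 1, k))
        beta[r] = (m[r][k] - tail) / m[r][r]
    return beta


def leekasso_fit(columns, y, top=10):
    """columns[j] holds the n values of predictor j; returns (indices, coefficients)."""
    # Same degrees of freedom for every test, so largest |t| = smallest p-value.
    ranked = sorted(range(len(columns)),
                    key=lambda j: abs_t_statistic(columns[j], y),
                    reverse=True)
    chosen = sorted(ranked[:top])
    # Least squares on the chosen variables via the normal equations.
    design = [[1.0] + [columns[j][i] for j in chosen] for i in range(len(y))]
    k = len(chosen) + 1
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)]
           for p in range(k)]
    xty = [sum(row[p] * yi for row, yi in zip(design, y)) for p in range(k)]
    return chosen, solve_linear_system(xtx, xty)


def leekasso_predict(chosen, beta, columns, i):
    """Fitted value for sample i from the 10-variable linear model."""
    return beta[0] + sum(b * columns[j][i] for b, j in zip(beta[1:], chosen))


# Toy data (assumed for illustration): 100 samples, 200 predictors,
# with the first 5 shifted by 2 when the outcome is 1.
rng = random.Random(1)
y = [0] * 50 + [1] * 50
columns = [[(2.0 * yi if j < 5 else 0.0) + rng.gauss(0, 1) for yi in y]
           for j in range(200)]
chosen, beta = leekasso_fit(columns, y)
calls = [int(leekasso_predict(chosen, beta, columns, i) > 0.5)
         for i in range(len(y))]
accuracy = sum(c == yi for c, yi in zip(calls, y)) / len(y)
print(chosen, accuracy)
```

The accuracy printed here is on the training data, so it is optimistic. Thresholding the fitted value at 0.5 to call a 0/1 outcome is a choice made for this sketch.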

For fun I did an experiment to compare the accuracy of the Leekasso and the Lasso.

Here is the setup:

  • I simulated 500 variables and 100 samples for each study, each N(0,1)
  • I created an outcome that was 0 for the first 50 samples, 1 for the last 50
  • I set a certain number of variables (between 5 and 50) to be associated with the outcome using the model Xi = b0i + b1iY + e (this is an important choice, more later in the post) 
  • I tried different levels of signal to the truly predictive features
  • I generated two data sets (training and test) from the exact same model for each scenario
  • I fit the Lasso using the lars package, choosing the shrinkage parameter as the value that minimized the cross-validation MSE in the training set
  • I fit the Leekasso and the Lasso on the training sets and evaluated accuracy on the test sets. 
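The data-generating part of this setup might look roughly like the sketch below, written in standard-library Python rather than the R used for the actual experiment. Several details are assumptions made here because the list does not pin them down: b0i is taken to be 0, the predictive variables are simply the first few columns, and every predictive variable gets the same b1i (the `signal` argument). The function names are hypothetical.

```python
import random


def simulate_study(rng, n_samples=100, n_vars=500, n_signal=10, signal=1.0):
    """Draw one data set from Xi = b0i + b1i*Y + e with e ~ N(0, 1).

    Assumptions for this sketch: b0i = 0 for every variable, b1i = signal
    for the first n_signal variables, and b1i = 0 for the rest.
    """
    half = n_samples // 2
    y = [0] * half + [1] * (n_samples - half)  # 0 for the first half, 1 for the rest
    columns = []
    for j in range(n_vars):
        b1 = signal if j < n_signal else 0.0
        columns.append([b1 * yi + rng.gauss(0.0, 1.0) for yi in y])
    return columns, y


def group_mean_difference(x, y):
    """Mean of x where y == 1 minus mean of x where y == 0."""
    ones = [xi for xi, yi in zip(x, y) if yi == 1]
    zeros = [xi for xi, yi in zip(x, y) if yi == 0]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)


# Training and test sets come from the exact same model.
rng = random.Random(2012)
train_x, train_y = simulate_study(rng, signal=2.0)
test_x, test_y = simulate_study(rng, signal=2.0)

print(round(group_mean_difference(train_x[0], train_y), 2))    # a signal variable
print(round(group_mean_difference(train_x[499], train_y), 2))  # a noise variable
```

Each method would then be fit on `train_x`/`train_y` and scored on `test_x`/`test_y`, repeated over different values of `n_signal` and `signal`.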

The R code for this analysis is available here and the resulting data is here.

The results show that for all configurations, using the top 10 has a higher out of sample prediction accuracy than the lasso. A larger version of the plot is here.

Interestingly, this is true even when there are fewer than 10 real features in the data or when there are many more than 10 real features (remember the Leekasso always picks 10).

Some thoughts on this analysis:

  1. This is only test-set prediction accuracy; it says nothing about selecting the “right” features for prediction.
  2. The Leekasso took about 0.03 seconds to fit and test per data set compared to about 5.61 seconds for the Lasso.
  3. The data generating model is the model underlying the top 10, so it isn’t surprising it has higher performance. Note that I simulated from the model: Xi = b0i + b1iY + e. This is the model commonly assumed in differential expression analysis (genomics) or voxel-wise analysis (fMRI). Alternatively I could have simulated from the model: Y = B0 + B1 X1 + B2 X2 + … + BM XM + E, where most of the coefficients are zero. In this case, the Lasso would outperform the top 10 (data not shown). This is a key, and possibly obvious, issue raised by this simulation. When doing prediction, differences in the true “causal” model matter a lot. So if we believe the “top 10 model” holds in many high-dimensional settings, then it may be the case that regularization approaches don’t work well for prediction and vice versa.
  4. I think what may be happening is that the Lasso is overshrinking the parameter estimates; in other words, you take on too much bias in exchange for the reduction in variance. Alan Dabney and John Storey have a really nice paper discussing shrinkage in the context of genomic prediction that I think is related.
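One way to see the overshrinking intuition in point 4 is the special case of an orthonormal design. If the lasso is written as minimizing half the residual sum of squares plus λ times the sum of |Bj|, its solution there is the least-squares coefficient soft-thresholded by λ. The simulation above is not orthonormal, so this is only an illustration of the mechanism. The function name is made up for the sketch.

```python
def soft_threshold(ols_estimate, penalty):
    """Lasso coefficient under an orthonormal design.

    Coefficients smaller in magnitude than the penalty are set to zero;
    all others are pulled toward zero by exactly the penalty.
    """
    excess = abs(ols_estimate) - penalty
    if excess <= 0:
        return 0.0
    return excess if ols_estimate > 0 else -excess


for b in [0.3, -0.8, 1.5, 4.0]:
    print(b, soft_threshold(b, 1.0))
```

A penalty large enough to zero out the noise coefficients also takes the same amount off the large, real ones. That is the bias being traded for lower variance.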


Professional statisticians agree: the Knicks should start Steve Novak over Carmelo Anthony

A week ago, Nate Silver tweeted this:

Since Lin became starting PG, Knicks have outscored opponents by 63 with Novak on the floor. Been outscored by 8 when he isn’t.

In a previous post we showed the plot below. Note that Carmelo Anthony is in ball hog territory. Novak plays the same position as Anthony but is a three point specialist. His career three point FG% of 42% (253-603) puts him 10th all time! It seems that with Lin in the lineup he is getting more open shots and helping his team. Should the Knicks start Novak? 

Hat tip to David Santiago.