Here are a few ideas that might make for interesting student projects at all levels (from high-school to graduate school). I’d welcome ideas/suggestions/additions to the list as well. All of these ideas depend on free or scraped data, which means that anyone can work on them. I’ve given a ballpark difficulty for each project to give people some idea.
Happy data crunching!
The last time I posted something about finance I got schooled by people who actually know stuff. So let me just say that I don’t claim to be an expert in this area, but I do have an interest in it and try to keep up the best I can.
One book I picked up a little while ago was Security Analysis by Benjamin Graham and David Dodd. This is the “bible of value investing” and so I mostly wanted to see what all the hubbub was about. In my mind, the hubbub is well-deserved. Given that it was originally written in 1934, the book has stood the test of time (the book has been updated a number of times since then). It’s quite readable and, I guess, still relevant to modern-day investing. In the 6th edition the out-of-date stuff has been relegated to an appendix. It also contains little essays (of varying quality) by modern-day value investing heros like Seth Klarman and Glenn Greenberg. It’s a heavy book though and I’m wishing I’d got it on the Kindle.
It occurred to me that with all the interest in data and analytics today, Security Analysis reads a lot like the Moneyball of investing. The two books make the same general point: find things that are underpriced/underappreciated and buy them when no one’s looking. Then profit!
One of the basic points made early on is that roughly speaking, you can’t judge a security by its cover. You need to look at the data. How novel! For example, at the time bonds were considered safe because they were bonds, while stocks (equity) were considered risky because they were stocks. There are technical reasons why this is true, but a careful look at the data might reveal that the bonds of one company are risky while the stock is safe, depending on the price at which they are trading. The question to ask for either type of security is what’s the chance of losing money? In order to answer that question you need to estimate the intrinsic value of the company. For that, you need data.
The functions of security analysis may be described under three headings: descriptive, selective, and critical. In its more obvious form, descriptive analysis consists of marshalling the important facts relating to the issue [security] and presenting them in a coherent, readily intelligible manner…. A more penetrating type of description seeks to reveal the strong and weak points in the position of an issue, compare its exhibit with that of others of similar character, and appraise the factors which are likely to influence its future performance. Analysis of this kind is applicable to almost every corporate issue, and it may be regarded as an adjunct not only to investment but also to intelligent speculation in that it provides an organized factual basis for the application of judgment.
Back in Graham & Dodd’s day it must have been quite a bit harder to get the data. Many financial reports that are routinely published today by public companies were not available back then. Today, we are awash in easily accessible financial data and, one might argue as a result of that, there are fewer opportunities to make money.
One incredibly popular tool for the analysis of high-dimensional data is the lasso. The lasso is commonly used in cases when you have many more predictors than independent samples (the n « p) problem. It is also often used in the context of prediction.
Suppose you have an outcome Y and several predictors X1,…,XM, the lasso fits a model:
Y = B0 + B1 X1 + B2 X2 + … + BM XM + E
subject to a constraint on the sum of the absolute value of the B coefficients. The result is that: (1) some of the coefficients get set to zero, and those variables drop out of the model, (2) other coefficients are “shrunk” toward zero. Dropping some variables is good because there are a lot of potentially unimportant variables. Shrinking coefficients may be good, since the big coefficients might be just the ones that were really big by random chance (this is related to Andrew Gelman’s type M errors).
I work in genomics, where n«p problems come up all the time. Whenever I use the lasso or when I read papers where the lasso is used for prediction, I always think: “How does this compare to just using the top 10 most significant predictors?” I have asked this out loud enough that some people around here started calling it the “Leekasso” to poke fun at me. So I’m going to call it that in a thinly veiled attempt to avoid Stigler’s law of eponymy (actually Rafa points out that using this name is a perfect example of this law, since this feature selection approach has been proposed before at least once).
Here is how the Leekasso works. You fit each of the models:
Y = B0 + BkXk + E
take the 10 variables with the smallest p-values from testing the Bk coefficients, then fit a linear model with just those 10 coefficients. You never use 9 or 11, the Leekasso is always 10.
For fun I did an experiment to compare the accuracy of the Leekasso and the Lasso.
Here is the setup:
The results show that for all configurations, using the top 10 has a higher out of sample prediction accuracy than the lasso. A larger version of the plot is here.
Interestingly, this is true even when there are fewer than 10 real features in the data or when there are many more than 10 real features ((remember the Leekasso always picks 10).
Some thoughts on this analysis:
A week ago, Nate Silver tweeted this:
Since Lin became starting PG, Knicks have outscored opponents by 63 with Novak on the floor. Been outscored by 8 when he isn’t.
In a previous post we showed the plot below. Note that Carmelo Anthony is in ball hog territory. Novak plays the same position as Anthony but is a three point specialist. His career three point FG% of 42% (253-603) puts him 10th all time! It seems that with Lin in the lineup he is getting more open shots and helping his team. Should the Knicks start Novak?
Hat tip to David Santiago.