The other day Brian was at a National Academies meeting and he gave one of his usual classic quotes:

Best quote from NAS DS Round Table: "I mean, do we need deep learning to analyze 30 subjects?" - B Caffo @simplystats #datascienceinreallife (CMU Stats, @CMU_Stats, May 1, 2017)

When I saw that quote I was reminded of the blog post "Don’t use Hadoop - your data isn’t that big."
This is a great article about the illusion of progress in machine learning. In part, I think it explains why the Leekasso (just using the top 10 predictors) isn’t a totally silly idea. I also love how he talks about sources of uncertainty in real prediction problems that aren’t part of the classical models used when developing prediction algorithms. I think that this is a hugely underrated component of building an accurate classifier - just finding the quirks particular to a type of data.
One incredibly popular tool for the analysis of high-dimensional data is the lasso. The lasso is commonly used in cases when you have many more predictors than independent samples (the n ≪ p problem). It is also often used in the context of prediction. Suppose you have an outcome Y and several predictors X1,…,XM. The lasso fits a model:

Y = B0 + B1 X1 + B2 X2 + … + BM XM + E
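
To make the model above concrete, here is a minimal sketch of fitting it in plain Python. It assumes the standard lasso definition, which is not spelled out in the text above: the coefficients minimize the average squared error plus a penalty lam times the sum of |Bj|, with the intercept B0 left unpenalized. The fitting method (coordinate descent with soft-thresholding), the function names, the simulated data, and the value of lam are all choices made for illustration, not anything taken from this post.

```python
import random


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/(2n)) * sum((y - B0 - X.B)^2) + lam * sum(|Bj|).

    X is a list of n rows with m predictors each and y is a list of n outcomes.
    The intercept B0 is not penalized. Returns (B0, [B1, ..., BM]).
    """
    n, m = len(X), len(X[0])
    b0 = sum(y) / n
    b = [0.0] * m
    resid = [yi - b0 for yi in y]  # current residuals y - B0 - X.B
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(m)]

    for _ in range(n_iter):
        # Intercept update: absorb the mean of the residuals.
        shift = sum(resid) / n
        b0 += shift
        resid = [r - shift for r in resid]

        for j in range(m):
            if col_sq[j] == 0.0:
                continue  # an all-zero predictor carries no information
            # Correlation of predictor j with the residual that excludes Bj.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * b[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                b[j] = new_bj
    return b0, b


# Simulated example (assumed setup): 30 samples, 50 predictors, and only the
# first two predictors truly affect the outcome.
random.seed(0)
n_samples, n_predictors = 30, 50
X = [[random.gauss(0, 1) for _ in range(n_predictors)] for _ in range(n_samples)]
y = [1.0 + 3.0 * row[0] - 2.0 * row[1] + random.gauss(0, 0.5) for row in X]

intercept, coefs = lasso_coordinate_descent(X, y, lam=0.3)  # lam picked by hand
nonzero = [j for j, bj in enumerate(coefs) if bj != 0.0]
print("intercept:", round(intercept, 3))
print("nonzero coefficients:", len(nonzero), "of", n_predictors)
print("first three coefficients:", [round(bj, 3) for bj in coefs[:3]])
```

The soft-thresholding step is what sets some coefficients exactly to zero, which is why the lasso remains usable when there are more predictors than samples. In practice you would probably reach for an established implementation and choose lam by cross-validation rather than by hand.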