Jeff’s post about the deterministic statistical machine got me thinking a bit about the cost of data analysis. The cost of data analysis these days is in many ways going up. The data being collected are getting bigger and more complex. Analyzing these data requires more expertise, more storage hardware, and more computing power. In fact the analysis in some fields like genomics is now more expensive than the collection of the data [There’s a graph that shows this but I can’t seem to find it anywhere; I’ll keep looking and post later. For now see here.].
However, that’s really about the dollars and cents kind of cost. The cost of data analysis has gone very far down in a different sense. For the vast majority of applications that look at moderate to large datasets, many, many statistical analyses can be conducted essentially at the push of a button. And so there’s no cost in continuing to analyze data until a desirable result is obtained. Correcting for multiple testing is one way to “fix” this problem. But I personally don’t find multiple testing corrections to be all that helpful because ultimately they still try to boil down a complex analysis into a simple yes/no answer.
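To put a rough number on how cheap it is to keep pushing the button, here is a minimal sketch. It assumes every analysis is an independent test at the 0.05 level and that there is no real effect to find; both are simplifying assumptions of mine, and the function name is made up for illustration.

```python
def chance_of_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_tests` independent tests of a
    true null hypothesis comes out "significant" at level `alpha`."""
    return 1.0 - (1.0 - alpha) ** num_tests


# How quickly a "desirable result" shows up by chance alone.
for k in (1, 5, 20, 100):
    print(f"{k:>3} analyses: {chance_of_false_positive(k):.2f}")
```

Under those assumptions, 20 analyses already give about a 64% chance of at least one spurious “significant” result. Real analyses of the same data are rarely independent, so treat this as an illustration of the direction, not a precise rate.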
In the old days (for example when Rafa was in grad school), computing time was precious and things had to be planned out carefully, starting with the planning of the experiment and continuing with the data collection and the analysis. In fact, much of current statistical education is still geared around the idea that computing is expensive, which is why we use things like asymptotic theorems and approximations even when we don’t really have to. Nowadays, there’s a bit of a “we’ll fix it in post” mentality, which values collecting as much data as possible when given the chance and figuring out what to do with it later. This kind of thinking can lead to (1) small big data problems; (2) poorly designed studies; (3) data that don’t really address the question of interest to everyone.
What if the cost of data analysis were not paid in dollars but were paid in some general unit of credibility? For example, Jeff’s hypothetical machine would do some of this. As he puts it:
“By publishing all reports to figshare, it makes it even harder to fudge the data. If you fiddle with the data to try to get a result you want, there will be a ‘multiple testing paper trail’ following you around.”
So with each additional analysis of the data, you get an additional piece of paper added to your analysis paper trail. People can look at the analysis paper trail and make of it what they will. Maybe they don’t care. Maybe having a ton of analyses discredits the final results. The point is that it’s there for all to see.
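One way such a paper trail might be kept is as an append-only log, where every analysis adds an entry tied to the one before it. The sketch below is only an illustration: the `PaperTrail` class, its field names, and the hash-chaining idea are my assumptions, not anything figshare or Jeff’s machine actually provides.

```python
import hashlib
import json


class PaperTrail:
    """Append-only log of analyses; each entry is chained to the previous one."""

    def __init__(self):
        self.entries = []

    @staticmethod
    def _hash(entry: dict) -> str:
        # Hash everything except the entry's own hash field.
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def record(self, data: bytes, description: str) -> dict:
        """Add one piece of paper to the trail for an analysis of `data`."""
        previous = self.entries[-1]["entry_hash"] if self.entries else ""
        entry = {
            "index": len(self.entries) + 1,
            "description": description,
            "data_hash": hashlib.sha256(data).hexdigest(),
            "previous_hash": previous,
        }
        entry["entry_hash"] = self._hash(entry)
        self.entries.append(entry)
        return entry

    def is_intact(self) -> bool:
        """Return False if an entry was edited or removed from the chain."""
        previous = ""
        for entry in self.entries:
            if entry["previous_hash"] != previous:
                return False
            if entry["entry_hash"] != self._hash(entry):
                return False
            previous = entry["entry_hash"]
        return True


trail = PaperTrail()
data = b"x,y\n1,2\n2,4\n3,7\n"
trail.record(data, "regress y on x")
trail.record(data, "regress y on x, dropping row 3")
print(len(trail.entries), trail.is_intact())
```

Anyone reading the final result can then see how many analyses came before it. A log like this only makes tampering detectable if it is published somewhere the analyst cannot quietly rewrite; that part is outside the sketch.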
I do not think what we need is better methods to deal with multiple testing. This is simply not a statistical issue. What we need is a way to increase the cost of data analysis by preserving the paper trail, so that people hesitate before they run all pairwise combinations of whatever. Reproducible research doesn’t really deal with this problem because reproducibility only really requires that the final analysis is documented.
In other words, let the paper trail be the price of pushing the button.
As Roger pointed out, the most recent batch of Y Combinator startups included a bunch of data-focused companies. One of these companies, StatWing, is a web-based tool for data analysis that looks like an improvement on SPSS with more plain text, more visualization, and a lot of the technical statistical details “under the hood”. I first read about StatWing on TechCrunch, under the title “How Statwing Makes It Easier To Ask Questions About Data So You Don’t Have To Hire a Statistical Wizard”.
StatWing looks super user-friendly and the idea of democratizing statistical analysis so more people can access these ideas is something that appeals to me. But, as one of the aforementioned statistical wizards, this had me freaked out for a minute. Once I looked at the software though, I realized it suffers from the same problem that most “user-friendly” statistical software suffers from. It makes it really easy to screw up a data analysis. It will tell you when something is significant and, if you don’t like that it isn’t, you can keep slicing and dicing the data until it is. The key issue behind getting insight from data is knowing when you are fooling yourself with confounders, or small effect sizes, or overfitting. StatWing looks like an improvement on the UI experience of data analysis, but it won’t prevent the false positives that plague science and cost business big $$.
So I started thinking about what kind of software would prevent these sorts of problems while still being accessible to a big audience. My idea is a “deterministic statistical machine”. Here is how it works: you input a data set and then specify the question you are asking (is variable Y related to variable X? can I predict Z from W?). Then, depending on your question, it uses a deterministic set of methods to analyze the data. Say regression for inference, linear discriminant analysis for prediction, etc. But the method is fixed and deterministic for each question. It also performs a pre-specified set of checks for outliers, confounders, missing data, maybe even data fudging. It generates a report with a markdown tool and then immediately publishes the result to figshare.
The advantage is that people can get their data-related questions answered using a standard tool. It does a lot of the “heavy lifting” in checking for potential problems and produces nice reports. But it is a deterministic algorithm for analysis so overfitting, fudging the analysis, etc. are harder. By publishing all reports to figshare, it makes it even harder to fudge the data. If you fiddle with the data to try to get a result you want, there will be a “multiple testing paper trail” following you around.
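To make the idea a little more concrete, here is a toy sketch of the core loop: one question maps to exactly one method, a fixed set of checks always runs, and a plain-text report comes out. Everything in it is a hypothetical stand-in (the function names, the single supported question, the three-standard-deviation outlier rule), and the publishing step is left out entirely since I am not assuming anything about figshare’s interface.

```python
from statistics import mean, stdev


def drop_missing(x, y):
    """Fixed check 1: drop pairs with a missing value and count them."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    return pairs, len(x) - len(pairs)


def count_outliers(values, threshold=3.0):
    """Fixed check 2: count values more than `threshold` SDs from the mean."""
    m, s = mean(values), stdev(values)
    if s == 0:
        return 0
    return sum(1 for v in values if abs(v - m) / s > threshold)


def simple_regression(pairs):
    """The one fixed method for "is Y related to X?": a least-squares line."""
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    mx, my = mean(xs), mean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    if sxx == 0:
        raise ValueError("X has no variation; cannot fit a line")
    slope = sum((a - mx) * (b - my) for a, b in pairs) / sxx
    return {"slope": slope, "intercept": my - slope * mx}


# One question, one method. The user cannot swap in a different one.
METHODS = {"is Y related to X?": simple_regression}


def run_dsm(question, x, y):
    """Run the fixed checks and the fixed method, and return a text report."""
    if question not in METHODS:
        raise ValueError(f"unsupported question: {question!r}")
    pairs, n_missing = drop_missing(x, y)
    if len(pairs) < 3:
        raise ValueError("need at least 3 complete rows")
    result = METHODS[question](pairs)
    return "\n".join([
        f"Question: {question}",
        f"Rows used: {len(pairs)} (dropped {n_missing} with missing values)",
        f"Possible outliers in Y: {count_outliers([b for _, b in pairs])}",
        f"Slope: {result['slope']:.3f}",
        f"Intercept: {result['intercept']:.3f}",
    ])


print(run_dsm("is Y related to X?", [1, 2, 3, 4, None], [2.1, 3.9, 6.2, 7.8, 9.0]))
```

The point of the lookup table is that the analyst chooses the question, not the method. A real version would need far more careful checks (confounders in particular are not something a z-score rule can detect) and an actual inference step, not just a fitted line.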
The DSM should be a web service that is easy to use. Anybody want to build it? Any suggestions for how to do it better?