A deterministic statistical machine


As Roger pointed out, the most recent batch of Y Combinator startups included a bunch of data-focused companies. One of these companies, StatWing, is a web-based tool for data analysis that looks like an improvement on SPSS, with more plain text, more visualization, and a lot of the technical statistical details “under the hood”. I first read about StatWing on TechCrunch, under the title “How Statwing Makes It Easier To Ask Questions About Data So You Don’t Have To Hire a Statistical Wizard”.

StatWing looks super user-friendly, and the idea of democratizing statistical analysis so more people can access these ideas is something that appeals to me. But, as one of the aforementioned statistical wizards, this had me freaked out for a minute. Once I looked at the software though, I realized it suffers from the same problem that most “user-friendly” statistical software suffers from. It makes it really easy to screw up a data analysis. It will tell you when something is significant, and if you don’t like that a result isn’t, you can keep slicing and dicing the data until it is. The key issue behind getting insight from data is knowing when you are fooling yourself with confounders, or small effect sizes, or overfitting. StatWing looks like an improvement on the UI experience of data analysis, but it won’t prevent the false positives that plague science and cost business big $$.

So I started thinking about what kind of software would prevent this sort of problem while still being accessible to a big audience. My idea is a “deterministic statistical machine” (DSM). Here is how it works: you input a data set and then specify the question you are asking (Is variable Y related to variable X? Can I predict Z from W?). Then, depending on your question, it uses a deterministic set of methods to analyze the data: say, regression for inference, linear discriminant analysis for prediction, etc. But the method is fixed and deterministic for each question. It also performs a pre-specified set of checks for outliers, confounders, missing data, maybe even data fudging. It generates a report with a markdown tool and then immediately publishes the result to figshare.
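To make the idea concrete, here is a minimal sketch of what the core of such a machine might look like. Everything in it is hypothetical: the names `run_dsm`, `METHODS`, `check_missing` and `check_outliers` are made up for illustration, only the "is Y related to X?" question is wired up (with simple least-squares regression), and the prediction path, the confounder checks, the markdown report and the figshare upload are all left out.

```python
import math
import statistics


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def check_missing(x, y):
    """Pre-specified check: count pairs where either value is missing."""
    return sum(1 for a, b in zip(x, y) if is_missing(a) or is_missing(b))


def check_outliers(values, threshold=3.0):
    """Pre-specified check: count values more than `threshold` SDs from the mean."""
    if len(values) < 2:
        return 0
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return 0
    return sum(1 for v in values if abs(v - mean) / sd > threshold)


def regress(x, y):
    """Ordinary least squares fit of y on x (slope and intercept only)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


# One fixed method per question type. The user picks the question,
# never the method. (Hypothetical table; only one question is wired up.)
METHODS = {"association": regress}


def run_dsm(question, x, y):
    """Run the fixed checks and the fixed method for `question`."""
    if question not in METHODS:
        raise ValueError(f"unsupported question: {question!r}")
    n_missing = check_missing(x, y)
    # Drop incomplete pairs in a fixed, documented way.
    pairs = [(a, b) for a, b in zip(x, y) if not (is_missing(a) or is_missing(b))]
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    method = METHODS[question]
    return {
        "question": question,
        "method": method.__name__,
        "n": len(pairs),
        "n_missing": n_missing,
        "n_outliers_y": check_outliers(ys),
        "result": method(xs, ys),
    }


report = run_dsm("association", [1, 2, 3, 4, None], [2, 4, 6, 8, 10])
```

The point of the lookup table is that the analyst has no knob to turn: the same question on the same data always runs the same checks and the same method, and the checks are reported whether or not the result is flattering.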

The advantage is that people can get their data-related questions answered using a standard tool. It does a lot of the “heavy lifting” in checking for potential problems and produces nice reports. But it is a deterministic algorithm for analysis so overfitting, fudging the analysis, etc. are harder. By publishing all reports to figshare, it makes it even harder to fudge the data. If you fiddle with the data to try to get a result you want, there will be a “multiple testing paper trail” following you around. 
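The paper trail could be sketched along the following lines. This is only an illustration under loud assumptions: an in-memory list stands in for publication to figshare (no real figshare API is used), and the `PaperTrail` class, its methods and the analyst identifier are all invented for the example.

```python
import hashlib
import json


class PaperTrail:
    """Append-only log of every analysis run (stand-in for public posting)."""

    def __init__(self):
        self._entries = []

    def record(self, analyst, question, data):
        """Log one analysis, fingerprinting the exact data that was used."""
        payload = json.dumps(data, sort_keys=True).encode("utf-8")
        entry = {
            "analyst": analyst,
            "question": question,
            "data_sha256": hashlib.sha256(payload).hexdigest(),
        }
        self._entries.append(entry)
        return entry

    def history(self, analyst):
        """Every analysis this analyst has run, in order."""
        return [e for e in self._entries if e["analyst"] == analyst]

    def distinct_datasets(self, analyst):
        """How many different versions of the data the analyst has tried."""
        return len({e["data_sha256"] for e in self.history(analyst)})


trail = PaperTrail()
trail.record("analyst-1", "association", {"x": [1, 2, 3, 4], "y": [2, 4, 6, 9]})
# Same question again after dropping an inconvenient point.
trail.record("analyst-1", "association", {"x": [1, 2, 3], "y": [2, 4, 6]})
```

Here `len(trail.history("analyst-1"))` gives the number of tests run, and `trail.distinct_datasets("analyst-1")` shows how many times the data changed between runs, which is exactly the information a reader would need to correct for multiple testing.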

The DSM should be a web service that is easy to use. Anybody want to build it? Any suggestions for how to do it better? 

  • http://twitter.com/cciotti chris ciotti

    It sounds like an interesting approach to the problem. The software poses an interesting challenge; sounds like a good graduate project.

  • cabaskett

    I really like this idea! I'm a grad student in ecology and evolution, and I really struggle with statistics, especially how it seems like more of an art than a science sometimes. I would so much rather have a machine give me an answer that I can't mess with than be able to fiddle around under the hood. I just want my car to run correctly, I don't want to hot-rod it. I would just be concerned with the problem you mention: that user-friendly software makes it easy to screw up the analysis!

  • Carlos

    It sounds like a hybrid between Tableau and Rattle.

  • Vassil

    I think this is a great idea! In fact, something like this should be a rule for biomedical scientists. The Duke University saga is but one case that was brought to daylight, and there are many, many more instances of negligence and fraud.