Simply Statistics

Increasing the cost of data analysis

Jeff’s post about the deterministic statistical machine got me thinking a bit about the cost of data analysis. The cost of data analysis is in many ways going up these days. The data being collected are getting bigger and more complex, and analyzing them requires more expertise, more storage hardware, and more computing power. In fact, in some fields like genomics, the analysis is now more expensive than the collection of the data [There’s a graph that shows this but I can’t seem to find it anywhere; I’ll keep looking and post later. For now see here.].

However, that’s really the dollars-and-cents kind of cost. In a different sense, the cost of data analysis has gone way down. For the vast majority of applications that look at moderate to large datasets, many, many statistical analyses can be conducted essentially at the push of a button. And so there’s no cost to continuing to analyze the data until a desirable result is obtained. Correcting for multiple testing is one way to “fix” this problem, but I personally don’t find multiple testing corrections all that helpful because ultimately they still try to boil down a complex analysis into a simple yes/no answer.
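
To put a number on “no cost”: here is a minimal simulation sketch (Python; my illustration, not from the post) of what happens when you keep pushing the button on pure noise. With k independent tests at level 0.05, the chance of at least one “significant” result is 1 − 0.95^k.

```python
import random

# With k independent tests at level alpha on pure noise, the chance of at
# least one false positive is 1 - (1 - alpha)^k. Check it by simulation.
alpha, k, n_sims = 0.05, 20, 10_000

hits = sum(
    any(random.random() < alpha for _ in range(k))
    for _ in range(n_sims)
)

print(f"analytic:  {1 - (1 - alpha) ** k:.3f}")   # ~0.642 for k = 20
print(f"simulated: {hits / n_sims:.3f}")
```

Twenty free re-analyses of nothing at all give you a better-than-even chance of something to report.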

In the old days (for example, when Rafa was in grad school), computing time was precious and things had to be planned out carefully, starting with the design of the experiment and continuing through the data collection and the analysis. In fact, much of current statistical education is still geared around the idea that computing is expensive, which is why we use things like asymptotic theorems and approximations even when we don’t really have to. Nowadays there’s a bit of a “we’ll fix it in post” mentality, which values collecting as much data as possible when given the chance and figuring out what to do with it later. This kind of thinking can lead to (1) small “big data” problems; (2) poorly designed studies; and (3) data that don’t really address the question everyone is interested in.
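
As a small illustration of the “we don’t really have to” point, here is a sketch (Python; simulated data, purely illustrative) of a percentile bootstrap confidence interval, which trades an asymptotic formula for cheap brute-force resampling:

```python
import random

random.seed(1)
data = [random.expovariate(1.0) for _ in range(50)]  # a skewed sample

def mean(xs):
    return sum(xs) / len(xs)

# Percentile bootstrap: resample with replacement, recompute the statistic,
# and read the interval off the empirical distribution. No normal
# approximation required, just computing time, which is now nearly free.
B = 5000
boot = sorted(mean(random.choices(data, k=len(data))) for _ in range(B))
lo, hi = boot[int(0.025 * B)], boot[int(0.975 * B)]
print(f"mean = {mean(data):.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")
```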

What if the cost of data analysis were paid not in dollars but in some general unit of credibility? For example, Jeff’s hypothetical machine would do some of this:

> Publishing all reports to figshare makes it even harder to fudge the data. If you fiddle with the data to try to get a result you want, there will be a “multiple testing paper trail” following you around.

So with each additional analysis of the data, you get an additional piece of paper added to your analysis paper trail. People can look at the analysis paper trail and make of it what they will. Maybe they don’t care. Maybe having a ton of analyses discredits the final results. The point is that it’s there for all to see.
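
To make the idea concrete, here is one hypothetical way a paper trail could work (a sketch only; the post doesn’t prescribe an implementation): an append-only log that every analysis must write to before it returns results.

```python
import hashlib
import json
import time

TRAIL = "analysis_trail.jsonl"  # hypothetical append-only log file

def run_with_trail(dataset_name, question, analysis_fn, data):
    """Run an analysis, but only after logging it to the paper trail."""
    entry = {
        "time": time.time(),
        "dataset": dataset_name,
        "question": question,
        # hash the data too, so quiet edits to the data are also visible
        "data_hash": hashlib.sha256(json.dumps(data).encode()).hexdigest(),
    }
    with open(TRAIL, "a") as f:  # append-only: earlier runs never disappear
        f.write(json.dumps(entry) + "\n")
    return analysis_fn(data)

def trail_length():
    """The number of analyses run so far, there for all to see."""
    with open(TRAIL) as f:
        return sum(1 for _ in f)
```

Every call adds one line to the log; anyone judging the final result can check how long the trail is.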

I do not think what we need is better methods to deal with multiple testing. This is simply not a statistical issue. What we need is a way to increase the cost of data analysis by preserving the paper trail, so that people hesitate before they run all pairwise combinations of whatever. Reproducible research doesn’t really deal with this problem, because reproducibility only requires that the final analysis be documented.

In other words, let the paper trail be the price of pushing the button.

A deterministic statistical machine

As Roger pointed out, the most recent batch of Y Combinator startups included a bunch of data-focused companies. One of these companies, StatWing, is a web-based tool for data analysis that looks like an improvement on SPSS, with more plain text, more visualization, and a lot of the technical statistical details “under the hood”. I first read about StatWing on TechCrunch, under the headline “How Statwing Makes It Easier To Ask Questions About Data So You Don’t Have To Hire a Statistical Wizard”.

StatWing looks super user-friendly, and the idea of democratizing statistical analysis so more people can access these ideas appeals to me. But, as one of the aforementioned statistical wizards, this had me freaked out for a minute. Once I looked at the software, though, I realized it suffers from the same problem that most “user-friendly” statistical software suffers from: it makes it really easy to screw up a data analysis. It will tell you when something is significant, and if you don’t like that it isn’t, you can keep slicing and dicing the data until it is. The key issue behind getting insight from data is knowing when you are fooling yourself with confounders, small effect sizes, or overfitting. StatWing looks like an improvement on the UI experience of data analysis, but it won’t prevent the false positives that plague science and cost business big $$.

So I started thinking about what kind of software would prevent these sorts of problems while still being accessible to a big audience. My idea is a “deterministic statistical machine” (DSM). Here is how it works: you input a data set and then specify the question you are asking (is variable Y related to variable X? Can I predict Z from W?). Then, depending on your question, it uses a deterministic set of methods to analyze the data, say, regression for inference, linear discriminant analysis for prediction, etc. But the method is fixed and deterministic for each question. It also performs a pre-specified set of checks for outliers, confounders, missing data, maybe even data fudging. It generates a report with a markdown tool and then immediately publishes the result to figshare.
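
Here is a minimal sketch of that dispatch idea (Python; the function names and method choices are placeholders, not a spec from the post). The point is that the question, not the analyst, determines the method.

```python
from scipy import stats  # assumed available; any fixed toolkit would do

def association(y, x):
    """'Is variable Y related to variable X?' Always the same regression."""
    fit = stats.linregress(x, y)
    return {"slope": fit.slope, "p_value": fit.pvalue}

def prediction(z, w):
    """'Can I predict Z from W?' Would always fit, say, LDA (stubbed here)."""
    raise NotImplementedError("one fixed prediction method goes here")

# Deterministic question -> method mapping. The analyst never picks a method.
METHODS = {"association": association, "prediction": prediction}

def dsm(question, *data):
    if question not in METHODS:
        raise ValueError("the machine only answers pre-specified questions")
    # the pre-specified checks (outliers, missing data, confounders) and the
    # markdown report + figshare upload would be chained on here
    return METHODS[question](*data)
```

Calling `dsm("association", y, x)` always runs the same regression; there is no knob to keep turning until something comes out significant.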

The advantage is that people can get their data-related questions answered using a standard tool. It does a lot of the “heavy lifting” in checking for potential problems and produces nice reports. But it is a deterministic algorithm for analysis, so overfitting, fudging the analysis, etc. are harder. Publishing all reports to figshare makes it even harder to fudge the data. If you fiddle with the data to try to get a result you want, there will be a “multiple testing paper trail” following you around.

The DSM should be a web service that is easy to use. Anybody want to build it? Any suggestions for how to do it better? 

Sunday data/statistics link roundup (8/26/12)

First off, a quick apology for missing last week, and thanks to Augusto for noticing! On to the links:

  1. Unbelievably, the BRCA gene patents were upheld by the lower court, despite the Supreme Court coming down pretty unequivocally against patenting correlations between metabolites and health outcomes. I wonder if this one will be overturned if it makes it back up to the Supreme Court. 
  2. A really nice interview with David Spiegelhalter on Statistics and Risk. David runs the Understanding Uncertainty blog and published a recent paper on visualizing uncertainty. My favorite line from the interview might be: “There is a nice quote from Joel Best that ‘all statistics are social products, the results of people’s efforts’. He says you should always ask, ‘Why was this statistic created?’ Certainly statistics are constructed from things that people have chosen to measure and define, and the numbers that come out of those studies often take on a life of their own.”
  3. For those of you who use Tumblr like we do, here is a cool post on how to put technical content into your blog. My favorite thing I learned about is the Github Gist that can be used to embed syntax-highlighted code.
  4. A few interesting and relatively simple stats for projecting the success of NFL teams. One thing I love about sports statistics is that they are totally willing to be super ad hoc and super simple. Sometimes this is all you need to be highly predictive (see, for example, the results of Football’s Pythagorean Theorem; there’s a quick worked example after this list). I’m sure there are tons of more sophisticated analyses out there, but if it ain’t broke… (via Rafa). 
  5. My student Hilary has a new blog that’s worth checking out. Here is a nice review of ProjectTemplate she did. I think the idea of having an organizing principle behind your code is a great one. Hilary likes ProjectTemplate, I think there are a few others out there that might be useful. If you know about them, you should leave a comment on her blog!
  6. This is ridiculously cool. Man City has opened up their data/statistics to the data analytics community. After registering, you have access to many of the statistics the club uses to analyze their players. This is yet another example of open data taking over the world. It’s clear that data generators can create way more value for themselves by releasing cool data, rather than holding it all in house. 
  7. The Portland Public Library has created a website called Book Psychic, basically a recommender system for books. I love this idea. It would be great to have a recommender system for scientific papers.
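
As promised after item 4, here is the Pythagorean projection in a few lines (a sketch; 2.37 is the exponent commonly cited for football, and the point totals below are made up):

```python
def pythagorean_wins(points_for, points_against, games=16, exponent=2.37):
    """Project wins from points scored/allowed: PF^x / (PF^x + PA^x)."""
    pf, pa = points_for**exponent, points_against**exponent
    return games * pf / (pf + pa)

# a made-up team: 420 points scored, 350 allowed, over a 16-game season
print(f"projected wins: {pythagorean_wins(420, 350):.1f}")  # about 9.7
```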