31 Jul

The researcher degrees of freedom - recipe tradeoff in data analysis


An important concept that is only recently gaining the attention it deserves is researcher degrees of freedom. From Simmons et al.:

The culprit is a construct we refer to as researcher degrees of freedom. In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?

So far, researcher degrees of freedom has primarily been used with negative connotations. This probably stems from the original definition of the idea which focused on how analysts could "manufacture" statistical significance by changing the way the data was processed without disclosing those changes. Reproducible research and distributed code would of course address these issues to some extent. But it is still relatively easy to obfuscate dubious analysis by dressing it up in technical language.

One interesting point that I think sometimes gets lost in all of this is the researcher degrees of freedom - recipe tradeoff. You could think of this as the bias-variance tradeoff for big data.

At one end of the scale you can allow the data analyst full freedom, in which case researcher degrees of freedom may lead to overfitting and you open yourself up to the manufacture of statistical results (optimistic significance, point estimates, or confidence intervals). At the other end you can require a recipe for every data analysis, which means it isn't possible to adapt to the unanticipated quirks (missing data mechanisms, outliers, etc.) that may be present in an individual data set.

As with the bias-variance tradeoff, the optimal approach probably depends on your optimality criteria. You could imagine fitting a linear model by minimizing the mean squared error without constraining the researcher degrees of freedom in any way (that might represent an analysis where the researcher tries all possible models, including all types of data munging, choices of which observations to drop, how to handle outliers, etc.) to get the absolute best fit. Of course, this would likely be a strongly overfit model whose apparent fit is optimistically biased. Alternatively you could penalize the flexibility allowed to the analyst. For example, you could minimize a weighted criterion like:

 \sum_{i=1}^n (y_i - b_0 x_{i1} - b_1 x_{i2})^2 + Researcher \; Penalty(\vec{y},\vec{x})

Some examples of the penalties could be:

  • \lambda \times \sum_{i=1}^n 1_{researcher \; dropped \; y_i , x_i \; from \; analysis}
  • \lambda \times \#\{of \; transforms \; tried\}
  • \lambda \times \#\{Outliers \; removed \; ad-hoc\}

You could also combine all of the penalties together into an "elastic researcher net" type approach. Then as the collective penalty \lambda \rightarrow \infty you get the DSM, as you have in a clinical trial, for example. As \lambda \rightarrow 0 you get fully flexible data analysis, which you might want for discovery.
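To make the penalized criterion above a bit more concrete, here is a minimal Python sketch of the combined version. It is only an illustration under some assumptions: the `AnalysisLog` record, the function names, the toy data, and the choice of a single \lambda shared by all three penalties are hypothetical and not part of any existing library. It fits the two-predictor, no-intercept model from the formula by least squares and adds \lambda times the number of discretionary choices the analyst made.

```python
from dataclasses import dataclass


@dataclass
class AnalysisLog:
    """Hypothetical record of the discretionary choices made in one analysis."""
    n_dropped: int = 0      # observations dropped from the analysis
    n_transforms: int = 0   # transforms tried
    n_outliers: int = 0     # outliers removed ad hoc


def fit_two_predictor_ols(x1, x2, y):
    """Least-squares fit of y = b0*x1 + b1*x2 (no intercept, as in the formula)."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("predictors are collinear")
    # Solve the 2x2 normal equations by Cramer's rule.
    b0 = (s1y * s22 - s12 * s2y) / det
    b1 = (s11 * s2y - s12 * s1y) / det
    return b0, b1


def penalized_criterion(x1, x2, y, log, lam):
    """Residual sum of squares plus lam times the count of analyst choices."""
    b0, b1 = fit_two_predictor_ols(x1, x2, y)
    rss = sum((c - b0 * a - b1 * b) ** 2 for a, b, c in zip(x1, x2, y))
    penalty = lam * (log.n_dropped + log.n_transforms + log.n_outliers)
    return rss + penalty


# Toy data: y = 2*x1 + 1*x2 exactly, plus one wild point at the end.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1, 0, 1, 0, 1, 0]
y = [3, 4, 7, 8, 11, 30]

recipe = AnalysisLog()                 # analysis that touches nothing
flexible = AnalysisLog(n_outliers=1)   # analysis that removes the last point

for lam in (0.0, 1000.0):
    full = penalized_criterion(x1, x2, y, recipe, lam)
    trimmed = penalized_criterion(x1[:-1], x2[:-1], y[:-1], flexible, lam)
    print(f"lambda={lam}: recipe={full:.2f}, flexible={trimmed:.2f}")
```

With \lambda = 0 the analysis that removes the wild point has the smaller criterion, while a large \lambda favors the analysis that leaves the data alone, which is the tradeoff described above in miniature.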

Of course if you allow researchers to choose the penalty you are right back to a scenario where you have degrees of freedom in the analysis (the problem you always get with any penalized approach). On the other hand it would make it easier to disclose how those degrees of freedom were applied.

  • Frank Farach

    To the extent that any particular instance of a researcher degree of freedom can be programmatically codified, couldn't one use a bootstrap approach to compute confidence intervals that take the decision into account? If there are multiple such data transformations (e.g., transforming a variable, then eliminating outliers beyond a certain threshold, etc.), simply apply the same sequence of transformations to the same population of bootstrapped samples. Because all of these samples have "used up" the same researcher degrees of freedom (in the same order and same way), the confidence intervals derived from them should be more accurate. This alternative approach makes sense to me, though I haven't tested it. One of its side benefits is that the code itself explicitly documents all researcher degrees of freedom that were used.
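A minimal sketch of the bootstrap idea in this comment might look like the following. It is untested as a statistical method and rests on assumptions: the particular pipeline (log transform, then dropping points more than 2.5 standard deviations from the median, then taking the mean), the function names, and the simulated data are all hypothetical choices made for illustration. The key point is that the whole sequence of decisions is rerun inside every bootstrap resample, and a simple percentile interval is taken over the results.

```python
import math
import random
import statistics


def analysis_pipeline(sample):
    """Hypothetical codified analysis: transform, trim outliers, then estimate."""
    logged = [math.log(v) for v in sample]          # decision 1: log transform
    center = statistics.median(logged)
    spread = statistics.pstdev(logged)
    # Decision 2: drop points more than 2.5 SDs from the median.
    kept = [v for v in logged if abs(v - center) <= 2.5 * spread]
    return statistics.fmean(kept)                   # the estimate itself


def bootstrap_ci(data, statistic, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval with the full pipeline rerun per resample."""
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(
        statistic(rng.choices(data, k=n)) for _ in range(n_boot)
    )
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Simulated positive-valued data, used only to exercise the sketch.
data_rng = random.Random(1)
data = [data_rng.lognormvariate(0.0, 1.0) for _ in range(50)]

lower, upper = bootstrap_ci(data, analysis_pipeline)
print(f"estimate: {analysis_pipeline(data):.3f}")
print(f"95% percentile interval: ({lower:.3f}, {upper:.3f})")
```

As the comment notes, a side benefit is that `analysis_pipeline` documents every decision explicitly. Decisions made by looking at the data outside of that function would not be captured by the interval.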