A menagerie of messed up data analyses and how to avoid them

Jeff Leek

Update: I realize this may seem like I’m picking on people. I really don’t mean to; I have for sure made all of these mistakes and many more. I can give many examples, but the one I always remember is the time Rafa saved me from “I got a big one here” when I made a huge mistake as a first-year assistant professor.

In any introductory statistics or data analysis class they might teach you the basics: how to load a data set, how to munge it, how to do t-tests, maybe how to write a report. But there are a whole bunch of ways that a data analysis can be screwed up that often get skipped over. Here is my first crack at creating a “menagerie” of messed up data analyses and how you can avoid them. Depending on interest I could probably list a ton more, but as always I’m doing the non-comprehensive list :).



Outcome switching

_What it is:_ Outcome switching is where you collect data looking at, say, the relationship between exercise and blood pressure. Once you have the data, you realize that blood pressure isn’t really related to exercise. So you change the outcome and ask if HDL levels are related to exercise and you find a relationship. It turns out that when you do this kind of switch you have now biased your analysis, because you only looked at the second outcome after the first one failed; you would have just stopped if you had found the original relationship.

An example: This article discusses how Paxil, an antidepressant, was originally studied for several main outcomes, none of which showed an effect, but some of the secondary outcomes did. So they switched the outcome of the trial and used this result to market the drug.

What you can do: Pre-specify your analysis plan, including which outcomes you want to look at. Then very clearly state whether you are analyzing a primary outcome or a secondary one. That way people know to take the secondary analyses with a grain of salt. You can even get paid $$ to pre-specify with the OSF’s pre-registration challenge.
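To get a feel for how much outcome switching can hurt, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the sample size, the five outcomes, the fact that the outcomes are independent of each other and of exercise, and all of the function names. The p-value uses the Fisher z approximation for a correlation rather than an exact test.

```python
import math
import random


def correlation_p_value(x, y):
    """Approximate two-sided p-value for a Pearson correlation (Fisher z)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rates(n_trials=1000, n_subjects=50, n_outcomes=5,
                         alpha=0.05, seed=1):
    """Simulate studies where exercise is unrelated to every outcome.

    Returns (rate when only the primary outcome is tested,
             rate when you may switch to whichever outcome "works").
    """
    rng = random.Random(seed)
    primary_hits = 0
    switched_hits = 0
    for _ in range(n_trials):
        exercise = [rng.gauss(0, 1) for _ in range(n_subjects)]
        p_values = []
        for _ in range(n_outcomes):
            # Hypothetical outcome (blood pressure, HDL, ...) with no true effect.
            outcome = [rng.gauss(0, 1) for _ in range(n_subjects)]
            p_values.append(correlation_p_value(exercise, outcome))
        if p_values[0] < alpha:
            primary_hits += 1
        if min(p_values) < alpha:
            switched_hits += 1
    return primary_hits / n_trials, switched_hits / n_trials


if __name__ == "__main__":
    primary, switched = false_positive_rates()
    print("primary outcome only:", primary)
    print("allowed to switch:   ", switched)
```

Under these assumptions the primary-only rate should sit near the nominal 5%, while the chance that at least one of five independent null outcomes comes up significant is 1 - 0.95^5, or roughly 23%.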


Garden of forking paths

_What it is:_ In this case you may or may not have specified your outcome and stuck with it. Let’s assume you have, so you are still looking at blood pressure and exercise. But it turns out a bunch of people had apparently erroneous measures of blood pressure. So you dropped those measurements and did the analysis with the remaining values. This is a totally sensible thing to do, but if you didn’t specify in advance how you would handle bad measurements, you can make a bunch of different choices here (the forking paths). You could drop them, impute them, multiply impute them, weight them, etc. Each of these gives a different result and you can accidentally pick the one that works best even if you are being “sensible”.

An example: This article gives several examples of the forking paths. One is where authors report that at peak fertility women are more likely to wear red or pink shirts. They made several inclusion/exclusion choices (which women to include in which comparison group) that could easily have gone a different direction or were against stated rules.

_What you can do:_ Pre-specify every part of your analysis plan, down to which observations you are going to drop, transform, etc. To be honest this is super hard to do because almost every data set is messy in a unique way. So the best thing here is to point out steps in your analysis where you made a choice that wasn’t pre-specified and you could have made differently. Or, even better, try some of the different choices and make sure your results aren’t dramatically different.
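Trying the different choices can be as simple as running the same estimate down each fork and looking at the results side by side. The sketch below does that for the blood pressure example. The plausibility range of 60 to 250, the three handling strategies, the simulated data, and every function name are hypothetical choices for illustration, not a recommendation for how to clean real blood pressure data.

```python
import random
import statistics


def slope(x, y):
    """Least-squares slope of y on x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def is_plausible(bp):
    """Hypothetical rule: readings outside 60-250 are treated as errors."""
    return 60 <= bp <= 250


def drop_bad(x, y):
    """Fork 1: drop observations with implausible readings."""
    kept = [(a, b) for a, b in zip(x, y) if is_plausible(b)]
    return [a for a, _ in kept], [b for _, b in kept]


def impute_mean(x, y):
    """Fork 2: replace implausible readings with the mean of the good ones."""
    good_mean = statistics.fmean(b for b in y if is_plausible(b))
    return list(x), [b if is_plausible(b) else good_mean for b in y]


def clip_to_range(x, y):
    """Fork 3: pull implausible readings back to the edge of the range."""
    return list(x), [min(max(b, 60), 250) for b in y]


def sensitivity(x, y):
    """Estimate the exercise slope under each handling strategy."""
    strategies = {"drop": drop_bad, "impute_mean": impute_mean,
                  "clip": clip_to_range}
    return {name: slope(*fix(x, y)) for name, fix in strategies.items()}


def make_example_data(n=200, error_rate=0.1, seed=3):
    """Simulated data: a true slope of -1 with some readings recorded as 0."""
    rng = random.Random(seed)
    hours = [rng.uniform(0, 10) for _ in range(n)]
    bp = [130 - 1.0 * h + rng.gauss(0, 8) for h in hours]
    bp = [0.0 if rng.random() < error_rate else b for b in bp]
    return hours, bp


if __name__ == "__main__":
    hours, bp = make_example_data()
    for name, estimate in sensitivity(hours, bp).items():
        print(name, round(estimate, 3))
```

If the estimates agree across forks, the choice probably didn’t matter much. If they disagree, that is worth reporting rather than quietly picking one.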



P-hacking

_What it is:_ The nefarious cousin of the garden of forking paths. Basically here the person outcome switches, uses the garden of forking paths, intentionally doesn’t correct for multiple testing, or uses any of these other means to cheat and get a result that they like.

An example: This one gets talked about a lot and there is some evidence that it happens. But it is usually pretty hard to ascribe purely evil intentions to people and I’d rather not point the finger here. I think that often the garden of forking paths results in just as bad an outcome without people having to try.

What to do: Know how to do an analysis well and don’t cheat.

Update: Some define p-hacking as “when honest researchers face ambiguity about what analyses to run, and convince themselves those leading to better results are the correct ones (see e.g., Gelman & Loken, 2014; John, Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011; Vazire, 2015).” This coincides with the definition of “garden of forking paths”. I have been asked to point this out on Twitter. It was never my intention to accuse anyone of accusing people of fraud. That being said, I still think that the connotation many people have in mind when they think “p-hacking” corresponds to my definition above, although I agree with folks that this isn’t helpful, which is why I prefer we call the non-nefarious version the garden of forking paths.


Uncorrected multiple testing

_What it is:_ This one is related to the garden of forking paths and outcome switching. Most statistical methods for measuring the potential for error assume you are only evaluating one hypothesis at a time. But in reality you might be measuring a ton either on purpose (in a big genomics or neuroimaging study) or accidentally (because you consider a bunch of outcomes). In either case, the expected error rate changes a lot if you consider many hypotheses.

An example: The most famous example is when someone did an fMRI on a dead fish and showed that there were a bunch of significant regions at the P < 0.05 level. The reason is that there is natural variation in the background of these measurements, and if you consider each voxel independently, ignoring that you are looking at a bunch of them, a few will have P < 0.05 just by chance.
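The arithmetic behind the dead fish is short enough to write down. This sketch assumes every test is a true null and that the tests are independent, which real imaging data are not, and the count of 10,000 tests is a made-up number for illustration.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of null tests with p < alpha."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent null tests has p < alpha."""
    return 1 - (1 - alpha) ** n_tests


if __name__ == "__main__":
    n_tests = 10_000  # hypothetical number of voxels
    print(expected_false_positives(n_tests))
    print(prob_at_least_one_false_positive(20))
```

With 10,000 null tests at the 0.05 level you expect about 500 “significant” ones, and even with only 20 tests the chance of at least one false positive is already about 64%.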

What you can do: Correct for multiple testing. When you calculate a large number of p-values make sure you know what their distribution is expected to be and you use a method like Bonferroni, Benjamini-Hochberg, or q-value to correct for multiple testing.
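Two of the corrections named above are simple enough to sketch by hand. The function names and the example p-values are made up for illustration, and in a real analysis you would more likely reach for an established implementation than write your own. The q-value approach is not shown here.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests.

    Controls the chance of making any false positive at all.
    """
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Sort the p-values, find the largest rank k with p_(k) <= k * alpha / m,
    and reject the k smallest. Controls the false discovery rate.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            largest_k = rank
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Made-up p-values, purely for illustration.
    p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
                0.060, 0.074, 0.205, 0.212, 0.216]
    print("uncorrected:", sum(p < 0.05 for p in p_values))
    print("Bonferroni: ", sum(bonferroni(p_values)))
    print("BH:         ", sum(benjamini_hochberg(p_values)))
```

On these ten made-up p-values, five are below 0.05 uncorrected, Bonferroni keeps one, and Benjamini-Hochberg keeps two. Which correction is appropriate depends on whether you care about any false positive at all or about the fraction of false discoveries.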


I got a big one here

What it is: One of the most painful experiences for all new data analysts. You collect data and discover a huge effect. You are super excited so you write it up and submit it to one of the best journals or convince your boss to bet the farm. The problem is that huge effects are incredibly rare and are usually due to some combination of experimental artifacts and biases or mistakes in the analysis. Almost no effects you detect with statistics are huge. Even the relationship between smoking and cancer is relatively weak in observational studies and requires very careful calibration and analysis.

An example: In a paper authors claimed that 78% of genes were differentially expressed between Asians and Europeans. But it turns out that most of the Asian samples were measured in one batch and the Europeans in another. This explained a large fraction of these differences.

What you can do: Be deeply suspicious of big effects in data analysis. If you find something huge and counterintuitive, especially in a well established research area, spend a lot of time trying to figure out why it could be a mistake. If you don’t, others definitely will, and you might be embarrassed.
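One cheap check when a big effect shows up is to cross-tabulate the comparison groups against anything about how the samples were processed. The sketch below does this for a group label and a batch label. The labels, counts, and function names are all hypothetical, and the “purity” summary is just one rough way to put a number on the table, not a formal test.

```python
from collections import Counter, defaultdict


def cross_tabulate(groups, batches):
    """Count samples for every (group, batch) combination."""
    return Counter(zip(groups, batches))


def batch_purity(groups, batches):
    """Return (purity, baseline).

    purity: fraction of samples belonging to the dominant group of their batch.
    baseline: fraction of samples in the largest group overall.
    A purity far above the baseline means batch nearly determines group,
    so group differences cannot be separated from batch differences.
    """
    by_batch = defaultdict(Counter)
    for group, batch in zip(groups, batches):
        by_batch[batch][group] += 1
    dominant = sum(max(counts.values()) for counts in by_batch.values())
    baseline = max(Counter(groups).values()) / len(groups)
    return dominant / len(groups), baseline


if __name__ == "__main__":
    # Hypothetical metadata: almost every sample from a group shares a batch.
    groups = ["Asian"] * 40 + ["European"] * 40
    batches = (["batch1"] * 38 + ["batch2"] * 2
               + ["batch2"] * 38 + ["batch1"] * 2)
    for cell, count in sorted(cross_tabulate(groups, batches).items()):
        print(cell, count)
    print(batch_purity(groups, batches))
```

In this made-up example the purity is 0.95 against a baseline of 0.5, which is the kind of table that should make you very nervous about a huge group difference.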

Double complication

What it is: When faced with a large and complicated data set, beginning analysts often feel compelled to use a big complicated method. Imagine you have collected data on thousands of genes or hundreds of thousands of voxels and you want to use this data to predict some health outcome. There is a severe temptation to use deep learning or blend random forests, boosting, and five other methods to perform the prediction. The problem is that complicated methods fail for complicated reasons, which will be extra hard to diagnose if you have a really big, complicated data set.

An example: There are a large number of examples where people use very small training sets and complicated methods. One example (there were many other problems with this analysis, too) is when people tried to use complicated prediction algorithms to predict which chemotherapy would work best using genomics. Ultimately this paper was retracted for many problems, but the complication of the methods plus the complication of the data made them hard to detect.

What you can do: When faced with a big, messy data set, try simple things first. Use linear regression, make simple scatterplots, check to see if there are obvious flaws with the data. If you must use a really complicated method, ask yourself if there is a reason it is outperforming the simple methods because often with large data sets even simple things work.
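A concrete way to “try simple things first” is to keep a couple of dumb baselines around and score them on held-out data, so that any complicated method has something to beat. This is a minimal sketch with one predictor. The function names and the tiny data set are made up for illustration.

```python
import statistics


def fit_line(x, y):
    """Ordinary least squares for a single predictor: returns (slope, intercept)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return statistics.fmean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def baseline_report(x_train, y_train, x_test, y_test):
    """Held-out error for two simple baselines.

    "mean_only" always predicts the training mean.
    "linear" predicts from a straight-line fit.
    """
    slope, intercept = fit_line(x_train, y_train)
    train_mean = statistics.fmean(y_train)
    return {
        "mean_only": mean_squared_error(y_test, [train_mean] * len(y_test)),
        "linear": mean_squared_error(
            y_test, [intercept + slope * v for v in x_test]),
    }


if __name__ == "__main__":
    # Made-up data for illustration.
    x_train = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
    y_train = [1.2, 2.9, 5.1, 7.0, 8.8, 11.1]
    x_test = [6.0, 7.0]
    y_test = [13.1, 14.8]
    print(baseline_report(x_train, y_train, x_test, y_test))
```

If the fancy method can’t clearly beat the “linear” number on held-out data, the extra complication probably isn’t buying you anything, and it will make failures harder to diagnose.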





