Statistical vitriol

Jeff Leek
2016-09-29

Over the last few months there has been a lot of vitriol around statistical ideas. First there were data parasites and then there were methodological terrorists. These epithets came from established scientists who have relatively little statistical training. There was a predictable backlash against these folks from their counterparts, typically statisticians or statistically trained folks who care about open source.

I’m a statistician who cares about open source, but I also frequently collaborate with scientists from different fields. It makes me sad and frustrated that statistics - which I’m so excited about and have spent my entire professional career working on - is causing so much frustration, anxiety, and anger.

I have been thinking a lot about the cause of this anger and division in the sciences. As a person who interacts with both groups pretty regularly I think that the reasons are some combination of the following.

  1. Data is now everywhere, so every single publication involves some level of statistical modeling and analysis. It can’t be escaped.
  2. The deluge of scientific papers means that only big claims get your work noticed, get you into fancy journals, and get you attention.
  3. Most senior scientists, the ones leading and designing studies, have little or no training in statistics. There is a structural reason for this: data was sparse when they were trained and there wasn’t any reason for them to learn statistics. So statistics and data science weren’t (and still often aren’t) integrated into medical and scientific curricula.
  4. There is an imbalance of power in the scientific process between statisticians/computational scientists and scientific investigators or clinicians. The clinicians/scientific investigators are “in charge” and the statisticians are often relegated to a secondary role. Statisticians with some control over their environment (think senior tenured professors of (bio)statistics) can avoid these imbalances and look for collaborators who respect statistical thinking, but not everyone can. There are a large number of lonely bioinformaticians out there.
  5. Statisticians and computational scientists are also frustrated because there is often no outlet for them to respond to these papers in the formal scientific literature - those outlets are controlled by scientists and rarely have statisticians in positions of influence within the journals.

Since statistics is everywhere (1), only flashy claims get you into journals (2), and the people leading studies don’t understand statistics very well (3), you get many papers that make big claims based on shaky statistics but get through anyway. This frustrates the statisticians, because they have little control over the process (4) and can’t get their concerns into the published literature (5).

This used to just result in lots of statisticians and computational scientists complaining behind closed doors. The internet changed all that; everyone is an internet scientist now. So the statisticians and statistically savvy take to blogs, f1000research, and other outlets to get their point across.

To get attention, statisticians now face the same problem as scientists: their complaints need to attract attention to have any effect. So they go over the top. They accuse people of fraud, or of being statistically dumb, or nefarious, or of intentionally manipulating data, or they cast a wide net and try to implicate a large number of scientists in poor statistics. The irony is that these are the very attention-getting tactics by scientists that frustrated the statisticians in the first place.

Just to be 100% clear here: I am also guilty of this. I have definitely fallen into the hype trap - talking about the “replicability crisis”. I also made the mistake earlier in my blogging career of trashing the statistics of a paper that frustrated me. I am embarrassed about that now; it wasn’t constructive, and the author ended up being very responsive. If I had just emailed that person, I think they would have resolved the problem.

I recently had an experience where the authors of a very prominent paper hadn’t made their data public and I was having trouble getting the data. I thought about writing a blog post to get attention, but at the end of the day I just did the work of emailing the authors, explaining myself over and over, and finally getting the data from them. The result is the same (I have the data), but it cost me time and frustration. So I understand when people don’t want to deal with that.

The problem is that scientists see the attention the statisticians are calling down on them - attention that is primarily negative and often over-hyped. Then they get upset and call the statisticians/open scientists names, or push back on entirely sensible policies, because they are worried about being humiliated or discredited. While I don’t agree with that response, I also understand the feeling of “being under attack”. I’ve had that happen to me too and it doesn’t feel good.

So where do we go from here? How do we end statistical vitriol and make statistics a positive force? Here is my six-part plan:

  1. We should create continuing education for senior scientists and physicians in statistical and open data thinking, so people who never got that training can understand the unique requirements of a data-rich scientific world.
  2. We should encourage journals and funders to incorporate statisticians and computational scientists at the highest levels of influence, so that they can drive policy that makes sense in this new data-driven time.
  3. We should recognize that scientists and data generators have a lot more on the line when they produce a result or a scientific data set. We should give them appropriate credit for doing that even if they don’t get the analysis exactly right.
  4. We should de-escalate the consequences of statistical mistakes. Right now the consequences are retractions that hurt careers, blog posts that are aggressive and often too personal, and humiliation by the community. We should make it easy to acknowledge these errors without ruining careers. This will be hard - scientists’ careers often depend on the results they get (recall 2 above). So we need a way to pump up/give credit to/acknowledge scientists who are willing to sacrifice that to get the stats right.
  5. We need to stop treating retractions/statistical errors/mistakes like a sport where there are winners and losers. Statistical criticism should be easy, allowable, publishable, and not angry or personal.
  6. Any paper where statistical analysis is part of the work must have a statistically trained author, a statistically trained reviewer, or both. I wouldn’t believe a paper on genomics performed entirely by statisticians with no biology training any more than I would believe a paper with statistics in it performed entirely by physicians with no statistical training.

I think scientists forget that statisticians feel disempowered in the scientific process, and statisticians forget that a lot is riding on any given study for a scientist. Being a little more sympathetic to the pressures we all face would go a long way toward resolving statistical vitriol.

I’d be eager to hear other ideas too. It makes me sad that statistics has become so political on both sides.