16 Dec

A summary of the evidence that most published research is false


One of the hottest topics in science has two main conclusions:

  • Most published research is false
  • There is a reproducibility crisis in science

The first claim is often stated in a slightly different way: that most results of scientific experiments do not replicate. I recently got caught up in this debate and I frequently get asked about it.

So I thought I'd do a very brief review of the reported evidence for the two perceived crises. An important point is that all of the scientists below have made the best effort they can to tackle a fairly complicated problem, and these are early days in the study of science-wise false discovery rates. But the take home message is that there is currently no definitive evidence one way or the other about whether most results are false.

  1. Paper: Why most published research findings are false. Main idea: People use hypothesis testing to determine whether specific scientific discoveries are significant. This significance calculation is used as a screening mechanism in the scientific literature. Under assumptions about the way people perform and report these tests, it is possible to construct a universe where most published findings are false positive results (a small numerical sketch of this calculation appears after the list below). Important drawback: The paper contains no real data; it is based purely on conjecture and simulation.
  2. Paper: Drug development: Raise standards for preclinical research. Main idea: Many drugs fail when they move through the development process. Amgen scientists tried to replicate 53 high-profile basic research findings in cancer and could only replicate 6. Important drawback: This is not a scientific paper. The study design, the replication attempts, the selected studies, and the statistical methods used to define "replicate" are not described. No data are made available or provided.
  3. Paper: An estimate of the science-wise false discovery rate and application to the top medical literature. Main idea: The paper collects P-values from published abstracts of papers in the medical literature and uses a statistical method to estimate the false discovery rate proposed in paper 1 above (a rough illustration of estimating a false discovery rate from a collection of P-values appears after the list). Important drawback: The paper only collected data from the abstracts of major medical journals, and P-values can be manipulated in many ways that could call the statistical results of the paper into question.
  4. Paper: Revised standards for statistical evidence. Main idea: The P-value cutoff of 0.05 is used by many journals to determine statistical significance. This paper proposes an alternative method for screening hypotheses based on Bayes factors (see the sketch of minimum Bayes factor bounds after the list). Important drawback: The paper is a theoretical and philosophical argument for simple hypothesis tests. The data analysis recalculates Bayes factors for reported t-statistics, plots the Bayes factor versus the t-test, and then argues for why one is better than the other.
  5. Paper: Contradicted and initially stronger effects in highly cited research. Main idea: This paper looks at pairs of studies that attempted to answer the same scientific question, where the second study had a larger sample size or a more robust study design (e.g., a randomized trial). Some effects reported in the second study do not exactly match the results of the first. Important drawback: The title does not match the results. 16% of studies were contradicted (meaning an effect in the opposite direction), 16% reported a smaller effect size, 44% were replicated, and 24% were unchallenged. So 44% + 24% + 16% = 84% were not contradicted. Lack of replication is also not proof of error.
  6. Paper: Modeling the effects of subjective and objective decision making in scientific peer review. Main idea: This paper considers a theoretical model for how referees of scientific papers may behave socially. The authors use simulations to point out how an effect called "herding" (basically peer-mimicking) may lead to biases in the review process. Important drawback: The model makes major simplifying assumptions about human behavior and supports its conclusions entirely with simulation. No data are presented.
  7. Paper: Repeatability of published microarray gene expression analyses. Main idea: This paper attempts to collect the data used in published papers and to repeat one randomly selected analysis from each paper. For many of the papers the data were either not available or available in a format that made it difficult or impossible to repeat the analysis performed in the original paper. The types of software used were also often unclear. Important drawback: This paper covered 18 data sets from 2005-2006. This is both early in the era of reproducibility and not comprehensive in any way. It says nothing about the rate of false discoveries in the medical literature, but it does speak to the reproducibility of genomics experiments 10 years ago.
  8. Paper: Investigating variation in replicability: The "Many Labs" replication project (not yet published). Main idea: The idea is to take a set of published high-profile results and have multiple labs try to replicate them. The labs successfully replicated 10 out of 13 results, and the distribution of results is about what you would expect (see the embedded figure below). Important drawback: The paper isn't published yet and it only covers 13 experiments. That being said, this is by far the strongest, most comprehensive, and most reproducible analysis of replication among all the papers surveyed here.
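
As a small numerical sketch of the argument in paper 1: if only a small fraction of the hypotheses scientists test are true, then even well-powered tests at the usual 0.05 level produce mostly false positives. The prior, power, and alpha values below are illustrative assumptions of mine, not numbers taken from the paper.

```python
# Minimal sketch of the arithmetic behind "most findings are false".
# The prior, power, and alpha below are illustrative assumptions only.

def positive_predictive_value(prior_true, power, alpha):
    """Fraction of significant findings that are actually true, assuming
    independent tests and honest reporting."""
    true_positives = power * prior_true
    false_positives = alpha * (1 - prior_true)
    return true_positives / (true_positives + false_positives)

for prior in (0.5, 0.1, 0.01):
    ppv = positive_predictive_value(prior_true=prior, power=0.8, alpha=0.05)
    print(f"prior P(true) = {prior}: PPV = {ppv:.2f}")
# When only 1% of tested hypotheses are true, fewer than 15% of the
# significant findings are true; when half are true, most findings hold up.
```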
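
For paper 3, the actual model (a mixture of a uniform and a truncated beta distribution fit to P-values below 0.05) is more involved than can be shown here. As a rough illustration of the general idea of estimating a false discovery rate from a collection of P-values, here is a simpler Storey-style sketch on simulated data; the simulation settings are arbitrary.

```python
# Rough illustration of estimating a false discovery rate from a collection
# of P-values, on simulated data. This is a simple Storey-style estimate,
# not the uniform/truncated-beta mixture model the paper actually fits.
import numpy as np

rng = np.random.default_rng(0)
m_null, m_alt = 8000, 2000
# Null P-values are uniform; alternative P-values are concentrated near zero.
p = np.concatenate([rng.uniform(size=m_null),
                    rng.beta(0.3, 6.0, size=m_alt)])

lam, t = 0.5, 0.05
pi0_hat = np.mean(p > lam) / (1 - lam)   # estimated fraction of null hypotheses
fdr_hat = pi0_hat * t / np.mean(p <= t)  # estimated FDR among P-values <= 0.05
print(f"estimated pi0 = {pi0_hat:.2f}, estimated FDR at 0.05 = {fdr_hat:.2f}")
```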
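
For paper 4, the paper's own argument rests on uniformly most powerful Bayesian tests. As a simpler illustration of the same point, that P = 0.05 corresponds to fairly weak evidence, here are the standard minimum Bayes factor bounds; this is a stand-in calculation rather than the paper's procedure.

```python
# Standard minimum Bayes factor bounds, showing why P = 0.05 is often weak
# evidence. This is a substitute calculation, not the paper's UMPBT procedure.
import math

def bf_bound_normal(z):
    """Smallest Bayes factor in favor of the null over all simple
    normal-mean alternatives: exp(-z^2 / 2)."""
    return math.exp(-z * z / 2)

def bf_bound_sellke(p):
    """Sellke-Bayarri-Berger bound -e * p * ln(p), valid for p < 1/e."""
    return -math.e * p * math.log(p)

for p, z in ((0.05, 1.96), (0.005, 2.81)):
    print(f"p = {p}: evidence against the null is at most "
          f"{1 / bf_bound_normal(z):.0f}:1 (normal bound) or "
          f"{1 / bf_bound_sellke(p):.0f}:1 (Sellke bound)")
```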

I do think that the reviewed papers are important contributions because they draw attention to real concerns about the modern scientific process. Namely

  • We need more statistical literacy
  • We need more computational literacy
  • We need to require code be published
  • We need mechanisms of peer review that deal with code
  • We need a culture that doesn't use reproducibility as a weapon
  • We need increased transparency in review and evaluation of papers

Some of these have simple fixes (more statistics courses, publishing code), while others are much, much harder (changing publication/review culture).

The Many Labs project (Paper 8) points out that statistical research is proceeding in a fairly reasonable fashion. Some effects are overestimated in individual studies, some are underestimated, and some are just about right. Regardless, no single study should stand alone as the last word about an important scientific issue. It obviously won't be possible to replicate every study as intensely as those in the Many Labs project, but this is a reassuring piece of evidence that things aren't as bad as some paper titles and headlines may make it seem.

Figure: Many Labs data. Blue x's are original effect sizes; the other dots are effect sizes from replication experiments (http://rolfzwaan.blogspot.com/2013/11/what-can-we-learn-from-many-labs.html).

The Many Labs results suggest that the hype about the failures of science is, at the very least, premature. I think an equally important idea is that science has pretty much always worked with some number of false positive and irreplicable studies. This was beautifully described by Jared Horvath in this blog post from the Economist. I think the take home message is that regardless of the rate of false discoveries, the scientific process has led to amazing and life-altering discoveries.

  • http://www.refsmmat.com/ Alex Reinhart

    I think the most amusing example would be Schoenfeld and Ioannidis's paper "Is everything we eat associated with cancer? A systematic cookbook review":

    http://ajcn.nutrition.org/content/97/1/127

    They show that, at least in the field of nutritional epidemiology (which produces articles which are bound to be regurgitated and oversimplified on TV news), contradictory and weak results are very common. 36 of the 40 ingredients for which they found studies had results showing both higher and lower cancer risks. Meta-analysis showed most of the risks were almost nonexistent.

  • Noah Simon

    I think there is a strong argument to be made that much of the issue is selection bias in effect-size estimation. I tend to use frequentist over Bayesian methods, but something many frequentists do very poorly is estimate effect sizes in high-throughput experiments. They (/we) tend to adjust for multiplicity in testing using FDR or the like quite reasonably, and then, once we decide which genes are significant, just report unadjusted effect-size estimates! This is really bad and can lead to a huge amount of bias. I suppose this isn't necessarily a discussion of whether the effects are "true" or not, but their scientific significance is certainly uniformly overstated when uncorrected estimates are used (a quick simulation of this selection effect is sketched below).

    I just put a paper on the arXiv which I think is relevant and intuitive on this point:

    http://arxiv.org/abs/1311.3709

    I think it's an important, interesting, and underexplored issue.
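
    A minimal simulation of the selection effect described above, with arbitrary illustrative settings; it is not the analysis from the arXiv paper linked in the comment.

```python
# A minimal simulation of the selection effect described in the comment above
# (the "winner's curse"): genes that pass an FDR screen have unadjusted effect
# estimates biased away from zero. Settings are arbitrary and illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
m, se = 10000, 0.1
true_effect = np.zeros(m)
true_effect[:500] = 0.25                      # 5% of genes carry a real effect
estimate = true_effect + rng.normal(scale=se, size=m)
pvals = 2 * norm.sf(np.abs(estimate) / se)

# Benjamini-Hochberg selection at FDR 0.05
order = np.argsort(pvals)
passed = pvals[order] <= 0.05 * np.arange(1, m + 1) / m
k = passed.nonzero()[0].max() + 1 if passed.any() else 0
selected = order[:k]

hits = selected[true_effect[selected] > 0]    # selected genes with a real effect
print(f"true effect = 0.25; mean unadjusted estimate among selected real "
      f"effects = {estimate[hits].mean():.2f}")
```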

  • openbrain

    Here's one you're missing: Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 14, 1105-1107.

  • jshoyer

    What do you mean by "use reproducibility as a weapon"? Something similar to Roger's post? http://simplystatistics.org/2013/04/30/reproducibility-and-reciprocity/

  • AnneWaller57

    The dismissal of item #2 on your list seems particularly weak to me. This was a highly credible report in Nature by a serious guy (Begley) who ran a major biotech's research activities for many years. They sometimes tried as many as 50 times to replicate particular results. Of course it wasn't a random sample of studies they tried to replicate--it was important findings in preclinical oncology. What he said was also echoed by people from Merck. You think he is going to make this stuff up, when everyone he has ever worked with is going to read it?

    You lose a lot of credibility when you say that if this company won't publish all the details--which would definitely not be in the interest of their stockholders and which they simply cannot do for many reasons--then you are going to whistle real loud and ignore what the man is saying. I think anyone of common sense is going to realize that he is describing something real and you are pretending you can make it go away for...whatever little rhetorical point you are trying to make here, which was not very clear.

    • jtleek

      This is called an argument from authority (http://en.wikipedia.org/wiki/Argument_from_authority). You say Begley is very serious; that is fine. But to claim that that paper was a scientific paper is factually inaccurate. A scientific paper must report data, methods, and results. They reported none of these, just an unsupported claim.

      I am perfectly comfortable with my statement and my credibility.

      • David Atkins

        Your link does not prove what you claim it proves.

        Credibility? We'll be the judges of that.

    • Roger Peng

      It's worth noting that the Begley/Ellis report was published in the Comment section of Nature. Why? Because even Nature can occasionally tell the difference between an actual research article and a commentary. The point is we should hold everyone to the same standard of scientific conduct. Begley and Ellis may have reported something interesting, but it wasn't science. In particular, their report is not reproducible in any meaningful way. They might as well have said, "We saw some UFOs, but we can't give you the details."

      • http://www.gwern.net/ gwern

        The fact that it is irreproducible because they had to sign NDAs, IIRC, seems as damning as the results they claimed to reach.

      • gagz

        I don't think Begley's comments in the post were meant to be taken without evidence. In fact, I got the impression that he and William Gunn are involved in the same study. Piecing the two's comments together, it seems they are giving ample time for the "accused" (but not yet named) to reproduce the results being questioned.

        Maybe I'm wrong, but again it seems that the two are involved in some sort of covert sting in which they're now going to the perpetrators and saying "we'll give you ample time, and your own reagents, to reproduce the results in this article" (I say this because their comments suggest this phase is still in process).

        In short: it is reckless to accuse Begley of making concrete claims, as he has not named parties or studies, and in fact seems to be giving these groups some outs.

  • Vladimir Morozov

    You might consider adding this attempt to replicate biomedical publications to your list: http://blogs.nature.com/news/2011/09/reliability_of_new_drug_target.html

    My colleagues tried to re-test multiple compounds reported to be beneficial for survival in the ALS animal model and failed to find any effect.
    http://scholar.google.com/scholar?cluster=15616302132999277464&hl=en&as_sdt=40000005&sciodt=0,22&as_ylo=2013

    Though I agree with you that such replication studies should meet replication criteria themselves by making data and analytical code freely available.

  • Deborah Mayo

    Thanks for the highly informative and clear synthesis. I've tried to get to the bottom of these different avenues of attack, and found it difficult*--so this helps a great deal. (On one of the comments: although Begley and Ellis found some gross errors, the upshot was to show the value of good design and simple steps such as blinding.)
    *One example: http://errorstatistics.com/2013/11/09/beware-of-questionable-front-page-articles-warning-you-to-beware-of-questionable-front-page-articles-i/

  • RIP1234

    2005 = "early in the era of reproducibility" What does that even mean???

  • http://isomorphismes.tumblr.com/ isomorphisms

    Thanks for the valuable summary Jeff.

  • ap

    Your summary of the Bayes factor paper is misleading. The author also suggests changing the p-value cutoff to 0.005, not just reporting the Bayes factor. The Bayes factor should basically accompany this new, more stringent p-value cutoff.

  • http://www.twentylys.com/ TwentyLYS

    We need more statistical literacy
    We need more computational literacy
    We need to require code be published
    We need mechanisms of peer review that deal with code
    We need a culture that doesn't use reproducibility as a weapon
    We need increased transparency in review and evaluation of papers
    Great points! I have seen people who only studied Stat 101 leading statistical analysis projects and writing the papers themselves. No wonder that produces over- or under-estimation, and sometimes hardly any estimation at all!

    The first point strikes at the heart of the issue!

    Great job Jeff. There will surely be a lot of discussion about some of the listed points.

  • Daneel_Olivaw

    Very interesting and hopeful article.
    I've just found your blog while searching for researcher degrees of freedom and I'm adding it to my RSS feeds :D

  • salustri

    One important point that, I think, is missing here is that these alleged problems of statistical significance appear to be limited to the health and biological sciences and exclude the rest of science. So to say that "most published research is false" is an overgeneralization.

  • http://www.jackauty.com Jack Auty

    This article kind of reminds me of Ben Goldacre talking about a funnel plot indicating positive publication bias in the very studies that used funnel plots to point out that positive publication bias was occurring.
    This article criticizes other articles for being theoretical, not reporting data, and not being peer-reviewed scientific studies, whilst itself being theoretical, not reporting data, and not being a peer-reviewed scientific study.