
The ROC curves of science

Andrew Gelman's recent post on what he calls the "scientific mass production of spurious statistical significance" reminded me of a thought I had back when I read John Ioannidis' paper claiming that most published research findings are false. Many authors, whom I will refer to as the pessimists, have joined Ioannidis in making similar claims and repeatedly blaming the current state of affairs on the mindless use of frequentist inference. The gist of my thought is that, for some scientific fields, the pessimists' criticism misses a critical point: in practice, lowering the rate of false discoveries also lowers the rate of true discoveries, and true discoveries from fields such as the biomedical sciences provide an enormous benefit to society. Before I explain this in more detail, I want to be very clear that I do think reducing false discoveries is an important endeavor and that some of these false discoveries are completely avoidable. But, as I describe below, a general solution that improves the current situation is much more complicated than simply abandoning the frequentist inference that currently dominates.

Few will deny that our current system, with all its flaws, still produces important discoveries. Many of the pessimists' proposals for reducing false positives seem to be, in one way or another, a call for being more conservative in reporting findings. Examples of such recommendations include requiring larger effect sizes or smaller p-values, correcting for the "researcher degrees of freedom", and using Bayesian analyses with pessimistic priors. I tend to agree with many of these recommendations, but I have yet to see a specific proposal on exactly how conservative we should be. Note that we could easily bring the false positives all the way down to 0 by taking these recommendations to their extreme and stopping the publication of biomedical research results altogether. This absurd proposal brings me to receiver operating characteristic (ROC) curves.

[Figure: two imagined ROC curves, one for physics and one for biomedical research]

ROC curves plot true positive rates (TPR) versus false positive rates (FPR) for a given classification procedure. For example, suppose a regulatory agency that runs randomized trials on drugs (e.g., the FDA) classifies a drug as effective when a pre-determined statistical test produces a p-value < 0.05 or a posterior probability > 0.95. This procedure will have a historical false positive rate and true positive rate pair: one point on an ROC curve. We can change the 0.05 to, say, 0.2 (or the 0.95 to 0.80) and we would move up the ROC curve: higher FPR and TPR. Not doing research at all would put us at the useless bottom left corner. It is important to keep in mind that biomedical science is done by imperfect humans on imperfect and stochastic measurements, so to make discoveries the field has to tolerate some false discoveries (ROC curves don't shoot straight up from 0% to 100%). Also note that it can take years to figure out which publications report important true discoveries.
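To make this concrete, here is a minimal simulation sketch (my own toy example with made-up effect and sample sizes, not data from any actual trials) of how a p-value cutoff maps to one (FPR, TPR) operating point, and how relaxing the cutoff from 0.05 to 0.2 moves that point up the curve:

```python
# Toy simulation (hypothetical numbers): half the "trials" test a drug with a real
# effect, half test a useless one; each p-value cutoff then implies one (FPR, TPR) point.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_per_arm, effect = 1000, 25, 0.5   # all made up for illustration

def simulate_pvalues(has_effect):
    """p-values from two-sample t-tests comparing treated vs. control outcomes."""
    pvals = []
    for _ in range(n_trials):
        control = rng.normal(0, 1, n_per_arm)
        treated = rng.normal(effect if has_effect else 0, 1, n_per_arm)
        pvals.append(stats.ttest_ind(treated, control).pvalue)
    return np.array(pvals)

p_null, p_alt = simulate_pvalues(False), simulate_pvalues(True)

for cutoff in (0.05, 0.2):
    fpr = (p_null < cutoff).mean()   # share of useless drugs declared effective
    tpr = (p_alt < cutoff).mean()    # share of effective drugs declared effective
    print(f"cutoff {cutoff}: FPR ~ {fpr:.2f}, TPR ~ {tpr:.2f}")
```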

I am going to use the concept of an ROC curve to distinguish between reducing the FPR by being statistically more conservative and reducing the FPR via more general improvements. In my ROC curve the y-axis represents the number of important discoveries per decade and the x-axis the number of false positives per decade (to avoid confusion I will continue to use the acronyms TPR and FPR). The current state of biomedical research is represented by one point on the red curve: one (FPR, TPR) pair. The pessimists argue that the FPR is close to 100% of all results, but they rarely comment on the TPR. Being more conservative lowers our FPR, which saves us time and money, but it also lowers our TPR, which could reduce the number of important discoveries that improve human health. So what is the optimal balance, and how far are we from it? I don't think this is an easy question to answer.

Now, one thing we can all agree on is that moving the ROC curve up is a good thing, since it means we get a higher TPR for any given FPR. Examples of ways we can achieve this are developing better measurement technologies, statistically improving the quality of these measurements, augmenting the statistical training of researchers, thinking harder about the hypotheses we test, and making fewer coding or experimental mistakes. However, applying a more conservative procedure does not move the ROC curve up; it moves our point down the existing curve: we reduce our FPR, but we reduce our TPR as well.
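A small numerical sketch of this distinction (my own toy numbers, using the standard power formula for a two-sided z-test): halving the measurement noise raises the TPR at every cutoff, i.e. it lifts the whole curve, whereas relaxing the cutoff from 0.05 to 0.20 only trades a higher FPR for a higher TPR on whichever curve we are already on.

```python
# Toy illustration (assumed effect size, sample size, and noise levels):
# power of a two-sided z-test = TPR when the false positive rate is fixed at alpha.
import numpy as np
from scipy.stats import norm

effect, n_per_arm = 0.5, 25   # hypothetical true effect and per-arm sample size

def tpr_at(alpha, noise_sd):
    se = noise_sd * np.sqrt(2 / n_per_arm)   # standard error of the difference
    z_crit = norm.ppf(1 - alpha / 2)         # critical value at the chosen FPR
    return norm.cdf(effect / se - z_crit) + norm.cdf(-effect / se - z_crit)

for noise_sd in (1.0, 0.5):   # better measurement technology ~ less noise
    print(f"noise sd {noise_sd}: TPR at alpha 0.05 = {tpr_at(0.05, noise_sd):.2f}, "
          f"TPR at alpha 0.20 = {tpr_at(0.20, noise_sd):.2f}")
```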

In the plot above I draw two imagined ROC curves: one for physics and one for biomedical research. The physicists' curve looks great. Note that it shoots up really fast, which means they can make most of the available discoveries with very few false positives. Perhaps due to the maturity of the field, physicists can afford to, and tend to, use very stringent criteria. The biomedical research curve does not look as good. This is mainly because biology is far more complex, and harder to model mathematically, than physics. However, because there is a larger uncharted territory and more research funding, I argue that the rate of discoveries is higher in biomedical research than in physics. But to achieve this higher TPR, biomedical research has to tolerate a higher FPR. According to my imaginary ROC curves, if we became as stringent as the physicists our TPR would be five times smaller. It is not obvious to me that this would result in a better situation than the current one. At the same time, note that the red ROC curve suggests that increasing the FPR, in the hope of increasing our TPR, is not a good idea, because the curve is quite flat beyond our current location.
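Since the figure itself did not survive here, the sketch below (shapes and numbers entirely invented for illustration; this is not the original plot) conveys the qualitative picture:

```python
# Invented curves, only to convey the qualitative point described above:
# physics rises almost immediately but plateaus low; biomedical research rises
# more slowly but, with more uncharted territory and funding, plateaus higher.
import numpy as np
import matplotlib.pyplot as plt

fpr = np.linspace(0, 1, 200)              # false positives per decade (rescaled)
physics = 30 * (1 - np.exp(-80 * fpr))    # shoots up very fast, low ceiling
biomed = 100 * (1 - np.exp(-5 * fpr))     # flatter, higher ceiling

plt.plot(fpr, physics, color="blue", label="physics (imagined)")
plt.plot(fpr, biomed, color="red", label="biomedical research (imagined)")
plt.xlabel("False positives per decade")
plt.ylabel("Important discoveries per decade")
plt.legend(loc="lower right")
plt.show()
```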

Clearly I am oversimplifying a very complicated issue, but I think it is important to point out that there are two discussions to be had: 1) where should we be on the ROC curve (keeping in mind the relationship between FPR and TPR)? and 2) what can we do to improve the ROC curve? My own view is that we can probably move down the ROC curve somewhat and reduce the FPR without much loss in TPR (for example, by raising awareness of the researcher degrees of freedom). But I also think that most of our efforts should go toward reducing the FPR by improving the ROC curve itself. In general, I think statisticians can add to the conversation about 1) while continuing to collaborate on moving the red ROC curve up.

  • Frank Farach

    Thanks for an interesting post, Rafael. I agree that we need to have a discussion about the benefits and costs of these tradeoffs and how they interact with the publishing system. I would like to see us move more toward a post-publication peer review model like the F1000 journals. That model would give us two thresholds rather than one: the first is lenient, in that practically everything gets published with open access (OA); the second (peer review) is community-driven and determines what gets picked up by indexing services such as PubMed. That way, important discoveries get out there (along with less important ones), but because peer review still happens (in the open), it should be easier to distinguish the truly important discoveries far earlier than in our current system.

    The other big issue I see in many fields is that we simply don't know how many of our "important findings" are replicable. A finding that doesn't replicate well has reduced importance. Your point about ROCs stands, but there is definitely some noise on the Y-axis that can be reduced by incentivizing researchers to replicate studies with high potential for impact.

  • John Hogenesch

    One point I try to drive home with my trainees is that the table with p- or q-values isn't the end of the experiment, it's the beginning. If you take an observation, even a statistically spurious one, and by intuition or sheer luck pick a gene/protein and do extensive experiments to prove that it's right, it really doesn't matter if the p-value from big data was too lax. In other words, the system may be poorly set up to consume tables of p-values at face value, but for those of us who like to follow up and do the biology afterwards, it works out in the end.

  • http://www.refsmmat.com/ Alex Reinhart

    I think a statistical pessimist has a lot more to worry about than stringent statistical thresholds. I've been building a compilation of statistical errors for a while now, and there are a number of errors that affect numerous papers. They're the low-hanging fruit in the effort to improve the ROC curve:

    1. Never calculating statistical power or required sample sizes. You are allowed to turn right at a red light because the first studies analyzing the problem didn't have enough power to detect 60% greater rates of pedestrian injuries. Something like 64% of randomized controlled medical trials don't collect enough data to detect a 50% difference between treatment groups. You can't even tell where you are on the ROC curve if you never bother calculating your power (see the power-calculation sketch after this list).

    2. Gelman's "The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant" error.

    3. Physical scientists (such as, say, most climate scientists) judge whether two figures are significantly different by eyeballing the confidence intervals. This is a much stricter test than a t test, more equivalent to requiring p < 0.01. So they're incorrectly producing false negatives, not false positives.

    4. The combination of low power and publication bias means that the only published studies are those that got lucky and produced an exaggerated effect size. I think Ioannidis calls this "truth inflation."
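
    A minimal power-calculation sketch for point 1 (all numbers hypothetical), using statsmodels:

    ```python
    # How many patients per arm does a two-arm trial need to detect an assumed
    # standardized effect of 0.3 with 80% power at alpha = 0.05?
    from statsmodels.stats.power import tt_ind_solve_power

    n_per_arm = tt_ind_solve_power(effect_size=0.3, alpha=0.05, power=0.8)
    print(f"patients needed per arm: {n_per_arm:.0f}")

    # Conversely: with only 30 patients per arm, what power do we actually have?
    power = tt_ind_solve_power(effect_size=0.3, alpha=0.05, nobs1=30)
    print(f"power with 30 per arm: {power:.2f}")
    ```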

    These aren't errors correctable with p-value tinkering, like multiple comparisons could be. (Or the Bayesian complaints that p-values exaggerate evidence compared to Bayes factors.) Statisticians need to step up to educate scientists on the use of these procedures. The underlying issues in "Why Most Published Research Findings are False" would still exist, but we'd dent the ROC curve significantly.

    • Peter Hickey

      Alex - Have you got a reference for "You are allowed to turn right at a red light because the first studies analyzing the problem didn't have enough power to detect 60% greater rates of pedestrian injuries"? Sounds interesting.

      • http://www.refsmmat.com/ Alex Reinhart

        Yeah, it's the first example given in this paper:

        Hauer, E. (2004). The harm done by tests of significance. Accident Analysis & Prevention, 36(3), 495–500. doi:10.1016/S0001-4575(03)00036-8

        • Peter Hickey

          Thanks!

  • Alex Whitworth

    I find this similar to an argument made in Chapters 2 and 4 of "The Org," which I recommend, and which is quoted below:

    distinction between “star” and “guardian” tasks:
    * safety, audit, and compliance departments are similarly given the unglamorous job of guarding against the catastrophic decisions of others.
    * Star performers, by contrast, do their jobs best when they’re swinging for the fences, not worrying about risk. If you’re hiring R&D scientists, better to have gamblers than worrywarts.
    * Any org has to have both stars and guardians, carefully balanced. When the guardians become too powerful, innovation grinds to a bureaucratic halt. When stars hold sway, sooner or later we end up with something like the financial crisis.

    In the end, most orgs find a middle ground. They cordon off some part of the organization, call it skunkworks, and put some checks and balances into unbridled innovation, stifling some creativity and initiative but ensuring that things don’t get out of hand.

    http://www.amazon.com/Org-Underlying-Logic-Office/dp/0446571598/ref=sr_1_1?s=books&ie=UTF8&qid=1375453476&sr=1-1&keywords=the+org%3A

  • Oscar Mier

    Excuse me: how do you know that the ROC curve is, necessarily, logarithmic?

  • Oscar Mier

    All important discoveries are true positives, but not all true positives are important discoveries. The label for the ordinates seems inappropriate.