On the scalability of statistical procedures: why the p-value bashers just don't get it.

Executive Summary

  1. The problem is not p-values; it is a fundamental shortage of data analytic skill.
  2. In general, it makes sense to reduce researcher degrees of freedom for non-experts, but any choice of statistic, when used by many untrained people, will be flawed.
  3. The long-term solution is to require training in both statistics and data analysis for anyone who uses data, but particularly for journal editors, reviewers, and scientists in molecular biology, medicine, physics, economics, and astronomy.
  4. The Johns Hopkins Specialization in Data Science runs every month and can be easily integrated into any program. Other, more specialized, online courses and short courses make it possible to round this training out in ways that are appropriate for each discipline.

Scalability of Statistical Procedures

The P-value is in the news again. Nature came out with a piece arguing, among other things, that scientists are naive about the use of P-values. P-values have known flaws which have been regularly discussed; if you want to see some criticisms, just Google "NHST". Despite their flaws, from a practical perspective it is an oversimplification to point to the use of P-values as the critical flaw in scientific practice. The problem is not that people use P-values poorly; it is that the vast majority of data analysis is not performed by people properly trained to perform data analysis.

Data are now abundant in nearly every discipline, from astrophysics to biology to the social sciences, and even in qualitative disciplines like literature. By scientific standards, the growth of data came on at a breakneck pace. Over a period of about 40 years we went from data measured in bytes to data measured in terabytes in almost every discipline. Training programs haven't adapted to this new era. This is particularly true in genomics, where within one generation we went from a data poor environment to a data rich environment. Many of the people in positions of authority were trained before data were widely available and used.

The result is that the vast majority of people performing statistical and data analysis have only one or two statistics classes and little formal data analytic training under their belts. Many of these scientists would happily work with a statistician, but as any applied statistician at a research university will tell you, it is impossible to keep up with the demand from our scientific colleagues. Everyone is collecting major data sets or analyzing public data sets; there just aren't enough hours in the day.

Since most people performing data analysis are not statisticians, there is a lot of room for error in the application of statistical methods. This error is magnified enormously when naive analysts are given too many "researcher degrees of freedom". If a naive analyst can pick from a range of methods and does not understand how they work, they will generally pick the one that gives the most favorable result.
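
To see why this matters, here is a minimal simulation sketch (my own illustration, assuming numpy and scipy are available): the data contain no real effect, but an analyst who tries several reasonable-looking tests and keeps the smallest p-value will cross the 0.05 threshold noticeably more often than the nominal 5% of the time.

```python
# Sketch: "researcher degrees of freedom" inflate the false positive
# rate. Both groups are pure noise; the naive analyst tries several
# tests and keeps whichever analysis "worked".
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n, hits = 1000, 30, 0
for _ in range(n_sims):
    x = rng.normal(size=n)                       # group 1: no effect
    y = rng.normal(size=n)                       # group 2: no effect
    pvals = [
        stats.ttest_ind(x, y).pvalue,                    # pooled t-test
        stats.ttest_ind(x, y, equal_var=False).pvalue,   # Welch t-test
        stats.mannwhitneyu(x, y).pvalue,                 # rank-based test
        stats.ttest_ind(x[:20], y[:20]).pvalue,          # "cleaner" subset
    ]
    hits += min(pvals) < 0.05                    # cherry-pick the best
print(f"False positive rate with cherry-picking: {hits / n_sims:.2f}")
# Noticeably above the nominal 0.05, even with only four choices.
```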

The short-term solution is to find a balance between researcher degrees of freedom and "recipe book" style approaches that require a specific method to be applied. In general, for naive analysts, it makes sense to lean toward less flexible methods that have been shown to work across a range of settings. The key idea here is to evaluate methods in the hands of naive users and see which ones work best most frequently, an idea we have previously called "evidence based data analysis".

An incredible success story of evidence based data analysis in genomics is the use of the limma package for differential expression analysis of microarray data. Limma can be beaten in certain specific scenarios, but it is robust across such a wide range of study designs, sample sizes, and data types that the choice to use something other than limma should only be exercised by experts.
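
For intuition about why limma works so well with small samples, here is a toy Python sketch of the empirical Bayes variance shrinkage behind its moderated t-statistic (Smyth, 2004). To be clear, this is my own stripped-down illustration: real limma estimates the prior degrees of freedom d0 and prior variance s0_sq from all genes, while here they are assumed constants.

```python
# Toy sketch of limma-style variance moderation: each gene's noisy
# variance estimate is shrunk toward a prior before forming a t-stat.
import numpy as np

def moderated_t(effect, s2, df, v, d0=4.0, s0_sq=0.05):
    """effect: per-gene mean difference; s2: per-gene sample variance;
    df: residual degrees of freedom; v: unscaled variance of the
    effect (1/n1 + 1/n2 for a two-group comparison)."""
    s2_tilde = (d0 * s0_sq + df * s2) / (d0 + df)  # shrink toward prior
    return effect / np.sqrt(v * s2_tilde)          # ~ t with df + d0 dof

# 1,000 "genes", 3 samples per group (df = 4): per-gene variance
# estimates are very noisy, so the shrinkage stabilizes the denominator.
rng = np.random.default_rng(0)
effect = rng.normal(0.0, 0.3, size=1000)
s2 = 0.05 * rng.chisquare(df=4, size=1000) / 4     # noisy variance estimates
print(moderated_t(effect, s2, df=4, v=2 / 3)[:5])
```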

The trouble with criticizing p-values without an alternative

P-values are an obvious target of wrath for people who don't do day-to-day statistical analysis because the P-value is the most successful statistical procedure ever invented. If every person who used a P-value cited the inventor, P-values would have, very conservatively, 3 million citations. That's an insane amount of use for one statistic.

Why would such a terrible statistic be used by so many people? The reason is that it is critical that we have some measure of uncertainty we can assign to data analytic results. Without such a measure, the only way to determine whether results are real is to rely on people's intuition, which is notoriously unreliable where uncertainty is involved. It is pretty clear science would be much worse off if we decided whether results were reliable based on people's gut feelings about the data.

P-values can be and are misinterpreted, misused, and abused, both by naive analysts and by statisticians. Sometimes these problems are due to statistical naiveté, sometimes they are due to wishful thinking and career pressure, and sometimes they are malicious. The underlying reason is that P-values are complicated and require training to understand.

Critics of the P-value argue in favor of a range of procedures to be used in place of P-values. But when you consider the scale at which methods must be used to meet the demands of the current data rich world, many of these alternatives would result in similar flaws. This in no way proves that the use of P-values is a good idea, but it does show that coming up with an alternative is hard. Here are a few potential alternatives.

  1. Methods should only be chosen and applied by true data analytic experts. Pros: This is the best case scenario. Cons: Impossible to implement broadly given the level of statistical and data analytic expertise in the community.
  2. The full prior, likelihood, and posterior should be detailed, and a complete sensitivity analysis should be performed. Pros: In cases where this can be done, it provides much more information about the model and the uncertainty being considered. Cons: The model requires more advanced statistical expertise, is computationally much more demanding, and cannot be applied in problems where model based approaches have not been developed. Yes/no decisions about the credibility of results still come down to picking a threshold or allowing more researcher degrees of freedom.
  3. A direct Bayesian approach should be used, reporting credible intervals and Bayes estimators. Pros: In cases where the model can be fit, it can be used by non-experts and provides interpretable measures of uncertainty analogous to confidence intervals. Cons: The prior allows a large number of degrees of freedom when not used by an expert, sensitivity analysis is required to determine the effect of the prior, many more complex models cannot be implemented, and results are still sample size dependent.
  4. Replace P-values with likelihood ratios. Pros: In cases where it is available, the likelihood ratio may reduce some of the conceptual difficulty with the null hypothesis. Cons: Likelihood ratios can usually only be computed exactly for cases with few or no nuisance parameters, they run into trouble for complex alternatives, they are still sample size dependent, and a likelihood ratio threshold is equivalent to a p-value threshold in many cases.
  5. We should use confidence intervals exclusively in place of p-values. Pros: A measure of effect and its variability on the scale of interest will be reported, and we can evaluate effect sizes on a scientific scale. Cons: Confidence intervals are still sample size dependent and can be misleading for large samples, significance levels can be chosen to make intervals artificially wide/narrow, and if used as a decision making tool there is a one-to-one mapping between a confidence interval and a p-value threshold (see the sketch after this list).
  6. We should use Bayes factors instead of p-values. Pros: They can compare the evidence (loosely defined) for both the null and the alternative, and they can incorporate prior information. Cons: Priors provide researcher degrees of freedom, cutoffs may still lead to false positives and false negatives, and Bayes factors still depend on sample size.

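To make the one-to-one mapping in option 5 concrete, here is a small simulation sketch (my own illustration, assuming numpy and scipy): a 95% t-interval excludes zero exactly when the two-sided one-sample t-test gives p < 0.05.

```python
# Sketch: a 95% CI excludes 0 exactly when the t-test gives p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
for _ in range(5):
    x = rng.normal(0.3, 1.0, size=25)              # small true effect
    t, p = stats.ttest_1samp(x, popmean=0.0)
    half = stats.t.ppf(0.975, df=len(x) - 1) * stats.sem(x)
    lo, hi = x.mean() - half, x.mean() + half      # 95% t-interval
    print(f"p = {p:.3f}  CI = ({lo:.2f}, {hi:.2f})  "
          f"excludes 0: {lo > 0 or hi < 0}  p < 0.05: {p < 0.05}")
# The last two columns always agree: same decision, different packaging.
```
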
This is not to deny that many of these methods have advantages over P-values. But at scale, any of these methods will be prone to abuse, misinterpretation, and error. For example, none of them deals with multiple testing by default. Reducing researcher degrees of freedom is good when dealing with a lack of training, but no choice of statistic eliminates the potential for mistakes, and all of these methods would be ferociously criticized if they were used as frequently as p-values.
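
On the multiple testing point, a quick sketch (again my own illustration): screening many true-null hypotheses at a fixed per-test cutoff guarantees false positives no matter which statistic is reported, which is why corrections like Bonferroni or false discovery rate methods have to be layered on top of any of the options above.

```python
# Sketch: 1,000 true-null tests screened at p < 0.05 yield ~50 false
# positives; a Bonferroni correction removes essentially all of them.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
m = 1000
pvals = np.array([
    stats.ttest_ind(rng.normal(size=20), rng.normal(size=20)).pvalue
    for _ in range(m)
])                                                    # every null is true
print("uncorrected hits:", np.sum(pvals < 0.05))      # around 50
print("Bonferroni hits: ", np.sum(pvals < 0.05 / m))  # around 0
```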

The difference between data analysis and statistics

Many disciplines, including medicine and molecular biology, usually require an introductory statistics or machine learning class during their training programs. This is a great start, but it is not sufficient for the modern data saturated era. The introductory statistics or machine learning class is enough to teach someone the language of data analysis, but not how to use it. For example, you learn about the t-statistic and how to calculate it. You may also learn the asymptotic properties of the statistic. But you rarely learn what happens to the t-statistic when there is an unmeasured confounder. You also don't learn how to handle non-iid data, sample mixups, reproducibility, scripting, and so on.
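
As one concrete illustration of the confounding point (a simulation of my own, not from any textbook): when an unmeasured variable drives both group membership and the outcome, the t-test reports a wildly significant "treatment effect" even though the treatment does nothing.

```python
# Sketch: an unmeasured confounder drives both group assignment and
# the outcome; the t-test "finds" an effect of a treatment that does
# nothing at all.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 200
confounder = rng.normal(size=n)                    # never measured
treated = rng.random(n) < 1 / (1 + np.exp(-2 * confounder))
outcome = 1.5 * confounder + rng.normal(size=n)    # no treatment term
t, p = stats.ttest_ind(outcome[treated], outcome[~treated])
print(f"t = {t:.1f}, p = {p:.2g}")                 # tiny p, entirely spurious
```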

It is therefore critical that anyone who plans to use or understand data analysis take both the introductory course and at least one data analysis course. The data analysis course should cover study design, general data analytic reasoning, non-iid data, biased sampling, basics of non-parametrics, training vs. test sets, prediction error, sources of likely problems in data sets (like sample mixups), and reproducibility. These are the concepts that appear regularly when analyzing real data but don't usually appear in the first statistics course that most medical and molecular biology professionals see. There are awesome statistical educators who are trying hard to bring more of this into the introductory stats world, but it is just too much to cram into one class.

What should we do?

The most frustrating thing about the frequent and loud criticisms of P-values is that they usually point out what is wrong with P-values but don't suggest what we should do about it. When they do make suggestions, they frequently ignore the fundamental problems:

  1. Statistics are complicated and require careful training to understand properly. This is true regardless of the choice of statistic, philosophy, or algorithm.
  2. Data are incredibly abundant in all disciplines and show no sign of slowing down.
  3. There is a fundamental shortage of training in statistics and data analysis.
  4. Giving untrained analysts extra researcher degrees of freedom is dangerous.

The most direct solution to this problem is increased training in statistics and data analysis. Every major or degree program in a discipline that regularly analyzes data (molecular biology, medicine, finance, economics, astrophysics, etc.) should require, at minimum, an introductory statistics class and a data analysis class. If the expertise to create these sorts of courses doesn't exist locally, there are options. For example, we have introduced a series of 9 courses that run every month and cover most of the basic topics common across disciplines.

http://jhudatascience.org/

https://www.coursera.org/specialization/jhudatascience/1

Of particular interest, given the NIH Director's recent comments on reproducibility, is our course on Reproducible Research. There are also many more specialized resources, very good and widely available, that build on the base we created with the data science specialization.

  1. For scientific software engineering/reproducibility: Software Carpentry.
  2. For data analysis in genomics: Rafa's Data Analysis for Genomics Class.
  3. For Python and computing: The Fundamentals of Computing Specialization.

Enforcing education and practice in data analysis is the only way to resolve the problems that people usually attribute to P-values. In the short term, we should at minimum require all the editors of journals who regularly handle data analysis to show competency in statistics and data analysis.

Correction: After seeing Katie K.'s comment on Facebook I concur that P-values were not directly referred to as "worse than useless", so to more fairly represent the article, I have deleted that sentence.

  • James

    This is not about "untrained" people using p-values. You misuse them as well, you and many other very intelligent people. If every time someone shows that scientists and statisticians are getting it wrong we blame it on "poorly" trained scientists and statisticians, we will take a long time to correct this.

    • http://www.gwern.net/ gwern

      I'm struck by how much OP reminds me of old defenses of C/manual-memory-management.

      Due to historical limitations on computers, manual memory management was necessary to run any programs at all, and GC was confined to limited marginal domains, often mocked by practical realistic programmers, who would chortle over how 'Lisp programmers know the value of everything and the cost of nothing'. The historical predominance of manual memory management was reflected in the available languages and concepts.

      Proponents of garbage collection could point to the greater theoretical integrity of GCed languages, the greater correctness of written programs, and observe how year after year, without fail, programs written in C would suffer buffer overflows, bugs due to manual memory management, and security flaws whose collective costs were best measured in billions of dollars; and the manual memory management fans would say that all the problems were due to bad or poorly-educated programmers and if you were bothered by all your OSes and programs being regularly remote rootable that you should just get better programs, that it let you write faster programs, that GC wasn't appropriate for all circumstances, and so on.

      GC techniques improved; computers got faster; the costs of errors to society went up.

      I think we all know how *that* debate turned out.

  • 3rdMoment

    Could you say more about why option 5 (confidence intervals) is a bad one? The cons don't really seem like cons to me. Why is it bad that CIs are one-to-one with p-values? (I agree, they are another way of expressing the same thing, but one that highlights rather than obscures what is usually important.) And why are CIs misleading in large samples? (If anything, the main problems I see come in very small samples in cases where you don't have a pivotal statistic, so you can't really form a proper CI.)

    I agree with a lot of your points: any statistical procedure or concept can and will be misused by the ignorant masses of researchers (not to mention journalists and the public when they try to interpret the research.)

    But I also agree with critics that, in the present circumstances, *by far* the most common errors are those associated with NHST, including misuse of p-values. So while some of the criticisms are overstated, I think it makes sense to call people's attention to these most-common errors, even if we know that will not eliminate errors and may even lead to some different classes of errors.

    • John Muschelli

      I think in some respects the one-to-one correspondence isn't a bad thing, but if you're still using thresholds (inclusion of 0/1/etc.) for decision making, using confidence intervals over p-values will lead to the same conclusion for the same α level. That said, CIs let you see whether an interval overlaps the value under the null hypothesis while also showing you the size of the effect, which is a good thing. So if you're using CIs with a testing idea in mind and checking overlap at a certain level, you're implicitly using p-values, so there is no "pro" in that.

  • James

    "The reason is that it is critical that we have some measure of uncertainty ".

    P-values do not measure uncertainty about a hypothesis; they measure the unlikeliness of the results assuming that the hypothesis is true.

    If H0 is "a mutant pink monkey created this data like this", the p-value is 100%, but it does not change the uncertainty about H0 at all (which is: we are sure it is false).

    • Neo Chung

      Constructing a correct and sensible null hypothesis (that allows you to measure "uncertainty of your interest") is exactly what a good course on statistics should teach. Of course, no amount of philosophy (Bayesian or otherwise) could prevent one's intention to abuse statistical methods.

  • Entsophy

    There's no hope of progress here. Every time a theoretical or practical failure of p-values is found, Frequentists never blame their philosophy. They just claim (basically without evidence) that the source of the problem is the insufficient devotion to frequentism on the part of the user. In their minds if we were just a little more devoted to their philosophy we'd reach that happy point where "small p-values"="highly reliable".
    It reminds me of Communists who would deny that communism had ever failed, even when it was collapsing around them, and would just claim that "real communism hasn't been tried yet".

  • Vincent Granville

    I don't use P-values. Clients and colleagues don't understand what they mean; they're too difficult to explain. I created my own metrics to communicate with them:

    - Confidence intervals without stats: http://www.analyticbridge.com/profiles/blogs/how-to-build-simple-accurate-data-driven-model-free-confidence-in

    - Predictive power: http://www.datasciencecentral.com/profiles/blogs/feature-selection-based-on-predictive-power

  • Paul Lawrence Hayes

    “Why would such a terrible statistic be used by so many people?”

    For many of the same reasons homeopathy is used by so many people.

    One of which being that it's defended as vigorously and plausibly, but fallaciously, by homeopaths as statisticians defend their indefensibles.

  • Rasmus Arnling Bååth

    Why can't p-values and a fundamental shortage of data analytic skill both be problems? P-values are bad because (among other things) they make people focus on whether there is a difference (which is not the interesting question 95% of the time) rather than on how large the difference is. No amount of education can fix this problem with p-values, right? Decent alternatives are standard errors and bootstrap procedures, which both approximate credible intervals under many circumstances. If p-values were not taught in beginning statistics courses, there would be more time left for learning data analytic skills :)

  • XTC

    While I agree that statistical literacy should be a goal for nearly all academic disciplines, that's a much harder objective than for stats programs to take on the challenge of teaching jargon-ridden and glossolalic statisticians to communicate with the nontechnical. It's sort of like teaching medical residents bedside manner, not to mention that it's a whole lot easier to implement.

  • Deborah Mayo

    So-called NHST, wherein a statistically significant result entitles moving to an alternative hypothesis that "explains" or entails the stat sig result, is a made-up animal lampooned by every one of the founders of significance testing. (On the other hand, statistically affirming the consequent will give you a Bayes boost.) But the general error-statistical reasoning underlying the use of p-values is essential for reliable statistical inquiry. The abusers are guilty of questionable science. It's interesting to see how error statistical tools are appealed to even by those who claim to reject their rationale when they wish to test models or, for that matter, to chime in with other people's fraud busting.

    http://errorstatistics.com/2013/06/14/p-values-cant-be-trusted-except-when-used-to-argue-that-p-values-cant-be-trusted/

  • Rae

    If the method is flawed, no matter how well you are trained at using it, your results will not answer the question that you are seeking. [I just hope that by learning more about what the p-value really means, people will eventually realize how stupid it is to still use it, despite the decades of justified criticism]

  • Jean Wu

    If one has a problem with people misunderstanding p-values, or abusing p-values, they should not blame the p-value itself but those who misinterpret it. ---- Hmm, do I sound like the National Rifle Association talking about guns? Maybe we should have a statistical background check before anybody uses or talks about p-values!

  • Jay

    This is an excellent article overall.

    However, this sentence does not suit you:

    " If every person who used a P-value cited the inventor, P-values would have, very conservatively, 3 million citations. That's an insane amount of use for one statistic."

    This is an argument from popularity, "argumentum ad populum" (because it sounds more formal in Latin, I guess).

    You ask: "Why would such a terrible statistic be used by so many people? "

    And answer thusly: "The reason is that it is critical that we have some measure of uncertainty we can assign to data analytic results. "

    Let me offer an alternative hypothesis, in your own words: " The problem is [...] that the vast majority of data analysis is not performed by people properly trained to perform data analysis. "

    The reason so many people use the p-value is because it's what is taught in the first and only statistics course they ever took and they don't know anything better.

    It's very good that Nature is pointing out the limitations. P-values are useful, but misconceptions abound. Nature is doing their part to clear some of these misconceptions up. They are not "bashing" it.

    Loved the analysis of alternatives.

  • Jean Wu

    BTW, a student asked me last week what's so special about 0.05. I told them it's because we have 5 fingers on each hand, and 10 fingers total, hence 5/10^2. According to this theory, in the world of The Simpsons the classical significance level would be 4/8^2 = 0.0625.

  • Chris Harvey

    The use of NHST as a means to extrapolate from a small sample to the world at large is, at a minimum, contentious and surely unresolved.

    If researcher degrees of freedom can be accounted for, and claims made from NHST can actually be tested, in the world, i.e. clinical trials, then they are potentially falsifiable and should potentially be considered as statements of scientific import. If such claims made from NHST are made in an environment where researcher degrees of freedom are uncontrollable, not theoretically but in reality, then the resulting scientific claims are most likely unfalsifiable. This is problematic. I don't have the answer, clearly, but I thought to chime in because the notion that we need to just be more rigorous with our application of p-values seems to miss the point. But I often miss the point, so it could just be me.

    The following is a light-hearted read on conflicting interpretations of probability.
    http://www.amazon.com/Interpreting-Probability-Controversies-Developments-Twentieth/dp/0521812518

  • Jari Niemi

    > The problem is not p-values; it is a fundamental shortage of data analytic skill.

    “Very few statisticians have been studying information theory, the result of which, I think, is the disarray of the present discipline of statistics.” Jorma Rissanen (page 2) (quote from here: http://www.amazon.com/gp/product/1107004748/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=1107004748&linkCode=as2&tag=chrprobboo-20 )

    > Since most people performing data analysis are not statisticians, there is a lot of room for error in the application of statistical methods.

    One may perform data analysis without using statistics or being a statistician. Performing data analysis requires different methods from different science fields.
    Data analysis IS NOT statistics!

    > The result is that the vast majority of people performing statistical and data analysis are people with only one or two statistics classes and little formal data analytic training under their belt.

    Again, DATA ANALYSIS does not mean using statistics! One can do perfectly good data analysis without doing any statistics! This is not "my way or the highway"!

    For example, while analysing NGS data one needs to align the reads to the genome, and the most used method is the Burrows-Wheeler transform (which comes from information theory and compression methods).

    Also, there are newer and better methods out there for data analysis which successfully replace the need for p-values (e.g. the NML principle; see Rissanen's work)!

  • http://www.twentylys.com/ TwentyLYS

    If one works in commercial industry, the p-value is probably the mother of all solutions and the only solution:

    Here are some of the reasons:
    1. You have to finish analyzing this data in a couple of days no matter what. Before even plotting the data, run the stepwise regression and get the p-value. There will be a lot of discussion about forward versus backward, though!
    2. Your boss probably knows only the p-value.
    3. The client is waiting to ask the same "what's the p-value"; is it less than 0.05? Boom, significant!
    4. You have another project where you'll do the same.

    The list goes on, with no stopping anywhere in sight.

    As a statistician, I see there is a massive lack of understanding about why proper data analysis needs time. Period!