Simply Statistics: P-values and hypothesis testing get a bad rap

This post written by Jeff Leek and Rafa Irizarry.

The p-value is the most widely-known statistic. P-values are reported in a large majority of scientific publications that measure and report data. R.A. Fisher is widely credited with inventing the p-value. If he was cited every time a p-value was reported his paper would have, at the very least, 3 million citations* - making it the most highly cited paper of all time.

However, the p-value has a large number of very vocal critics. The criticisms of p-values, and hypothesis testing more generally, range from philosophical to practical. There are even entire websites dedicated to “debunking” p-values! One issue many statisticians raise with p-values is that they are easily misinterpreted, another is that p-values are not calibrated by sample size, another is that it ignores existing information or knowledge about the parameter in question, and yet another is that very significant (small) p-values may result even when the value of the parameter of interest is scientifically uninteresting.

We agree with all these criticisms. Yet, in practice, we find p-values useful and, if used correctly, a powerful tool for the advancement of science. The fact that many misinterpret the p-value is not the p-value’s fault. If the statement “under the null the chance of observing something this convincing is 0.65” is correct, then why not use it? Why not explain to our collaborator that the observation they thought was so convincing can easily happen by chance in a setting that is uninteresting. In cases where p-values are small enough then the substantive experts can help decide if the parameter of interest is scientifically interesting. In general, we find p-value to be superior to our collaborators intuition of what patterns are statistically interesting and which ones are not.

We also find p-values provide a simple way to construct decision algorithms. For example, a government agency can define general rules based on p-values that are applied equally to products needing a specific seal of approval. If the rule proves to be to lenient or restrictive, we change the p-value cut-off appropriately. In this situation we view the p-value as part of a practical protocol, not a tool for statistical inference.

Moreover the p-value has the following useful properties for applied statisticians:

p-values are easy to calculate, even for complicated statistics. Many statistics do not lend themselves to easy analytic calculation; but using permutation and bootstrap procedures p-values can be calculated even for very complicated statistics.
p-values are relatively easy to understand. The statistical interpretation of the p-value remains roughly the same no matter how complicated the underlying statistic and they also bounded between 0 and 1. This also means that p-values are easy to mis-interpret - they are not posterior probabilities. But this is a difficulty with education, not a difficulty with the statistic itself.
p-values have simple, universal properties Correct p-values are uniformly distributed under the null, regardless of how complicated the underlying statistic.
p-values are calibrated to error rates scientists care about Regardless of the underlying statistic, calling all P-values less than 0.05 significant leads to on average about 5% false positives even if the null hypothesis is always true. If this property is ignored things like publication bias can result, but again this is a problem with education and the scientific process, not with p-values.
p-values are useful for multiple testing correction. The advent of new measurement technology has shifted much of science from hypothesis driven to discovery driven making the existing multiple testing machinery useful. Using the simple, universal properties of p-values it is possible to easily calculate estimates of quantities like the false discovery rate - the rate at which discovered associations are false.
p-values are reproducible. All statistics are reproducible with enough information. Given the simplicity of calculating p-values, it is relatively easy to communicate sufficient information to reproduce them.

We agree there are flaws with p-values, just like there are with any statistic one might choose to calculate. In particular, we do think that confidence intervals should be reported with p-values when possible. But we believe that any other decision-making statistic would lead to other problems. One thing we are sure about is that p-values beat scientists’ intuition about chance any day. So before bashing p-values too much we should be careful because, like democracy to government, p-values may be the worst form of statistical significance calculation except all those other forms that have been tried from time to time.

————————————————————————————————————

* Calculated using Google Scholar using the formula:

Number of P-value Citations = # of papers with exact phrase “P < 0.05” + (# of papers with exact phrase “P < 0.01” and not exact phrase “P < 0.05”) + (# of papers with exact phrase “P < 0.001” and not exact phrase “P < 0.05” or “P < 0.001”)

= 1,320,000 + 1,030,000 + 662,500

This is obviously an extremely conservative estimate.

P-values and hypothesis testing get a bad rap - but we sometimes find them useful.