This post was written by Jeff Leek and Rafa Irizarry.
The p-value is the most widely known statistic. P-values are reported in a large majority of scientific publications that measure and report data. R.A. Fisher is widely credited with inventing the p-value. If he were cited every time a p-value was reported, his paper would have, at the very least, 3 million citations*, making it the most highly cited paper of all time.
However, the p-value has a large number of very vocal critics. The criticisms of p-values, and hypothesis testing more generally, range from philosophical to practical. There are even entire websites dedicated to “debunking” p-values! One issue many statisticians raise with p-values is that they are easily misinterpreted; another is that p-values are not calibrated by sample size; another is that they ignore existing information or knowledge about the parameter in question; and yet another is that very significant (small) p-values may result even when the value of the parameter of interest is scientifically uninteresting.
We agree with all these criticisms. Yet, in practice, we find p-values useful and, if used correctly, a powerful tool for the advancement of science. The fact that many people misinterpret the p-value is not the p-value’s fault. If the statement “under the null the chance of observing something this convincing is 0.65” is correct, then why not use it? Why not explain to our collaborator that the observation they thought was so convincing can easily happen by chance in a setting that is uninteresting? In cases where p-values are small enough, the substantive experts can help decide whether the parameter of interest is scientifically interesting. In general, we find p-values to be superior to our collaborators’ intuition about which patterns are statistically interesting and which are not.
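To make the quoted statement concrete, here is a minimal sketch (the data and test are made up for illustration, not from the post) of how a p-value like 0.65 arises: simulate the null distribution of a test statistic and ask how often something at least as extreme as the observed value shows up by chance.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical measurements from two small groups that a collaborator
# finds "convincing" evidence of a difference.
group_a = np.array([5.1, 4.8, 5.6, 5.0, 5.3])
group_b = np.array([5.4, 5.2, 4.9, 5.7, 5.1])
observed_diff = abs(group_a.mean() - group_b.mean())

# Permutation test: under the null the group labels are exchangeable,
# so shuffle the labels many times and recompute the difference.
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
null_diffs = []
for _ in range(10_000):
    perm = rng.permutation(pooled)
    null_diffs.append(abs(perm[:n_a].mean() - perm[n_a:].mean()))

# The p-value: the chance, under the null, of a difference at least this
# large. A value around 0.65 means the pattern is easy to get by chance.
p_value = np.mean(np.array(null_diffs) >= observed_diff)
print(f"observed difference: {observed_diff:.3f}, p-value: {p_value:.2f}")
```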
We also find that p-values provide a simple way to construct decision algorithms. For example, a government agency can define general rules based on p-values that are applied equally to all products needing a specific seal of approval. If the rule proves to be too lenient or too restrictive, we change the p-value cut-off appropriately. In this situation we view the p-value as part of a practical protocol, not as a tool for statistical inference.
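As an illustration of such a protocol (the cut-off and the approval function below are hypothetical, not anything a real agency uses), the decision rule can be as simple as comparing a submitted p-value to an agreed threshold that the agency can tune over time:

```python
# Hypothetical approval protocol: grant the seal of approval if the
# submitted p-value falls below the agency's current cut-off. The cut-off
# is a policy knob, not an inferential claim, and can be tightened or
# loosened if the rule proves too lenient or too restrictive.
APPROVAL_CUTOFF = 0.05  # illustrative value only

def approve(p_value: float, cutoff: float = APPROVAL_CUTOFF) -> bool:
    """Return True if the submission meets the evidence bar for approval."""
    return p_value < cutoff

submissions = {"product_a": 0.003, "product_b": 0.21, "product_c": 0.048}
for name, p in submissions.items():
    print(name, "approved" if approve(p) else "rejected")
```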
Moreover, the p-value has the following useful properties for applied statisticians:
We agree that there are flaws with p-values, just as there are with any statistic one might choose to calculate. In particular, we do think that confidence intervals should be reported along with p-values when possible. But we believe that any other decision-making statistic would lead to other problems. One thing we are sure about is that p-values beat scientists’ intuition about chance any day. So before bashing p-values too much, we should be careful because, like democracy among forms of government, p-values may be the worst form of statistical significance calculation except for all those other forms that have been tried from time to time.
————————————————————————————————————
* Calculated with Google Scholar using the formula:
Number of P-value Citations = # of papers with exact phrase “P < 0.05” + (# of papers with exact phrase “P < 0.01” and not exact phrase “P < 0.05”) + (# of papers with exact phrase “P < 0.001” and not exact phrase “P < 0.05” or “P < 0.01”)
= 1,320,000 + 1,030,000 + 662,500 = 3,012,500
This is obviously an extremely conservative estimate.
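A quick check of the arithmetic (the three counts are the ones reported in the footnote above; only the addition is ours):

```python
# Google Scholar hit counts reported above; the search phrases are nested
# exclusions, so each paper is counted at most once.
counts = {
    "P < 0.05": 1_320_000,
    "P < 0.01, excluding P < 0.05": 1_030_000,
    "P < 0.001, excluding P < 0.05 and P < 0.01": 662_500,
}
total = sum(counts.values())
print(f"Conservative estimate of p-value citations: {total:,}")  # 3,012,500
```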