A few years ago I helped write a paper where we proposed scraping p-values from the medical literature to try to estimate the science-wise false discovery rate. The paper generated a ton of interesting discussion and inspired other groups to start collecting p-values from the literature.
As I’ve mentioned before, the p-value is the most popular statistic ever invented, so there are a lot of published p-values out there.
The tidypvals package is an effort to find previous collections of published p-values, synthesize them, and tidy them into one analyzable data set. The currently available p-value data sets in this package are:
jager2014 - This data set comes from the paper An estimate of the science-wise false discovery rate and application to the top medical literature, which first proposed scraping p-values from the medical literature for re-analysis.
brodeur2016 - This data set comes from the paper Star Wars: The empirics strike back, which collected p-values from the economics literature.
head2015 - This data set comes from the paper The Extent and Consequences of P-Hacking in Science and is an extension of the jager2014 idea to a much larger collection of biological papers.
chavalarias2016 - This data set comes from the paper Evolution of Reporting P Values in the Biomedical Literature, 1990-2015 and is an extension of the jager2014 idea to a much larger collection of medical papers.
allp - This data set merges head2015, chavalarias2016, and brodeur2016 while removing duplicates. To see how it is created, view the merging vignette.

Each data set is a “tidy” data frame and has the following columns:
pvalue - The reported p-value.
year - The year of the publication where the p-value appeared.
journal - The journal where the publication appeared.
field - The field of the paper, using the categorization in Head et al. 2015.
abstract - Whether the p-value was in the abstract of the paper.
operator - Whether the p-value was reported as “lessthan”, “greaterthan”, or “equals”.
doi - The digital object identifier, when available.
pmid - The PubMed ID for the paper, when available.

Currently the package is only available from GitHub, but once I figure out the ExperimentHub package from Bioconductor I hope to move the package there. For now you can install it with:
install.packages('devtools')
library(devtools)
devtools::install_github('jtleek/tidypvals')

Then you can load the library and access each data set by name.
library(tidypvals)
jager2014

Data sets can be easily merged, but be careful to avoid duplicated p-values across different data sets. You can see how each data set was obtained and tidied by viewing the corresponding vignette.
vignette("jager-2014", package = "tidypvals")

One purpose of tidying these data is to be able to do cross-study analysis of p-values in the literature. As a teaser for things coming soon, this plot represents more than 2.5 million p-values across 25 different fields. Notice anything funny?
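
To make the earlier caution about duplicated p-values concrete, here is a minimal sketch of stacking two data sets and dropping repeats. It is not the procedure from the merging vignette. The two data frames are made-up toy stand-ins that only borrow the pvalue, operator, and pmid column names, and treating rows that match on all three columns as duplicates is my own assumption for illustration.

```r
# Toy stand-ins for two tidypvals data sets (made-up values, not real data).
a <- data.frame(
  pvalue   = c(0.01, 0.04, 0.20),
  operator = c("lessthan", "equals", "equals"),
  pmid     = c("111", "222", "333"),
  stringsAsFactors = FALSE
)
b <- data.frame(
  pvalue   = c(0.04, 0.03),
  operator = c("equals", "equals"),
  pmid     = c("222", "444"),
  stringsAsFactors = FALSE
)

# Stack the two data sets into one data frame.
stacked <- rbind(a, b)

# Assumption: rows sharing pmid, pvalue, and operator are the same p-value
# collected twice, so keep only the first occurrence of each combination.
key <- c("pmid", "pvalue", "operator")
merged <- stacked[!duplicated(stacked[, key]), ]

nrow(stacked)
nrow(merged)
```

A key like this is a blunt instrument: a paper that really does report the same p-value twice would lose one of them, so check the result against the vignettes before relying on it.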