Announcing the tidypvals package

A few years ago I helped write a paper where we proposed scraping p-values from the medical literature to try to estimate the science-wise false discovery rate. The paper generated a ton of interesting discussion and inspired other groups to start collecting p-values from the literature.

As I’ve mentioned before, the p-value is the most popular statistic ever invented, so there are a lot of published p-values out there.

The tidypvals package is an effort to find previous collections of published p-values, synthesize them, and tidy them into one analyzable data set. The currently available p-value data sets in this package are:

Each data set is a “tidy” data frame and has the following columns:

• pvalue - The reported p-value
• year - The year of the publication where the p-value appeared
• journal - The journal where the publication appeared
• field - The field of the paper, using the categorization in Head et al. 2015.
• abstract - Whether the p-value was in the abstract of the paper
• operator - Whether the p-value was reported as “lessthan”, “greaterthan”, or “equals”.
• doi - The digital object identifier for the paper, when available
• pmid - The PubMed ID for the paper, when available
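To make the column layout concrete, here is a small sketch in base R. The `toy` data frame and every value in it are made up for illustration and are not taken from the package. Storing `abstract` as a logical is also an assumption, since the real data sets may encode it differently. The sketch shows why the operator column matters: a p-value reported as "lessthan" 0.05 is only an upper bound, so you will often want to handle exactly reported values separately.

```r
# Hypothetical rows that mimic the tidypvals column layout (not real data).
toy <- data.frame(
  pvalue   = c(0.001, 0.03, 0.049, 0.05, 0.30),
  year     = c(2005, 2008, 2010, 2010, 2013),
  journal  = c("Journal A", "Journal B", "Journal B", "Journal C", "Journal A"),
  field    = c("Field X", "Field X", "Field Y", "Field Y", "Field X"),
  abstract = c(TRUE, TRUE, FALSE, FALSE, TRUE),   # assumed to be logical
  operator = c("lessthan", "equals", "equals", "lessthan", "equals"),
  doi      = NA_character_,                        # missing when unavailable
  pmid     = NA_character_,
  stringsAsFactors = FALSE
)

# How were the p-values reported?
operator_counts <- table(toy$operator)

# Keep only exactly reported p-values; "lessthan" values are just bounds.
exact <- toy[toy$operator == "equals", ]

# Share of exactly reported p-values that fall below 0.05.
share_below <- mean(exact$pvalue < 0.05)
```

The same subsetting should carry over to any of the package's data sets, provided the columns are encoded the way this sketch assumes.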

Currently the package is only available from GitHub, but once I figure out the ExperimentHub package from Bioconductor I hope to move it there. For now you can install it with:

install.packages('devtools')
library(devtools)
devtools::install_github('jtleek/tidypvals')


Then you can load the library and access each data set by name.

library(tidypvals)
jager2014


Data sets can be easily merged, but be careful to avoid duplicated p-values across different data sets. You can see how each data set was obtained and tidied by viewing the corresponding vignette.

vignette("jager-2014",package="tidypvals")
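The warning about duplicated p-values when merging can be sketched in base R. The two data frames below are hypothetical stand-ins with invented values, not data sets from the package. Treating rows that share `pmid`, `pvalue`, and `operator` as duplicates is one possible rule, not the package's own method. It would also collapse a paper that genuinely reports the same p-value twice, so treat it as a starting point.

```r
# Two hypothetical collections sharing the tidypvals columns (not real data).
set_a <- data.frame(
  pvalue   = c(0.01, 0.04, 0.20),
  year     = c(2010, 2011, 2012),
  operator = c("lessthan", "equals", "equals"),
  pmid     = c("111", "222", "333"),
  stringsAsFactors = FALSE
)
set_b <- data.frame(
  pvalue   = c(0.04, 0.001),
  year     = c(2011, 2013),
  operator = c("equals", "lessthan"),
  pmid     = c("222", "444"),
  stringsAsFactors = FALSE
)

# Record where each row came from before stacking the two sets.
set_a$source <- "set_a"
set_b$source <- "set_b"
combined <- rbind(set_a, set_b)

# Assumed duplicate rule: same paper, same p-value, same operator.
key <- c("pmid", "pvalue", "operator")
deduped <- combined[!duplicated(combined[, key]), ]
```

Because `duplicated()` flags only the later occurrences, the copy from the first data set is the one that is kept. Rows with a missing `pmid` would need a different key, such as `doi`.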


One purpose of tidying these data is to enable cross-study analysis of p-values in the literature. As a teaser for things coming soon, this plot represents more than 2.5 million p-values across 25 different fields. Notice anything funny?