Announcing the tidypvals package

Jeff Leek
2017-07-26

A few years ago I helped write a paper where we proposed scraping p-values from the medical literature to try to estimate the science-wise false discovery rate. The paper generated a ton of interesting discussion and inspired other groups to start collecting p-values from the literature.

As I’ve mentioned before, the p-value is the most popular statistic ever invented, so there are a lot of published p-values out there.

The tidypvals package is an effort to find previous collections of published p-values, synthesize them, and tidy them into one analyzable data set. The currently available p-value data sets in this package are:

Each data set is a “tidy” data frame and has the following columns:

Currently the package is only available from GitHub, but once I figure out the ExperimentHub package from Bioconductor I hope to move the package there. For now you can install it with:

install.packages('devtools')
library(devtools)
devtools::install_github('jtleek/tidypvals')

Then you can load the package and access each data set by name.

library(tidypvals)
jager2014

Data sets can be easily merged, but be careful to avoid duplicated p-values across different data sets. You can see how each data set was obtained and tidied by viewing the corresponding vignette.

vignette("jager-2014",package="tidypvals")
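To make the caution about duplicates concrete, here is a minimal sketch of merging two data sets and dropping repeated p-values. It uses toy data frames rather than the package's data, and the column names (pmid, pvalue) and the data frame names (set_a, set_b) are assumptions for illustration only; check the real column names before adapting it.

```r
# Toy stand-ins for two p-value data sets. The column names (pmid, pvalue)
# are assumed for illustration, not taken from the package documentation.
set_a <- data.frame(pmid = c("1", "2", "3"),
                    pvalue = c(0.01, 0.04, 0.20),
                    stringsAsFactors = FALSE)
set_b <- data.frame(pmid = c("3", "4"),
                    pvalue = c(0.20, 0.03),
                    stringsAsFactors = FALSE)

# Record where each row came from, then stack the two data sets.
set_a$source <- "set_a"
set_b$source <- "set_b"
combined <- rbind(set_a, set_b)

# Flag rows whose (pmid, pvalue) pair already appeared earlier in the stack,
# and keep only the first occurrence of each pair.
is_dup <- duplicated(combined[, c("pmid", "pvalue")])
merged <- combined[!is_dup, ]

nrow(combined)
nrow(merged)
```

Which columns identify a duplicate is a judgment call: matching on an article identifier plus the p-value itself is one plausible choice, but it will also collapse genuinely repeated p-values reported within the same article.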

One purpose of tidying these data is to be able to do cross-study analysis of p-values in the literature. As a teaser for things coming soon, this plot represents more than 2.5 million p-values across 25 different fields. Notice anything funny?

[Figure: All p-values]