Announcing the tidypvals package

A few years ago I helped write a paper where we proposed scraping p-values from the medical literature to try to estimate the science-wise false discovery rate. The paper generated a ton of interesting discussion and inspired other groups to start collecting p-values from the literature.

As I’ve mentioned before the p-value is the most popular statistic ever invented so there are a lot of published p-values out there.

The tidypvals package is an effort to find previous collections of published p-values, synthesize them, and tidy them into one analyzable data set. Several previously published collections are currently included in the package.

Each data set is a “tidy” data frame with the following columns:

  • pvalue - the reported p-value
  • year - the year of the publication in which the p-value appeared
  • journal - the journal in which the publication appeared
  • field - the field of the paper, using the categorization in Head et al. 2015
  • abstract - whether the p-value appeared in the abstract of the paper
  • operator - whether the p-value was reported as “lessthan”, “greaterthan”, or “equals”
  • doi - the digital object identifier, when available
  • pmid - the PubMed ID for the paper, when available
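As a minimal sketch of that format (every value below is invented for illustration, not a real record from the package), a tidy p-value data frame looks like:

```r
# A toy two-row data frame in the tidy p-value format described above;
# all values are made up for illustration
example_pvals <- data.frame(
  pvalue   = c(0.03, 0.05),
  year     = c(2012L, 2014L),
  journal  = c("Journal A", "Journal B"),
  field    = c("Medical", "Biology"),
  abstract = c(TRUE, FALSE),
  operator = c("equals", "lessthan"),
  doi      = c("10.1000/xyz123", NA),
  pmid     = c("12345678", NA),
  stringsAsFactors = FALSE
)
str(example_pvals)
```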

Currently the package is only available from GitHub, but once I figure out the ExperimentHub package from Bioconductor I hope to move it there. For now you can install it with:

install.packages('devtools')
library(devtools)
devtools::install_github('jtleek/tidypvals')

You can then load the package and access each data set by name.

library(tidypvals)
jager2014

Data sets can be easily merged, but be careful to avoid duplicated p-values across different data sets. You can see how each data set was obtained and tidied by viewing the corresponding vignette, for example:

vignette("jager-2014",package="tidypvals")
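The duplicate problem mentioned above can be handled with a simple heuristic: treat rows that repeat the same (doi, pvalue, operator) triple as the same record. The sketch below uses invented toy data frames standing in for two package data sets; it is not the package's own deduplication logic:

```r
# Toy stand-ins for two p-value collections (all values invented)
a <- data.frame(pvalue = c(0.03, 0.05), operator = c("equals", "lessthan"),
                doi = c("10.1000/x1", "10.1000/x2"), stringsAsFactors = FALSE)
b <- data.frame(pvalue = c(0.05, 0.01), operator = c("lessthan", "equals"),
                doi = c("10.1000/x2", "10.1000/x3"), stringsAsFactors = FALSE)

# Stack the collections, tagging each row with its source
combined <- rbind(cbind(a, source = "setA"), cbind(b, source = "setB"))

# Flag rows that repeat the same (doi, pvalue, operator) triple -- a rough
# duplicate heuristic, not the package's own method
dupe <- duplicated(combined[, c("doi", "pvalue", "operator")])
deduped <- combined[!dupe, ]
nrow(deduped)  # 3: the repeated 10.1000/x2 row is dropped
```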

One purpose of tidying these data is to be able to do cross-study analysis of p-values in the literature. As a teaser for things coming soon, this plot represents more than 2.5 million p-values across 25 different fields. Notice anything funny?

[Figure: all p-values]
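If you want to explore a plot like this yourself, one approach is a fine-binned histogram of reported p-values. The sketch below assumes ggplot2 is installed and uses an invented toy data frame; with the package installed you would pass one of its data sets (for example jager2014) instead:

```r
library(ggplot2)

# Invented toy data in the tidy format; swap in a real data set such as
# jager2014 once tidypvals is installed
toy <- data.frame(pvalue = c(0.001, 0.01, 0.04, 0.049, 0.05, 0.05, 0.049))

p <- ggplot(toy, aes(x = pvalue)) +
  geom_histogram(binwidth = 0.005, boundary = 0) +
  labs(x = "Reported p-value", y = "Count")
print(p)
```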