P > 0.05? I can make any p-value statistically significant with adaptive FDR procedures

Jeff Leek
2015-08-19

Everyone knows by now that you have to correct for multiple testing when you calculate many p-values; otherwise this can happen:

http://xkcd.com/882/

 

One of the most popular ways to correct for multiple testing is to estimate or control the false discovery rate. The false discovery rate attempts to quantify the fraction of discoveries made that are false. If we call all p-values less than some threshold t significant then, borrowing notation from this great introduction to false discovery rates, we can define:

$$F(t) = \#\{\text{null } p_i \le t;\ i = 1, \ldots, m\}, \qquad S(t) = \#\{p_i \le t;\ i = 1, \ldots, m\}$$

 

So F(t) is the (unknown) number of null hypotheses called significant and S(t) is the total number of hypotheses called significant. The FDR is the expected ratio of these two quantities, which, under certain assumptions, can be approximated by the ratio of the expectations.

 

$$\mathrm{FDR}(t) = E\!\left[\frac{F(t)}{S(t)}\right] \approx \frac{E[F(t)]}{E[S(t)]}$$

 

To get an estimate of the FDR we just need estimates for E[F(t)] and E[S(t)]. The latter is easy: it is just the total number of rejections, the number of p-values less than t. If you assume the null p-values follow their expected uniform distribution, then E[F(t)] can be approximated by the proportion of null hypotheses, times the total number of hypotheses, times t. To do this, we need an estimate of $\pi_0$, the proportion of null hypotheses. There are many ways to estimate this quantity, but it is almost always estimated from the full distribution of p-values computed in an experiment. The most popular estimator compares the number of p-values greater than some cutoff to the number you would expect above that cutoff if every single hypothesis were null; this ratio approximates the fraction of null hypotheses.
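To make this concrete, here is a minimal sketch of that $\pi_0$ estimator in Python (this is my own illustration, not any particular package's implementation; the function name and the fixed cutoff of 0.5 are arbitrary choices):

```python
import numpy as np

def estimate_pi0(pvals, lam=0.5):
    """Rough estimate of pi_0, the proportion of null hypotheses.

    If every hypothesis were null, p-values would be uniform on [0, 1],
    so a fraction (1 - lam) of them would land above the cutoff lam.
    Comparing the observed fraction above lam to that expectation gives
    an estimate of the fraction of null hypotheses.
    """
    pvals = np.asarray(pvals)
    pi0 = np.mean(pvals > lam) / (1.0 - lam)
    return min(pi0, 1.0)  # pi_0 is a proportion, so cap it at 1
```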

Combining the above equation with our estimates for E[F(t)] and E[S(t)], we get:

 

$$\widehat{\mathrm{FDR}}(t) = \frac{\hat{\pi}_0 \cdot m \cdot t}{\#\{p_i \le t\}}$$
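In code, the plug-in estimate is a one-liner once you have $\hat{\pi}_0$. A sketch, building on the estimate_pi0 function above (again, the names are mine and this is illustrative, not a reference implementation):

```python
def estimate_fdr(pvals, t, lam=0.5):
    """Plug-in FDR estimate at threshold t.

    Numerator: expected number of null p-values below t, pi0_hat * m * t
    (null p-values are uniform). Denominator: observed number of
    rejections, #{p_i <= t}.
    """
    pvals = np.asarray(pvals)
    m = len(pvals)
    pi0 = estimate_pi0(pvals, lam)
    rejections = max(int(np.sum(pvals <= t)), 1)  # avoid dividing by zero
    return min(pi0 * m * t / rejections, 1.0)
```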

 

The q-value is a multiple testing analog of the p-value and is defined as:

$$\hat{q}(p_i) = \min_{t \ge p_i} \widehat{\mathrm{FDR}}(t)$$
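Here is a sketch of that calculation, building on the functions above. It evaluates the FDR estimate at each observed p-value and takes a running minimum from the largest p-value down, which is one common way to implement the definition (an illustration, not a reference implementation):

```python
def q_values(pvals, lam=0.5):
    """q-value for each p-value: the smallest estimated FDR over all
    thresholds t >= p, where the candidate thresholds are the observed
    p-values themselves."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    pi0 = estimate_pi0(pvals, lam)
    order = np.argsort(pvals)                     # ascending
    ranked = pvals[order]
    # FDR estimate at t = each observed p-value; rank i+1 = #{p_j <= t}
    fdr = pi0 * m * ranked / np.arange(1, m + 1)
    # q(p) = min over t >= p, i.e. a running minimum from the right
    q = np.minimum(np.minimum.accumulate(fdr[::-1])[::-1], 1.0)
    out = np.empty(m)
    out[order] = q
    return out
```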

 

This is of course a very loose version of the argument, and you can get a more technical description here. But the main thing to notice is that the q-value depends on the estimated proportion of null hypotheses, which in turn depends on the distribution of the observed p-values. The smaller the estimated fraction of null hypotheses, the smaller the FDR estimate and the smaller the q-value. This suggests a way to make any p-value significant by altering its “testing partners”. Here is a quick example. Suppose that we have done a test and have a p-value of 0.8. Not super significant. Suppose we perform this test in conjunction with a number of hypotheses that are null, generating a p-value distribution like this:

[Figure: histogram of the p-values, approximately uniform between 0 and 1]

Then you get a q-value greater than 0.99 as you would expect. But if you test that exact same p-value with a ton of other non-null hypotheses that generate tiny p-values in a distribution that looks like this:

[Figure: histogram of the p-values, nearly all concentrated near zero]

 

Then you get a q-value of 0.0001 for that same p-value of 0.8. The reason is that the estimate of the fraction of null hypotheses goes essentially to zero, which drives down the q-value. You can do this with any p-value: if you make its testing partners have sufficiently low p-values, then its q-value will be as small as you like.
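Here is a quick numerical illustration of that trick, using the sketch functions above. The p-value of 0.8 comes from the example in the post; the number of partner tests, the shape of their p-value distributions, and the random seed are arbitrary choices:

```python
rng = np.random.default_rng(0)

target = 0.8  # the p-value we care about; not significant on its own

# Scenario 1: 10,000 null testing partners (uniform p-values).
null_partners = rng.uniform(0.0, 1.0, size=10_000)
q_null = q_values(np.append(null_partners, target))[-1]

# Scenario 2: 10,000 strongly non-null testing partners (tiny p-values),
# which drives the estimated fraction of null hypotheses toward zero.
nonnull_partners = rng.uniform(0.0, 1e-6, size=10_000)
q_nonnull = q_values(np.append(nonnull_partners, target))[-1]

print(q_null)     # close to 1, as with the first (uniform) histogram
print(q_nonnull)  # on the order of 1e-4, as with the second histogram
```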

A couple of things to note: