False discovery rate regression (cc NSA's PRISM)

There is an idea I have been thinking about for a while now. It re-emerged at the top of my list after seeing this really awesome post on using metadata to identify “conspirators” in the American revolution. My first thought was: but how do you know that you aren’t just making lots of false discoveries?

Hypothesis testing and significance analysis were originally developed to make decisions for single hypotheses. In many modern applications, it is more common to test hundreds or thousands of hypotheses. In the standard multiple testing framework, you perform a hypothesis test for each of the “features” you are studying (these are typically genes or voxels in high-dimensional problems in biology, but can be other things as well). Then the following outcomes are possible:

Call Null True Call Null False Total
Null True True Negatives False Positives True Nulls
Null False False Negatives True Positives False Nulls
No Decisions Rejections

The reason for “No Decisions” is that the way hypothesis testing is set up, one should technically never accept the null hypothesis. The number of rejections is the total number of times you claim that a particular feature shows a signal of interest.

A very common measure of embarrassment in multiple hypothesis testing scenarios is the false discovery rate defined as:

 FDR = E\left[\frac{\# of False Positives}{\# of Rejections}\right]

. There are some niceties that have to be dealt with here, like the fact that the \# of Rejections may be equal to zero, inspiring things like the [positive false discovery rate](http://www.genomine.org/papers/directfdr.pdf), which has [some nice Bayesian interpretations](http://genomics.princeton.edu/storeylab/papers/Storey_Annals_2003.pdf). The way that the process usually works is that a test statistic is calculated for each hypothesis test where a larger statistic means more significant and then operations are performed on these ordered statistics. The two most common operations are: (1) pick a cutoff along the ordered list of p-values - call everything less than this threshold significant and _estimate_ the FDR for that cutoff and (2) pick an acceptable FDR level and find an algorithm to pick the threshold that _controls_ the FDR where control is defined usually by saying something like the algorithm produces E[FDP] \leq FDR. Regardless of the approach these methods usually make an assumption that the rejection regions should be nested. In other words, if you call statistic $T\_k$ significant and $T\_j > T\_k$ then your method should also call statistic $T\_j$ significant. In the absence of extra information, this is a very reasonable assumption. But in many situations you might have additional information you would like to use in the decision about whether to reject the null hypothesis for test $j$. **Example 1 **A common example is gene-set analysis. Here you have a group of hypotheses that you have tested individually and you want to say something about the level of noise _in the group_. In this case, you might want to know something about the level of noise if you call the whole set interesting. **Example 2** Suppose you are a [mysterious government agency](http://www.nsa.gov/) and you want to identify potential terrorists. You observe some metadata on people and you want to predict who is a terrorist - [say using betweenness centrality](http://kieranhealy.org/blog/archives/2013/06/09/using-metadata-to-find-paul-revere/). You could calculate a P-value for each individual, [say using a randomization test](http://kieranhealy.org/blog/archives/2013/06/09/using-metadata-to-find-paul-revere/). Then estimate your FDR based on predictions using the metadata. **Example 3 **You are monitoring a system over time where observations are random. Say for example whether there is an outbreak of a particular disease in a particular region at a given time. So, is the rate of disease higher than background. How can you estimate the rate at which you make false claims? For now I'm going to focus on the estimation scenario but you could imagine using these estimates to try to develop controlling procedures as well. In each of these cases you have a scenario where you are interested in something like:

 E\left[\frac{V}{R} | X=x\right] = fdr(x)

where fdr(x) is a covariate-specific estimator of the false discovery rate. Returning to our examples you could imagine:

Example 1

 E\left[\frac{V}{R} | GS = k\right] =\beta_0 + \sum_{\ell=1}^K\beta_{\ell} 1(GS=\ell)

Example 2

 E\left[\frac{V}{R} | Person , Age\right] =\beta_0 + \gamma Age + \sum_{\ell=1}^K\beta_{\ell}1(Person = \ell)

Example 3

 E\left[\frac{V}{R} | Time \right] =\beta_0 + \sum_{\ell =1}^{K} s_{\ell}(time)

Where in the last case, we have parameterized the relationship between FDR and time with a flexible model like cubic splines.

The hard problem is fitting the regression models in Examples 1-3. Here I propose a basic estimator of the FDR regression model and leave it to others to be smart about it. Let’s focus on P-values because they are the easiest to deal with. Suppose that we calculate the random variables Y_i = 1(P_i > \lambda)” /></span>. Then:</p>

<p style='text-align:center;'>
  <span class='MathJax_Preview'><img src='http://simplystatistics.org/wp-content/plugins/latex/cache/tex_3fab11118e2a7046c2f292c6b8c4b02c.gif' style='vertical-align: middle; border: none;' class='tex' alt= \lambda) = (1-\lambda)*\pi_0 + (1-G(\lambda))*(1-\pi_0)" />

Where $G(\lambda)$ is the empirical distribution function for the P-values under the alternative hypothesis. This may be a mixture distribution. If we assume reasonably powered tests and that $\lambda$ is large enough, then G(\lambda) \approx 1. So

 E[Y_i] \approx (1-\lambda) \pi_0

One obvious choice is then to try to model

 E[Y_i | X = x] \approx (1-\lambda) \pi_0(x)

We could, for example use the model:

 logit(E[Y_i | X = x]) = f(x)

where f(x) is a linear model or spline, etc. Then we get the fitted values and calculate:

\hat{\pi}_0(x) = \hat{E}[Y_i | X=x] /(1-\lambda)

Here is a little simulated example where the goal is to estimate the probability of being a false positive as a smooth function of time.

## Load libraries

library(splines)
## Define the number of tests
set.seed(1345)
ntest <- 1000

## Set up the time vector and the probability of being null
tme <- seq(-2,2,length=ntest)
pi0 <- pnorm(tme)

## Calculate a random variable indicating whether to draw
## the p-values from the null or alternative
nullI <- rbinom(ntest,prob=pi0,size=1)> 0

## Sample the null P-values from U(0,1) and the alternatives
## from a beta distribution

pValues <- rep(NA,ntest)
pValues[nullI] <- runif(sum(nullI))
pValues[!nullI] <- rbeta(sum(!nullI),1,50)

## Set lambda and calculate the estimate

lambda <- 0.8
y <- pValues > lambda
glm1 <- glm(y ~ ns(tme,df=3))

## Get the estimate pi0 values
pi0hat <- glm1$fitted/(1-lambda)

## Plot the real versus fitted probabilities

plot(pi0,pi0hat,col="blue",type="l",lwd=3,xlab="Real pi0",ylab="Fitted pi0")
abline(c(0,1),col="grey",lwd=3)

The result is this plot:

pi0

Real versus estimated false discovery rate when calling all tests significant.

This estimate is obviously not guaranteed to estimate the FDR well, the operating characteristics both theoretically and empirically need to be evaluated and the other examples need to be fleshed out. But isn’t the idea of FDR regression cool?

 
comments powered by Disqus