There is an idea I have been thinking about for a while now. It re-emerged at the top of my list after seeing this really awesome post on using metadata to identify “conspirators” in the American Revolution. My first thought was: but how do you know that you aren’t just making lots of false discoveries?
Hypothesis testing and significance analysis were originally developed to make decisions for single hypotheses. In many modern applications, it is more common to test hundreds or thousands of hypotheses. In the standard multiple testing framework, you perform a hypothesis test for each of the “features” you are studying (these are typically genes or voxels in high-dimensional problems in biology, but can be other things as well). Then the following outcomes are possible:
|  | Call Null True | Call Null False | Total |
|---|---|---|---|
| Null True | True Negatives | False Positives | True Nulls |
| Null False | False Negatives | True Positives | False Nulls |
|  | No Decisions | Rejections |  |
The reason for “No Decisions” is that the way hypothesis testing is set up, one should technically never accept the null hypothesis. The number of rejections is the total number of times you claim that a particular feature shows a signal of interest.
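To make the table concrete, here is a minimal simulated sketch in R (not from the original post; the null fraction, the Beta(1,50) alternative, and the 0.05 cutoff are arbitrary choices for illustration). We simulate tests where the truth is known, call everything below the cutoff significant, and cross-tabulate the decisions against the truth; the four cells are the true negatives, false positives, false negatives, and true positives from the table above.

```r
## A minimal sketch of the outcomes table (illustrative numbers only):
## simulate tests where the truth is known, call p < 0.05 significant,
## and cross-tabulate the decisions against the truth.
set.seed(101)
ntest <- 1000
isNull <- rbinom(ntest, size = 1, prob = 0.8) == 1      # ~80% true nulls
pValues <- ifelse(isNull, runif(ntest), rbeta(ntest, 1, 50))
callNullFalse <- pValues < 0.05                         # the rejections

table(truth = ifelse(isNull, "Null True", "Null False"),
      decision = ifelse(callNullFalse, "Call Null False", "Call Null True"))
```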
A very common measure of embarrassment in multiple hypothesis testing scenarios is the false discovery rate defined as:
\[ {\rm FDR} = E\left[\frac{\#\{\text{False Positives}\}}{\#\{\text{Rejections}\}}\right] \]
There are some niceties that have to be dealt with here, like the fact that the \(\#\{\text{Rejections}\}\) may be equal to zero, inspiring things like the positive false discovery rate, which has some nice Bayesian interpretations.
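To see why the zero-rejection case matters, here is a quick Monte Carlo sketch (again my own illustration, with arbitrary numbers): with only a handful of tests and a strict cutoff, some experiments produce no rejections at all, so the ratio of false positives to rejections is undefined there. The usual FDR convention treats that ratio as zero when there are no rejections, while the positive false discovery rate conditions on making at least one rejection.

```r
## Monte Carlo sketch of the FDR vs. the positive FDR when the number of
## rejections can be zero (illustrative numbers only).
set.seed(202)
nrep <- 5000; ntest <- 20; cutoff <- 0.01
vr <- replicate(nrep, {
  isNull <- rbinom(ntest, size = 1, prob = 0.9) == 1
  p <- ifelse(isNull, runif(ntest), rbeta(ntest, 1, 50))
  R <- sum(p < cutoff)             # number of rejections
  V <- sum(p < cutoff & isNull)    # number of false positives
  c(V = V, R = R)
})
V <- vr["V", ]; R <- vr["R", ]
mean(R == 0)                       # fraction of experiments with no rejections
mean(ifelse(R > 0, V / R, 0))      # FDR convention: treat V/R as 0 when R = 0
mean((V / R)[R > 0])               # positive FDR: condition on R > 0
```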
The way that the process usually works is that a test statistic is calculated for each hypothesis test, where a larger statistic means more significant, and then operations are performed on these ordered statistics. The two most common operations are: (1) pick a cutoff along the ordered list of P-values, call everything less than this threshold significant, and estimate the FDR for that cutoff; or (2) pick an acceptable FDR level and find an algorithm to pick the threshold that controls the FDR, where control is usually defined by saying something like the algorithm guarantees \({\rm FDR} \leq \alpha\) for a pre-specified level \(\alpha\).
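Here is a hedged sketch of both operations in R (my own illustration, not code from the post): for (1), a Storey-style plug-in estimate of the FDR at a fixed cutoff; for (2), Benjamini-Hochberg control via base R's `p.adjust`.

```r
## Sketch of the two common operations on a vector of P-values
## (illustrative simulation; cutoff, lambda, and alpha are arbitrary).
set.seed(303)
isNull <- rbinom(1000, size = 1, prob = 0.8) == 1
pValues <- ifelse(isNull, runif(1000), rbeta(1000, 1, 50))

## (1) Pick a cutoff and estimate the FDR at that cutoff
## using a Storey-style plug-in estimate.
cutoff <- 0.01; lambda <- 0.5
pi0hat <- mean(pValues > lambda) / (1 - lambda)   # estimated fraction of nulls
fdrhat <- pi0hat * cutoff / mean(pValues <= cutoff)
fdrhat

## (2) Pick an acceptable FDR level and let an algorithm choose the
## threshold: Benjamini-Hochberg adjusted P-values via base R.
alpha <- 0.05
rejected <- p.adjust(pValues, method = "BH") <= alpha
sum(rejected)
```

The tuning parameter `lambda` in (1) plays the same role as the \(\lambda\) that shows up in the regression estimator later in the post.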
Regardless of the approach, these methods usually make the assumption that the rejection regions should be nested. In other words, if you call statistic \(T_k\) significant and \(T_j > T_k\), then your method should also call statistic \(T_j\) significant. In the absence of extra information, this is a very reasonable assumption.
But in many situations you might have additional information you would like to use in the decision about whether to reject the null hypothesis for test \(j\).
Example 1 A common example is gene-set analysis. Here you have a group of hypotheses that you have tested individually, and you want to say something about the level of noise in the group; in particular, you might want to know the rate of false discoveries you would make if you called the whole set interesting.
Example 2 Suppose you are a mysterious government agency and you want to identify potential terrorists. You observe some metadata on people and you want to predict who is a terrorist, say using betweenness centrality. You could calculate a P-value for each individual, say using a randomization test, and then estimate your FDR as a function of the metadata.
Example 3 You are monitoring a system over time where observations are random, say, whether there is an outbreak of a particular disease in a particular region at a given time; that is, whether the rate of disease is higher than the background rate. How can you estimate the rate at which you make false claims?
For now I’m going to focus on the estimation scenario but you could imagine using these estimates to try to develop controlling procedures as well.
In each of these cases you have a scenario where you are interested in something like:

\[ \widehat{\rm FDR}(x) = \widehat{E}\left[\left.\frac{\#\{\text{False Positives}\}}{\#\{\text{Rejections}\}}\, \right|\, x \right] \]

where \(\widehat{\rm FDR}(x)\) is a covariate-specific estimator of the false discovery rate and \(x\) is the extra information in each example (the gene set, the metadata, or the time point). Returning to our examples you could imagine:
Example 1 \(\widehat{\rm FDR}(\text{gene set } k)\), the estimated false discovery rate if you call gene set \(k\) interesting.

Example 2 \(\widehat{\rm FDR}(\text{betweenness})\), the estimated false discovery rate as a function of an individual's betweenness centrality.

Example 3 \(\widehat{\rm FDR}(t) = f(t)\), the estimated false discovery rate as a function of time \(t\).
Where in the last case, we have parameterized the relationship between FDR and time with a flexible model like cubic splines.
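Concretely, one way to write that parameterization (this specific basis expansion is my own sketch; the post just says a flexible model like cubic splines) is:

\[ \widehat{\rm FDR}(t) = \hat{\beta}_0 + \sum_{j=1}^{k} \hat{\beta}_j \, b_j(t) \]

where \(b_1(t), \ldots, b_k(t)\) are a natural cubic spline basis in time, which is what `ns(tme, df = 3)` fits in the simulated example below.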
The hard problem is fitting the regression models in Examples 1-3. Here I propose a basic estimator of the FDR regression model and leave it to others to be smart about it. Let’s focus on P-values because they are the easiest to deal with. Suppose that for each test \(i\) we calculate the random variable \(Y_i = 1(p_i > \lambda)\) for some fixed threshold \(\lambda\). Then

\[ E[Y_i \mid x_i] = \pi_0(x_i)(1 - \lambda) + (1 - \pi_0(x_i))(1 - G(\lambda)) \]

where \(\pi_0(x_i)\) is the probability that test \(i\) is null given the covariate \(x_i\), and \(G(\lambda)\) is the distribution function for the P-values under the alternative hypothesis (this may be a mixture distribution). If we assume reasonably powered tests and that \(\lambda\) is large enough, then \(G(\lambda) \approx 1\). So

\[ E[Y_i \mid x_i] \approx \pi_0(x_i)(1 - \lambda). \]

One obvious choice is then to try to model \(E[Y_i \mid x_i]\) as a function of the covariate. We could, for example, use the model:

\[ E[Y_i \mid x_i] = f(x_i) \]

where \(f(x_i)\) is a linear model or spline, etc. Then we get the fitted values \(\hat{f}(x_i)\) and calculate:

\[ \hat{\pi}_0(x_i) = \frac{\hat{f}(x_i)}{1 - \lambda}. \]
Here is a little simulated example where the goal is to estimate the probability that a test is null, which is the false discovery rate if you call every test significant, as a smooth function of time.
```r
## Load libraries
library(splines)

## Define the number of tests
set.seed(1345)
ntest <- 1000

## Set up the time vector and the probability of being null
tme <- seq(-2, 2, length = ntest)
pi0 <- pnorm(tme)

## Calculate a random variable indicating whether to draw
## the p-values from the null or alternative
nullI <- rbinom(ntest, prob = pi0, size = 1) > 0

## Sample the null P-values from U(0,1) and the alternatives
## from a beta distribution
pValues <- rep(NA, ntest)
pValues[nullI] <- runif(sum(nullI))
pValues[!nullI] <- rbeta(sum(!nullI), 1, 50)

## Set lambda and calculate the estimate
lambda <- 0.8
y <- pValues > lambda
glm1 <- glm(y ~ ns(tme, df = 3))

## Get the estimated pi0 values
pi0hat <- glm1$fitted / (1 - lambda)

## Plot the real versus fitted probabilities
plot(pi0, pi0hat, col = "blue", type = "l", lwd = 3,
     xlab = "Real pi0", ylab = "Fitted pi0")
abline(c(0, 1), col = "grey", lwd = 3)
```
The result is this plot:
Real versus estimated false discovery rate when calling all tests significant.
This estimator is obviously not guaranteed to estimate the FDR well; its operating characteristics need to be evaluated both theoretically and empirically, and the other examples need to be fleshed out. But isn’t the idea of FDR regression cool?