27
Jun

## What is the Best Way to Analyze Data?

One topic I've been thinking about recently is the extent to which data analysis is an art versus a science. In my thinking about art and science, I rely on Don Knuth's distinction, from his 1974 lecture "Computer Programming as an Art":

Science is knowledge that we understand so well that we can teach it to a computer; and if we don't fully understand something, it is an art to deal with it. Since the notion of an algorithm or a computer program provides us with an extremely useful test for the depth of our knowledge about any given subject, the process of going from an art to a science means that we learn how to automate something.

Of course, the phrase "analyze data" is far too general; it needs to be placed in a much more specific context. So choose your favorite specific context and consider this question: Is there a way to teach a computer how to analyze the data generated in that context? Jeff wrote about this a while back and he called this magical program the deterministic statistical machine.

For example, one area where I've done some work is in estimating short-term/acute population-level effects of ambient air pollution. These are typically done using time series data of ambient pollution from central monitors and community-level counts of some health outcome (e.g. deaths, hospitalizations). The basic question is: if pollution goes up on a given day, do we also see health outcomes go up on the same day, or perhaps in the few days afterwards? This is a fairly well-worn question in the air pollution literature and there have been hundreds of time series studies published. Similarly, there has been a lot of research into the statistical methodology for conducting time series studies and I would wager that as a result of that research we actually know something about what to do and what not to do.

But is our level of knowledge about the methodology for analyzing air pollution time series data at the point where we could program a computer to do the whole thing? Probably not, but I believe there are aspects of the analysis that we could program.

Here's how I might break it down. Assume we basically start with a rectangular dataset with time series data on a health outcome (say, daily mortality counts in a major city), daily air pollution data, and daily data on other relevant variables (e.g. weather). Typically, the target of analysis is the association between the air pollution variable and the outcome, adjusted for everything else.

1. Exploratory analysis. Not sure this can be fully automated. Need to check for missing data and maybe stop analysis if proportion of missing data is too high? Check for high leverage points as pollution data tends to be skewed. Maybe log-transform if that makes sense in this context. Check for other outliers and note them for later (we may want to do a sensitivity analysis without those observations).
2. Model fitting. This is already fully automated. If the outcome is a count, then typically a Poisson regression model is used. We already know that maximum likelihood is an excellent approach and better than most others under reasonable circumstances. There's plenty of GLM software out there so we don't even have to program the IRLS algorithm.
3. Model building. Since this is not a prediction model, the main concern we have is that we properly adjusted for measured and unmeasured confounding. Francesca Dominici and some of her colleagues have done some interesting work regarding how best to do this via Bayesian model averaging and other approaches. I would say that in principle this can be automated, but the lack of easy-to-use software at the moment makes it a bit complicated. That said, I think simpler versions of the "ideal approach" can be easily implemented.
4. Sensitivity analysis. There are a number of key sensitivity analyses that need to be done in all time series analyses. If there were outliers during EDA, maybe re-run model fit and see if regression coefficient for pollution changes much. How much is too much? (Not sure.) For time series models, unmeasured temporal confounding is a big issue so this is usually checked using spline smoothers on the time variable with different degrees of freedom. This can be automated by fitting the model many different times with different degrees of freedom in the spline.
5. Reporting. Typically, some summary statistics for the data are reported along with the estimate + confidence interval for the air pollution association. Estimates from the sensitivity analysis should be reported (probably in an appendix), and perhaps estimates from different lags of exposure, if that's a question of interest. It's slightly more complicated if you have a multi-city study.
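To make the "already fully automated" claim in step 2 concrete, here is a minimal sketch of the IRLS algorithm that GLM software runs for a Poisson log-linear model, written in plain Python. The function names (`solve_linear`, `fit_poisson_irls`) and the toy data are hypothetical and chosen only for illustration; a real time series analysis would rely on established GLM software and would add smooth functions of time and weather to the model.

```python
import math

def solve_linear(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def fit_poisson_irls(X, y, n_iter=50, tol=1e-10):
    """Fit log E[y] = X beta by iteratively reweighted least squares.

    Assumes column 0 of X is an intercept (used only for the starting value).
    """
    p = len(X[0])
    beta = [0.0] * p
    beta[0] = math.log(sum(y) / len(y))  # start at the overall log mean
    for _ in range(n_iter):
        eta = [sum(xi[j] * beta[j] for j in range(p)) for xi in X]
        mu = [math.exp(e) for e in eta]
        # Working response; the IRLS weights for the log link are mu itself.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        xtwx = [[sum(m * xi[j] * xi[k] for xi, m in zip(X, mu)) for k in range(p)]
                for j in range(p)]
        xtwz = [sum(m * xi[j] * zi for xi, m, zi in zip(X, mu, z)) for j in range(p)]
        new_beta = solve_linear(xtwx, xtwz)
        converged = max(abs(a - b) for a, b in zip(new_beta, beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

# Toy data (made up): daily counts on two low-pollution and two high-pollution days.
X = [[1, 0], [1, 0], [1, 1], [1, 1]]  # intercept, high-pollution indicator
y = [2, 4, 6, 12]
beta = fit_poisson_irls(X, y)
rate_ratio = math.exp(beta[1])  # estimated ratio of mean counts, high vs. low days
```

With a single binary covariate the maximum likelihood estimate has a closed form (the log of each group's mean count), which makes this toy case easy to check by hand. The sensitivity analysis in step 4 would amount to calling a fitting routine like this in a loop over different design matrices.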

So I'd say that of the five major steps listed above, the one that I find most difficult to automate is EDA. There, a lot of choices have to be made that are not easy to program into a computer. But I think the rest of the analysis could be automated. I've left out the cleaning and preparation of the data here, which also involves making many choices. But in this case, much of that is often outside the control of the investigator. These analyses typically use publicly available data where the data are available "as-is". For example, the investigator would likely have no control over how the mortality counts were created.

What's the point of all this? Well, I would argue that if we cannot completely automate a data analysis for a given context, then either we need to narrow the context, or we have some more statistical research to do. Thinking about how one might automate a data analysis process is a useful way to identify where the major statistical gaps are in a given area. Here, there may be some gaps in how best to automate the exploratory analyses. Whether those gaps can be filled (or more importantly, whether you are interested in filling them) is not clear. But most likely it's not a good idea to think about better ways to fit Poisson regression models.

So what do you do when all of the steps of the analysis have been fully automated? Well, I guess time to move on then....

26
Jun

## Art from Data

There's a nice piece by Mark Hansen about data-driven aesthetics in the New York Times special section on big data.

From a speedometer to a weather map to a stock chart, we routinely interpret and act on data displayed visually. With a few exceptions, data has no natural “look,” no natural “visualization,” and choices have to be made about how it should be displayed. Each choice can reveal certain kinds of patterns in the data while hiding others.

I think drawing a line between a traditional statistical graphic and a pure work of art would be somewhat difficult. You can find examples of both that might fall in the opposite category: traditional graphics that transcend their utilitarian purposes and "pure art" works that tell you something new about your world.

Indeed, I think Mark Hansen's own work with Ben Rubin falls into the latter category--art pieces that perhaps had their beginnings as pure works of art but ended up giving you new insight into the world. For example, Listening Post was a highly creative installation that simultaneously gave you an emotional connection to random people chatting on the Internet as well as insight into what the Internet was "saying" at any given time (I wonder if NSA employees took a field trip to the Whitney Museum of American Art!).

25
Jun

## Doing Statistical Research

There's a wonderful article over at the STATtr@k web site by Terry Speed on How to Do Statistical Research. There is a lot of good advice there, but the column is most notable because it's pretty much the exact opposite of the advice that I got when I first started out.

To quote the article:

The ideal research problem in statistics is “do-able,” interesting, and one for which there is not much competition. My strategy for getting there can be summed up as follows:

• Consulting: Do a very large amount
• Collaborating: Do quite a bit
• Research: Do some

For the most part, I was told to flip the research and consulting bits. That is, you want to spend most of your time doing "research" and very little of your time doing "consulting". Why? Because ostensibly, the consulting work doesn't involve new problems, only solving old problems with existing techniques. The research work by definition involves addressing new problems.

But,

A strategy I discourage is “develop theory/model/method, seek application.” Developing theory, a model, or a method suggests you have done some context-free research; already a bad start. The existence of proof (Is there a problem?) hasn’t been given. If you then seek an application, you don’t ask, “What is a reasonable way to answer this question, given this data, in this context?” Instead, you ask, “Can I answer the question with this data; in this context; with my theory, model, or method?” Who then considers whether a different (perhaps simpler) answer would have been better?

The truth is, most problems can be solved with an existing method. They may not be 100% solvable with existing tools, but usually 90% is good enough and it's not worth developing a new statistical method to cover the remaining 10%. What you really want to be doing is working on the problem that is 0% solvable with existing methods. Then there's a pretty big payback if you develop a new method to address it and it's more likely that your approach will be adopted by others simply because there's no alternative. But in order to find these 0% problems, you have to see a lot of problems, and that's where the consulting and collaboration come in. Exposure to lots of problems lets you see the universe of possibilities and gives you a sense of where scientists really need help and where they're more or less doing okay.

Even if you agree with Terry's advice, implementing it may not be so straightforward. It may be easier/harder to do consulting and collaboration depending on where you work. Also, finding good collaborators can be tricky and may involve some trial and error.

But it's useful to keep this advice in mind, especially when looking for a job. The places you want to be on the lookout for are places that give you the most exposure to interesting scientific problems, the 0% problems. These places will give you the best opportunities for collaboration and for having a real impact on science.

24
Jun

## Does fraud depend on my philosophy?

Ever since my last post on replication and fraud I've been doing some more thinking about why people consider some things "scientific fraud". (First of all, let me just say that I was a bit surprised by the discussion in the comments for that post. Some people apparently thought I was asking about the actual probability that the study was a fraud. This was not the case. I just wanted people to think about how they would react when confronted with the scenario.)

I often find that when I talk to people about the topic of scientific fraud, especially statisticians, there is a sense that much work that goes on out there is fraudulent, but the precise argument for why is difficult to pin down.

Consider the following three cases:

1. I conduct a randomized clinical trial comparing a new treatment and a control and their effect on outcome Y1. I also collect data on outcomes Y2, Y3, ... Y10. After conducting the trial I see that there isn't a significant difference for Y1 so I test the other 9 outcomes and find a significant effect (defined as p-value equal to 0.04) for Y7. I then publish a paper about outcome Y7 and state that it's significant with p=0.04. I make no mention of the other outcomes.
2. I conduct the same clinical trial with the 10 different outcomes and look at the difference between the treatment groups for all outcomes. I notice that the largest standardized effect size is for Y7 with a standardized effect of 3, suggesting the treatment is highly effective in this trial. I publish a paper about outcome Y7 and state that the standardized effect size was 3 for comparing treatment vs. control. I note that a difference of 3 is highly significant, but I make no mention of statistical significance or p-values. I also make no mention of the other outcomes.
3. I conduct the same clinical trial with the 10 outcomes. Now I look at all 10 outcomes and calculate the posterior probability that the effect is greater than zero (favoring the new treatment), given a pre-specified diffuse prior on the effect (assume it's the same prior for each effect). Of the 10 outcomes I see that Y7 has the largest posterior probability of 0.98. I publish a paper about Y7 stating that my posterior probability for a positive effect is 0.98. I make no mention of the other outcomes.

Which one of these cases constitutes scientific fraud?

1. I think most people would object to Case 1. This is the classic multiple testing scenario where the end result is that the stated p-value is not correct. Rather than a p-value of 0.04 the real p-value is more like 0.4. A simple Bonferroni correction fixes this but obviously would have resulted in not finding any significant effects based on a 0.05 threshold. The real problem is that in Case 1 you are clearly trying to make an inference about future studies. You're saying that if there's truly no difference, then in 100 other studies just like this one, you'd expect only 4 to detect a difference under the same criteria that you used. But it's incorrect to say this and perhaps fraudulent (or negligent) depending on your underlying intent. In this case a relevant detail that is missing is the number of other outcomes that were tested.
2. Case 2 differs from case 1 only in that no p-values are used but rather the measure of significance is the standardized effect size. Therefore, no probability statements are made and no inference is made about future studies. Although the information about the other outcomes is similarly omitted in this case as in case 1, it's difficult for me to identify what is wrong with this paper.
3. Case 3 takes a Bayesian angle and is more or less like case 2 in my opinion. Here, probability is used as a measure of belief about a parameter but no explicit inferential statements are made (i.e. there is no reference to some population of other studies). In this case I just state my belief about whether an effect/parameter is greater than 0. Although I also omit the other 9 outcomes in the paper, revealing that information would not have changed anything about my posterior probability.
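The "more like 0.4" figure for Case 1 can be checked with a couple of lines of arithmetic. The sketch below assumes 10 outcomes were tested in total; the second calculation additionally assumes the 10 tests are independent, which the scenario does not state. The function names are made up for illustration.

```python
def bonferroni_adjust(p_value, n_tests):
    """Bonferroni-adjusted p-value: multiply by the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)

def prob_at_least_one_significant(alpha, n_tests):
    """Chance of at least one p < alpha among n_tests independent tests
    when every null hypothesis is true (independence is an assumption here)."""
    return 1.0 - (1.0 - alpha) ** n_tests

# Case 1: reported p = 0.04 for Y7 after looking at 10 outcomes.
adjusted_p = bonferroni_adjust(0.04, 10)
# Probability of finding something "significant" at 0.05 by chance alone.
family_wise_error = prob_at_least_one_significant(0.05, 10)
```

Under these assumptions both numbers come out near 0.4: the adjusted p-value is well above the 0.05 threshold, and a "significant" outcome would turn up by chance in roughly 4 out of 10 such trials.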

In each of these three scenarios, the underlying data were generated in the exact same way (let's assume for the moment that the trial itself was conducted with complete integrity).  In each of the three scenarios, 10 outcomes were examined and outcome Y7 was in some sense the most interesting.

Of course, the analyses and the interpretation of the data were not the same in each scenario. Case 1 makes an explicit inference whereas Cases 2 and 3 essentially do not. However, I would argue the evidence about the new treatment compared to the control treatment in each scenario was identical.

I don't believe that the investigator in Case 1 should be allowed to engage in such shenanigans with p-values, but should he/she be pilloried simply because the p-value was the chosen metric of significance? I guess the answer would be "yes" for many of you, but keep in mind that the investigator in Case 1 still generated the same evidence as the others. Should the investigators in Case 2 and Case 3 be thrown in the slammer? If so, on what basis?

My feeling is not that people should be allowed to do whatever they please, but we need a better way to separate the "stuff" from the stuff. This is both a methodological and a communications issue. For example, Case 3 may not be fraud but I'm not necessarily interested in what the investigator's opinion about a parameter is. I want to know what the data say about that parameter (or treatment difference in this case). Is it fraud to make any inferences in the first place (as in Case 1)? I mean, how could you possibly know that your inference is "correct"? If "all models are wrong, but some are useful", does that mean that everyone is committing fraud?

23
Jun

## Sunday data/statistics link roundup (6/23/13)

1. An interesting study describing a scenario where significance testing may be beneficial and where even the file drawer effect may be beneficial. Granted this is all simulation so you have to take it with a grain of salt, but I like the pushback against the hypothesis testing haters. In all things moderation, including hypothesis testing.
2. Venn Diagrams for the win, bro.
3. The new basketball positions. The idea is to cluster players based on the positions on the floor where they shoot, etc. I like the idea of data driven position definitions; I am a little worried about "reading ideas in" to a network picture.
4. A really cool idea about a startup that makes data on healthcare procedures available to patients. I'm all about data transparency, but it makes me wonder, how often do people with health insurance negotiate the prices of procedures? (via Leah J.)
5. Another interesting article about using tweets (and other social media) to improve public health. I do wonder about potential sampling issues, like what happened with Google Flu Trends (via Nick C.)

21
Jun

## Interview with Miriah Meyer - Microsoft Faculty Fellow and Visualization Expert

Miriah Meyer received her Ph.D. in computer science from the University of Utah, then did a postdoctoral fellowship at Harvard University and was a visiting fellow at MIT and the Broad Institute. Her research focuses on developing visualization tools in close collaboration with biological scientists. She has been recognized as a Microsoft Faculty Fellow, a TED Fellow, and appeared on the TR35. We talked with Miriah about visualization, collaboration, and her influences during her career as part of the Simply Statistics Interview Series.

SS: Which term applies to you: data scientist, statistician, computer scientist, or something else?

MM: My training is as a computer scientist and much of the way I problem solve is grounded in computational thinking. I do, however, sometimes think of myself as a data counselor, as a big part of what I do is help my collaborators move towards a deeper and more articulate statement about what they want/need to do with their data.

SS: Most data analysis is done by scientists, not trained statisticians. How does data visualization help/hurt scientists when looking at piles of complicated data?

MM: In the sciences, visualization is particularly good for hypothesis generation and early stage exploration. With many fields turning toward data-driven approaches, scientists are often not sure of exactly what they will find in a mound of data. Visualization allows them to look into the data without having to specify a specific question, query, or model. This early, exploratory analysis is very difficult to do strictly computationally. Exploration via interactive visualization can lead a scientist towards establishing a more specific question of the data that could then be addressed algorithmically.

SS: What are the steps in developing a visualization with a scientific collaborator?

MM: The first step is finding good collaborators.

The beginning of a project is spent in discussions with the scientists, trying to understand their needs, data, and mental models of a problem. I find this part to be the most interesting, and also the most challenging. The goal is to develop a clear, abstract understanding of the problem and set of design requirements. We do this through interviews and observations, with a focus on understanding how people currently solve problems and what they want to do but can't with current tools.

Next is to take this understanding and prototype ideas for visualization designs. Rapid prototyping on paper is usually first, followed by more sophisticated, software prototypes after getting feedback from the collaborators. Once a design is sufficiently fleshed out and validated, a (relatively) full-featured visualization tool is developed and deployed.

At this point, the scientists tend to realize that the problem they initially thought was most interesting isn't... and the cycle continues.

Fast iteration is really essential in this process. In the past I've gone through as many as three cycles of this process before finding the right problem abstractions and designs.

SS: You have tackled some diverse visualizations (from synteny to poems); what are the characteristics of a problem that make it a good candidate for new visualizations?

MM: For me, the most important thing is to find good collaborators. It is essential to find partners that are willing to give lots of their time up front, are open-minded about research directions, and are working on cutting-edge problems in their field. This latter characteristic helps to ensure that there will be something novel needed from a data analysis and visualization perspective.

The other thing is to test whether a problem passes the Tableau/R/Matlab test: if the problem can't be solved using one of these environments, then that is probably a good start.

SS: What is the four-level nested model for design validation and how did you extend it?

MM: This is a design decision model that helps to frame the different kinds of decisions made in the visualization design process, such as decisions about data derivations, visual representations, and algorithms. The model helps to put any one decision in the context of other visualization ideas, methods, and techniques, and also helps a researcher generalize new ideas to a broader class of problems. We recently extended this model to specifically define what a visualization "guideline" is, and how to relate this concept to how we design and evaluate visualizations.

SS: Who are the key people who have been positive influences on your career and how did they help you?

MM: One influence that jumps out to me is a collaboration with a designer in Boston named Bang Wong. Working with Bang completely changed my approach to visualization development and got me thinking about iteration, rapid prototyping, and trying out many ideas before committing. Also important were two previous supervisors, Ross Whitaker and Tamara Munzner, who constantly pushed me to be precise and articulate about problems and approaches to them. I believe that precision is a hallmark of good data science, even when characterizing imprecise things.

SS: Do you have any advice for computer scientists/statisticians who want to work on visualization as a research area?

MM: Do it! Visualization is a really fun, vibrant, growing field. It relies on a broad spectrum of skills, from computer science, to design, to collaboration. I would encourage those interested to not get too infatuated with the engineering or the aesthetics and to instead focus on solving real-world problems. There is an unlimited supply of those!

20
Jun

## Google's brainteasers (that don't work) and Johns Hopkins Biostatistics Data Analysis

This article is getting some attention because Google's VP for people operations has made public a few insights that the Google HR team has come to over the last several years. The most surprising might be:

1. They don't collect GPAs except for new candidates
2. Test scores are worthless
3. Interview scores weren't correlated with success.
4. Brainteasers that Google is so famous for are worthless
5. Behavioral interviews are the most effective

The reason the article is getting so much attention is how surprising these facts may be to people who have little experience hiring/managing in technical fields. But I thought this quote was really telling:

One of my own frustrations when I was in college and grad school is that you knew the professor was looking for a specific answer. You could figure that out, but it’s much more interesting to solve problems where there isn’t an obvious answer.

Interestingly, that is the whole point of my data analysis course here at Hopkins. Over my relatively limited time as a faculty member I realized there were two key qualities that made students in biostatistics stand out: (1) that they were hustlers - willing to just work until the problem is solved even if it was frustrating and (2) that they were willing/able to try new approaches or techniques they weren't comfortable with. I don't have the quantitative data that Google does, but I would venture to guess those two traits explain 80%+ of the variation in success rates for graduate students in statistics/computing/data analysis.

Once that realization is made, it becomes clear pretty quickly that textbook problems or re-analysis of well known data sets measure something orthogonal to traits (1) and (2). So I went about redesigning the types of problems our students had to tackle. Instead of assigning problems out of a book I redesigned the questions to have the following characteristics:

1. They were based on live data sets. I define a "live" data set as a data set that has not been used to answer the question of interest previously.
2. The questions are problem forward, not solution backward. I would have an idea of what would likely work and what would likely not work. But I defined the question without thinking about what methods the students might use.
3. The answer was open ended (and often not known to me in advance).
4. The problems often had to do with unique scenarios not encountered frequently in statistics (e.g. you have a data census instead of just a sample).
5. The problems involved methods application/development, coding, and writing/communication.

I have found that problems with these characteristics more precisely measure hustle and flexibility, like Google is looking for in their hiring practices. Of course, there are some down sides to this approach. I think it can be more frustrating for students, who don't have as clearly defined a path through the homework. It also means dramatically more work for the instructor in terms of analyzing the data to find the quirks, creating personalized feedback for students, and being able to properly estimate the amount of work a project will take.

We have started thinking about how to do this same thing at scale on Coursera. In the meantime, Google will just have to send their recruiters to Hopkins Biostats to find students who meet the characteristics they are looking for :-).

16
Jun

## Sunday data/statistics link roundup (6/16/13 - Father's day edition!)

1. Datapalooza! I'm wondering where my invite is? I do health data stuff, pick me, pick me! Actually it does sound like a pretty good idea - in general giving a bunch of smart people access to interesting data and real science problems can produce some cool results (link via Dan S.)
2. This report on precision medicine from the Manhattan Institute is related to my post this week on personalized medicine. I like the idea that we should be focusing on developing new ideas for adaptive trials (my buddy Michael is all over that stuff). I did think that it was a little pie-in-the-sky with plenty of buzzwords like Bayesian causal networks and pattern recognition. I think these ideas are certainly applicable, but the report, I think, overstates the current level of applicability of these methods. We need more funding and way more research to support this area before we should automatically adopt it - big data can be used to confuse when methods aren't well understood (link via Rafa via Marginal Revolution).
3. rOpenSci wins a grant from the Sloan Foundation! Psyched to see this kind of innovative open software development get the support it deserves. My favorite rOpenSci package is rFigshare, what's yours?
4. A k-means approach to detecting what will be trending on Twitter. It always gets me so pumped up to see the creative ways that methods that have been around forever can be adapted to solve real, interesting problems.
5. Finally, I thought this link was very appropriate for Father's Day. I couldn't agree more that the best kind of learning happens when you are just so into something that you forget you are learning. Happy Father's Day everyone!

14
Jun

## The vast majority of statistical analysis is not performed by statisticians

Whether you know it or not, everything you do produces data - from the websites you read to the rate at which your heart beats. Until pretty recently, most of the data you produced wasn’t collected, it floated off unmeasured. The only data that were collected were painstakingly gathered by scientists one number at a time in small experiments with a few people. This laborious process meant that data were expensive and time-consuming to collect. Yet many of the most amazing scientific discoveries over the last two centuries were squeezed from just a few data points. But over the last two decades, the unit price of data has dramatically dropped. New technologies touching every aspect of our lives from our money, to our health, to our social interactions have made data collection cheap and easy (see e.g. Camp Williams).

To give you an idea of how steep the drop in the price of data has been, in 1967 Stanley Milgram did an experiment to determine the number of degrees of separation between two people in the U.S. In his experiment he sent 296 letters to people in Omaha, Nebraska and Wichita, Kansas. The goal was to get the letters to a specific person in Boston, Massachusetts. The trick was people had to send the letters to someone they knew, and they then sent it to someone they knew and so on. At the end of the experiment, only 64 letters made it to the individual in Boston. On average, the letters had gone through 6 people to get there. This is where the idea of “6-degrees of Kevin Bacon” comes from. Based on 64 data points.  A 2007 study updated that number to “7 degrees of Kevin Bacon”. The study was based on 30 billion instant messaging conversations collected over the course of a month or two with the same amount of effort.

Once data started getting cheaper to collect, it got cheaper fast. Take another example, the human genome. The genome is the unique DNA code in every one of your cells. It consists of a set of 3 billion letters that is unique to you. By many measures, the race to be the first group to collect all 3 billion letters from a single person kicked off the data revolution in biology. The project was completed in 2000 after a decade of work and $3 billion to collect the 3 billion letters in the first human genome. This project was actually a stunning success; most people thought it would be much more expensive. But just over a decade later, new technology means that we can now collect all 3 billion letters from a person’s genome for about $10,000 in about a week.
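To put the genome numbers side by side, here is a back-of-the-envelope calculation using only the figures quoted above (3 billion letters, $3 billion then, about $10,000 now). The variable names are made up for illustration, and the dollar amounts are the round numbers from the text, not precise costs.

```python
letters = 3_000_000_000          # letters in one human genome
cost_first_genome = 3_000_000_000  # USD, first genome (figure quoted above)
cost_today = 10_000              # USD, approximate current cost (figure quoted above)

cost_per_letter_first = cost_first_genome / letters  # dollars per letter then
cost_per_letter_today = cost_today / letters         # dollars per letter now
fold_drop = cost_first_genome / cost_today           # how many times cheaper
```

On these round numbers the first genome cost about a dollar per letter, and the price has since fallen by a factor of about 300,000.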

As the price of data dropped so dramatically over the last two decades, the division of labor between analysts and everyone else became less and less clear. Data became so cheap that it couldn’t be confined to just a few highly trained people. So raw data started to trickle out in a number of different ways. It started with maps of temperatures across the U.S. in newspapers and quickly ramped up to information on how many friends you had on Facebook, the price of tickets on 50 airlines for the same flight, or measurements of your blood pressure, good cholesterol, and bad cholesterol at every doctor’s visit. Arguments about politics started focusing on the results of opinion polls and who was asking the questions. The doctor stopped telling you what to do and started presenting you with options and the risks that went along with each.

That is when statisticians stopped being the primary data analysts. At some point, the trickle of data about you, your friends, and the world started impacting every component of your life. Now almost every decision you make is based on data you have about the world around you. Let’s take something simple, like where are you going to eat tonight. You might just pick the nearest restaurant to your house. But you could also ask your friends on Facebook where you should eat, or read reviews on Yelp, or check out menus on the restaurants’ websites. All of these are pieces of data that are collected and presented for you to "analyze".

This revolution demands a new way of thinking about statistics. It has precipitated explosive growth in data visualization - the most accessible form of data analysis. It has encouraged explosive growth in MOOCs like the ones Roger, Brian and I taught. It has created open data initiatives in government. It has also encouraged more accessible data analysis platforms in the form of startups like StatWing that make it easier for non-statisticians to analyze data.

What does this mean for statistics as a discipline? Well, it is great news in that we have a lot more people to train. It also really drives home the importance of statistical literacy. But it also means we need to adapt our thinking about what it means to teach and perform statistics. We need to focus increasingly on interpretation and critique and away from formulas and memorization (think English composition versus grammar). We also need to realize that the most impactful statistical methods will not be used by statisticians, which means we need more foolproofing, more time automating, and more time creating software. The potential payout is huge for realizing that the tide has turned and most people who analyze data aren't statisticians.

13
Jun

## False discovery rate regression (cc NSA's PRISM)

There is an idea I have been thinking about for a while now. It re-emerged at the top of my list after seeing this really awesome post on using metadata to identify "conspirators" in the American revolution. My first thought was: but how do you know that you aren't just making lots of false discoveries?

Hypothesis testing and significance analysis were originally developed to make decisions for single hypotheses. In many modern applications, it is more common to test hundreds or thousands of hypotheses. In the standard multiple testing framework, you perform a hypothesis test for each of the "features" you are studying (these are typically genes or voxels in high-dimensional problems in biology, but can be other things as well). Then the following outcomes are possible:

| | Call Null True | Call Null False | Total |
|---|---|---|---|
| Null True | True Negatives | False Positives | True Nulls |
| Null False | False Negatives | True Positives | False Nulls |
| | No Decisions | Rejections | |

The reason for "No Decisions" is that the way hypothesis testing is set up, one should technically never accept the null hypothesis. The number of rejections is the total number of times you claim that a particular feature shows a signal of interest.

A very common measure of embarrassment in multiple hypothesis testing scenarios is the false discovery rate, defined as:

$FDR = E\left[\frac{\# of False Positives}{\# of Rejections}\right]$.

There are some niceties that have to be dealt with here, like the fact that the $\# of Rejections$ may be equal to zero, inspiring things like the positive false discovery rate, which has some nice Bayesian interpretations.
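To make the table of outcomes concrete, here is a small simulated sketch in R. The number of tests, the 80% share of true nulls, the Beta(1, 50) alternative, and the 0.05 cutoff are all arbitrary assumptions chosen for illustration. It computes the false discovery proportion (false positives divided by rejections) for a single simulated data set and sets it to zero when nothing is rejected, which is one common way of handling the zero-rejections case.

```r
set.seed(2013)
ntest <- 1000

## Assumption: about 80% of the null hypotheses are true
nullTrue <- rbinom(ntest, size = 1, prob = 0.8) > 0

## Null P-values are U(0,1); alternative P-values are Beta(1, 50)
pValues <- rep(NA, ntest)
pValues[nullTrue] <- runif(sum(nullTrue))
pValues[!nullTrue] <- rbeta(sum(!nullTrue), 1, 50)

## Reject whenever the P-value falls below an arbitrary cutoff
reject <- pValues < 0.05

## Cross-tabulate the truth against the decisions
outcomes <- table(
  Truth = factor(nullTrue, levels = c(TRUE, FALSE),
                 labels = c("Null True", "Null False")),
  Call = factor(reject, levels = c(FALSE, TRUE),
                labels = c("Call Null True", "Call Null False"))
)
print(addmargins(outcomes))

## False discovery proportion for this one data set,
## defined as 0 when there are no rejections
nReject <- sum(reject)
nFalsePos <- sum(reject & nullTrue)
fdp <- if (nReject > 0) nFalsePos / nReject else 0
fdp
```

The FDR is the expectation of `fdp` over repeated experiments, so a single run like this only gives one draw from that distribution.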

The way that the process usually works is that a test statistic is calculated for each hypothesis test, where a larger statistic means more significant, and then operations are performed on these ordered statistics. The two most common operations are: (1) pick a cutoff along the ordered list of p-values, call everything less than this threshold significant, and estimate the FDR for that cutoff; and (2) pick an acceptable FDR level and find an algorithm to pick the threshold that controls the FDR, where control is usually defined by saying something like the algorithm produces $E[FDP] \leq FDR$. Here $FDP$ is the false discovery proportion, the realized fraction of rejections that are false positives.
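Both operations can be sketched in a few lines of R. This is a sketch under stated assumptions, not the only way to do either step: for (1) it uses a Storey-type plug-in estimate of the FDR at a fixed cutoff, and for (2) it uses the Benjamini-Hochberg adjustment available through `p.adjust`. The simulated P-values, the cutoff of 0.01, the tuning value `lambda = 0.5`, and the FDR level of 0.05 are all made up for illustration.

```r
set.seed(42)
ntest <- 1000

## Simulated P-values: about 80% null (uniform), the rest Beta(1, 50)
nullTrue <- rbinom(ntest, size = 1, prob = 0.8) > 0
pValues <- ifelse(nullTrue, runif(ntest), rbeta(ntest, 1, 50))

## (1) Fix a cutoff and estimate the FDR at that cutoff.
## pi0hat estimates the fraction of true nulls from the P-values above lambda.
cutoff <- 0.01
lambda <- 0.5
pi0hat <- min(1, mean(pValues > lambda) / (1 - lambda))
nReject <- sum(pValues <= cutoff)
fdrhat <- min(1, pi0hat * ntest * cutoff / max(nReject, 1))

## (2) Fix an FDR level and let Benjamini-Hochberg choose the threshold
fdrLevel <- 0.05
bhReject <- p.adjust(pValues, method = "BH") <= fdrLevel
bhThreshold <- if (any(bhReject)) max(pValues[bhReject]) else 0

c(fdrhat = fdrhat, bhThreshold = bhThreshold, bhRejections = sum(bhReject))
```

In both cases the decision depends only on where a P-value falls in the ordered list: everything at or below the threshold is called significant.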

Regardless of the approach, these methods usually make an assumption that the rejection regions should be nested. In other words, if you call statistic $T_k$ significant and $T_j > T_k$, then your method should also call statistic $T_j$ significant. In the absence of extra information, this is a very reasonable assumption.

But in many situations you might have additional information you would like to use in the decision about whether to reject the null hypothesis for test $j$.

Example 1 A common example is gene-set analysis. Here you have a group of hypotheses that you have tested individually and you want to say something about the level of noise in the group. In this case, you might want to know something about the level of noise if you call the whole set interesting.

Example 2 Suppose you are a mysterious government agency and you want to identify potential terrorists. You observe some metadata on people and you want to predict who is a terrorist - say using betweenness centrality. You could calculate a P-value for each individual, say using a randomization test. Then estimate your FDR based on predictions using the metadata.

Example 3 You are monitoring a system over time where observations are random, say, for example, whether there is an outbreak of a particular disease in a particular region at a given time. So, is the rate of disease higher than background? How can you estimate the rate at which you make false claims?
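For a monitoring problem like Example 3, one way to get a P-value at each time point is a one-sided Poisson test against the background rate. The sketch below is only an illustration: the background rate of 5 cases per week, the outbreak window, and the outbreak rate are made-up numbers, and a real surveillance system would have to estimate the background rather than assume it.

```r
set.seed(7)
nweeks <- 104

## Assumption: known background rate of 5 cases per week
background <- 5

## Hypothetical outbreak: elevated rate during weeks 60 to 70
rate <- rep(background, nweeks)
rate[60:70] <- 15
counts <- rpois(nweeks, rate)

## One-sided P-value for each week: P(X >= observed count) when
## X is Poisson with the background rate
pValues <- ppois(counts - 1, lambda = background, lower.tail = FALSE)

## Weeks with the smallest P-values
head(order(pValues), 10)
```

These per-week P-values, together with the week index as a covariate, are the kind of input that a time-varying false discovery rate estimate would need.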

For now I'm going to focus on the estimation scenario but you could imagine using these estimates to try to develop controlling procedures as well.

In each of these cases you have a scenario where you are interested in something like:

$E\left[\frac{\# of False Positives}{\# of Rejections} \mid X = x\right] = fdr(x)$

where $fdr(x)$ is a covariate-specific estimator of the false discovery rate. Returning to our examples you could imagine:

Example 1

Example 2

Example 3

Where in the last case, we have parameterized the relationship between FDR and time with a flexible model like cubic splines.
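To give a flavor of what such regression models for $fdr(x)$ could look like, here is one purely illustrative way to write them. The specific parameterizations (an indicator for each of $K$ gene sets, a linear term in a single metadata covariate $z$, and a spline basis $s_1(t), \ldots, s_K(t)$ in time $t$) are assumptions made for this sketch, not necessarily the best or the intended forms:

$fdr(\text{gene set}) = \beta_0 + \sum_{\ell=1}^{K} \beta_\ell 1(\text{gene set} = \ell)$

$fdr(z) = \beta_0 + \beta_1 z$

$fdr(t) = \beta_0 + \sum_{\ell=1}^{K} \beta_\ell s_\ell(t)$

Since $fdr(x)$ is a rate, a link function that keeps the fitted values between 0 and 1 may be preferable in practice.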

The hard problem is fitting the regression models in Examples 1-3. Here I propose a basic estimator of the FDR regression model and leave it to others to be smart about it. Let's focus on P-values because they are the easiest to deal with. Suppose that we calculate the random variables $Y_i = 1(P_i > \lambda)$. Then:

$E[Y_i] = Prob(P_i > \lambda) = (1 - \lambda)\pi_0 + (1 - G(\lambda))(1 - \pi_0)$

where $\pi_0$ is the probability that a test is null (null P-values are uniform, so they exceed $\lambda$ with probability $1 - \lambda$) and $G(\lambda)$ is the cumulative distribution function for the P-values under the alternative hypothesis. This may be a mixture distribution. If we assume reasonably powered tests and that $\lambda$ is large enough, then $G(\lambda) \approx 1$. So

$E[Y_i] \approx (1 - \lambda)\pi_0$.

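One obvious choice is then to try to model

$E[Y_i \mid X_i = x] \approx (1 - \lambda)\pi_0(x)$

where $\pi_0(x)$ is the probability that a test with covariate value $x$ is null. We could, for example, use the model:

$E[Y_i \mid X_i = x] = f(x)$

where $f(x)$ is a linear model or spline, etc. Then we get the fitted values and calculate:

$\hat{\pi}_0(x) = \hat{E}[Y_i \mid X_i = x] / (1 - \lambda)$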

Here is a little simulated example where the goal is to estimate the probability of being a false positive as a smooth function of time.



    library(splines)

    ## Define the number of tests
    set.seed(1345)
    ntest <- 1000

    ## Set up the time vector and the probability of being null
    tme <- seq(-2, 2, length = ntest)
    pi0 <- pnorm(tme)

    ## Calculate a random variable indicating whether to draw
    ## the p-values from the null or alternative
    nullI <- rbinom(ntest, prob = pi0, size = 1) > 0

    ## Sample the null P-values from U(0,1) and the alternatives
    ## from a beta distribution
    pValues <- rep(NA, ntest)
    pValues[nullI] <- runif(sum(nullI))
    pValues[!nullI] <- rbeta(sum(!nullI), 1, 50)

    ## Set lambda and calculate the estimate.
    ## y indicates whether each p-value exceeds lambda; glm() with its
    ## default gaussian family is a least-squares fit of y on a
    ## natural spline basis in time
    lambda <- 0.8
    y <- pValues > lambda
    glm1 <- glm(y ~ ns(tme, df = 3))

    ## Get the estimated pi0 values
    pi0hat <- glm1$fitted / (1 - lambda)

    ## Plot the real versus fitted probabilities,
    ## with the identity line in grey for reference
    plot(pi0, pi0hat, col = "blue", type = "l", lwd = 3,
         xlab = "Real pi0", ylab = "Fitted pi0")
    abline(a = 0, b = 1, col = "grey", lwd = 3)


The result is this plot:

Real versus estimated false discovery rate when calling all tests significant.
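The estimate leans on the approximation $G(\lambda) \approx 1$, that is, on $E[Y_i] = (1 - \lambda)\pi_0 + (1 - G(\lambda))(1 - \pi_0)$ being close to $(1 - \lambda)\pi_0$. For the Beta(1, 50) alternative used in this simulation, that is easy to check numerically. The constant `pi0 = 0.7` and the Monte Carlo sample size below are arbitrary choices for the check, not part of the example above.

```r
lambda <- 0.8

## G(lambda) for the Beta(1, 50) alternative used in the simulation
Glambda <- pbeta(lambda, 1, 50)

## Monte Carlo check of the expectation for a constant pi0 (assumed 0.7)
set.seed(99)
n <- 1e5
pi0 <- 0.7
isNull <- rbinom(n, size = 1, prob = pi0) > 0
p <- ifelse(isNull, runif(n), rbeta(n, 1, 50))

meanY <- mean(p > lambda)
expectedY <- (1 - lambda) * pi0 + (1 - Glambda) * (1 - pi0)

c(simulated = meanY,
  formula = expectedY,
  approximation = (1 - lambda) * pi0)
```

With a weaker alternative (P-values less concentrated near zero) or a smaller $\lambda$, $G(\lambda)$ would fall further below 1 and the estimate of $\pi_0$ would be biased upward.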

This estimate is obviously not guaranteed to estimate the FDR well. The operating characteristics, both theoretical and empirical, need to be evaluated, and the other examples need to be fleshed out. But isn't the idea of FDR regression cool?
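As a first step toward the empirical side of that evaluation, the simulation can be wrapped in a function and repeated. This is a minimal sketch: `simulateError` is a hypothetical helper written for this illustration, the choice of $\lambda$ values and 50 replicates is arbitrary, and mean absolute error is just one possible summary. It uses `lm`, which gives the same least-squares fit as `glm` with its default gaussian family.

```r
library(splines)

## Hypothetical helper: simulate one data set as in the example and
## return the mean absolute error of the fitted pi0 curve
simulateError <- function(lambda, ntest = 1000) {
  tme <- seq(-2, 2, length = ntest)
  pi0 <- pnorm(tme)
  nullI <- rbinom(ntest, prob = pi0, size = 1) > 0

  pValues <- rep(NA, ntest)
  pValues[nullI] <- runif(sum(nullI))
  pValues[!nullI] <- rbeta(sum(!nullI), 1, 50)

  ## Least-squares fit of the indicator on a natural spline in time
  y <- as.numeric(pValues > lambda)
  fit <- lm(y ~ ns(tme, df = 3))
  pi0hat <- fitted(fit) / (1 - lambda)

  mean(abs(pi0hat - pi0))
}

## Average the error over 50 replicates for a few values of lambda
set.seed(1345)
lambdas <- c(0.2, 0.5, 0.8)
errors <- sapply(lambdas, function(l) mean(replicate(50, simulateError(l))))
names(errors) <- lambdas
errors
```

Comparing the errors across values of `lambda` gives a rough sense of the trade-off between the quality of the $G(\lambda) \approx 1$ approximation and the noise that comes from having fewer P-values above the threshold.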