Simply Statistics


Apple Music's Moment of Truth


Today is the day when Apple, Inc. learns whether its brand new streaming music service, Apple Music, is going to be a major contributor to the bottom line or just another streaming service (JASS?). Apple Music launched 3 months ago and all new users are offered a 3-month free trial. Today, that free trial ends and the big question is how many people will start to pay for their subscription, as opposed to simply canceling it. My guess is that most people (> 50%) will opt to pay, but that's a complete guess. For what it's worth, I'll be paying for my subscription. After adding all this music to my library, I'd hate to see it all go away.

Back on August 18, 2015, consumer market research firm MusicWatch released a study that claimed, among other things, that

Among people who had tried Apple Music, 48 percent reported they are not currently using the service.

This would suggest that almost half of people who had signed up for the free trial period of Apple Music were not interested in using it further and would likely not pay for it once the trial ended. If it were true, it would be a blow to the newly launched service.

But how did MusicWatch arrive at its number? It claimed to have surveyed 5,000 people in its study. Shortly before the survey by MusicWatch was released, Apple claimed that about 11 million people had signed up for their new Apple Music service (because the service had just launched, everyone who had signed up was in the free trial period). Clearly, 5,000 people do not make up the entire population, so we have but a small sample of users.

What question was MusicWatch trying to answer? It seems that they wanted to know the percentage of all people who had signed up for Apple Music that were still using the service. Can they make inference about the entire population from the sample of 5,000?

If the sample is representative and the individuals are independent, we could use the number 48% as an estimate of the percentage in the population who no longer use the service. The press release from MusicWatch did not indicate any measure of uncertainty, so we don't know how reliable the number is.
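Although the press release gives no measure of uncertainty, it is easy to get a rough sense of how small the purely sampling-based uncertainty would be. Here is a quick back-of-the-envelope calculation, assuming (hypothetically) a simple random sample of 5,000 independent respondents:

```r
## Rough margin of error for the MusicWatch estimate, assuming a simple
## random sample of independent respondents (a big assumption).
p_hat <- 0.48   # reported proportion no longer using the service
n     <- 5000   # reported survey size

se <- sqrt(p_hat * (1 - p_hat) / n)    # standard error of a proportion
ci <- p_hat + c(-1, 1) * 1.96 * se     # approximate 95% confidence interval

round(se, 4)    # about 0.0071
round(ci, 3)    # roughly 0.466 to 0.494
```

Under those assumptions, the 95% interval is only about plus or minus 1.4 percentage points, so sampling variability alone cannot move the estimate very far.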

Interestingly, soon after the MusicWatch survey was released, Apple released a statement to the publication The Verge, stating that 79% of users who had signed up were still using the service (i.e. only 21% had stopped using it, as opposed to 48% reported by MusicWatch). In other words, Apple just came out and gave us the truth! This was unusual because Apple typically does not make public statements about newly launched products. I just found this amusing because I've never been in a situation where I was trying to estimate a parameter and then someone later just told me what its value was.

If we believe that Apple and MusicWatch were measuring the same thing in their analyses (and it's not clear that they were), then it would suggest that MusicWatch's estimate of the population percentage (48%) was quite far off from the true value (21%). What would explain this large difference?

  1. Random variation. It's true that MusicWatch's survey was a small sample relative to the full population, but the sample was still big with 5,000 people. Furthermore, the analysis was fairly simple (just taking the proportion of users still using the service), so the uncertainty associated with that estimate is unlikely to be that large.
  2. Selection bias. Recall that it's not clear how MusicWatch sampled its respondents, but it's possible that the way that they did it led them to capture a set of respondents who were less inclined to use Apple Music. Beyond this, we can't really say more without knowing the details of the survey process.
  3. Respondents are not independent. It's possible that the survey respondents are not independent of each other. This would primarily affect the uncertainty about the estimate, making it larger than we might expect if the respondents were all independent (see the short calculation after this list). However, since we do not know what MusicWatch's uncertainty about their estimate was in the first place, it's difficult to tell whether dependence between respondents could play a role. Apple's number, of course, has no uncertainty.
  4. Measurement differences. This is the big one, in my opinion. We don't know how either MusicWatch or Apple defined "still using the service". You could imagine a variety of ways to determine whether a person was still using the service. You could ask "Have you used it in the last week?" or perhaps "Did you use it yesterday?" Responses to these questions would be quite different and would likely lead to different overall percentages of usage.
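For what it's worth, even dependence between respondents would have to be substantial to matter much here. A common back-of-the-envelope tool is the design effect, deff = 1 + (m - 1) * ICC, where m is the average cluster size and ICC is the within-cluster correlation; the values below are made up purely for illustration.

```r
## How much might (hypothetical) dependence between respondents inflate the
## uncertainty? Standard error under independence, then under a design effect.
p_hat <- 0.48
n     <- 5000
se    <- sqrt(p_hat * (1 - p_hat) / n)     # SE assuming independent respondents

m   <- 10      # hypothetical average cluster size (e.g., a shared panel source)
icc <- 0.05    # hypothetical within-cluster correlation

deff   <- 1 + (m - 1) * icc                # design-effect approximation
se_dep <- se * sqrt(deff)                  # inflated SE under this dependence
round(c(independent = se, dependent = se_dep), 4)   # about 0.0071 vs 0.0085
```

Even with that inflation, the uncertainty is nowhere near large enough to explain the gap between 48% and 21%.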

We Used Data to Improve our HarvardX Courses: New Versions Start Oct 15


You can sign up by following the links here

Last semester we successfully ran version 2 of my Data Analysis course. To create the second version, the first was split into eight courses. Over 2,000 students successfully completed the first of these, but, as expected, the numbers were lower for the more advanced courses. We wanted to remove any structural problems keeping students from maximizing what they get from our courses, so we studied the assessment questions data, which included completion rate and time, and used the findings to make improvements. We also used qualitative data from the discussion board. The major changes to version 3 are the following:

  • We no longer use R packages that Microsoft Windows users had trouble installing in the first course.
  • All courses are now designed to be completed in 4 weeks.
  • We added new assessment questions.
  • We improved the assessment questions determined to be problematic.
  • We split the two courses that students took the longest to complete into smaller modules. Students now have twice as much time to complete these.
  • We consolidated the case studies into one course.
  • We combined the materials from the statistics courses into a book, which you can download here. The material in the book matches the material taught in class, so you can use it to follow along.

You can enroll in any of the seven courses by following the links below. We will be on the discussion boards starting October 15, and we hope to see you there.

  1. Statistics and R for the Life Sciences starts October 15.
  2. Introduction to Linear Models and Matrix Algebra starts November 15.
  3. Statistical Inference and Modeling for High-throughput Experiments starts December 15.
  4. High-Dimensional Data Analysis starts January 15.
  5. Introduction to Bioconductor: Annotation and Analysis of Genomes and Genomic Assays starts February 15.
  6. High-performance Computing for Reproducible Genomics starts March 15.
  7. Case Studies in Functional Genomics starts April 15.

The landing page for the series continues to be here.


Data Analysis for the Life Sciences - a book completely written in R markdown


The book Data Analysis for the Life Sciences is now available on Leanpub.

Data analysis is now part of practically every research project in the life sciences. In this book we use data and computer code to teach the necessary statistical concepts and programming skills to become a data analyst. Following in the footsteps of Stat Labs, instead of showing theory first and then applying it to toy examples, we start with actual applications and describe the theory as it becomes necessary to solve specific challenges. We use simulations and data analysis examples to teach statistical concepts. The book includes links to computer code that readers can use to program along as they read the book.

It includes the following chapters: Inference, Exploratory Data Analysis, Robust Statistics, Matrix Algebra, Linear Models, Inference for High-Dimensional Data, Statistical Modeling, Distance and Dimension Reduction, Practical Machine Learning, and Batch Effects.

The text was completely written in R markdown and every section contains a link to the document that was used to create that section. This means that you can use knitr to reproduce any section of the book on your own computer. You can also access all these markdown documents directly from GitHub. Please send a pull request if you fix a typo or other mistake! For now we are keeping the R markdown files for the exercises private since they contain the solutions. But you can see the solutions if you take our online course quizzes. If we find that most readers want access to the solutions, we will open them up as well.
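For example, after downloading one of the section documents from the GitHub repository (the file name below is just a placeholder; use whichever .Rmd file you grabbed), a single call to knitr rebuilds that section locally:

```r
## Reproduce a section of the book on your own computer. Replace the
## placeholder file name with the .Rmd document you downloaded from GitHub.
library(knitr)
knit("inference_intro.Rmd")   # re-runs the code and writes inference_intro.md
```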

The material is based on the online courses I have been teaching with Mike Love. As we created the course, Mike and I wrote R markdown documents for the students and put them on GitHub. We then used Jekyll to create a webpage with html versions of the markdown documents. Jeff then convinced us to publish it on Leanpub. So we wrote a shell script that compiled the entire book into a Leanpub directory, and after countless hours of editing and tinkering we have a 450+ page book with over 200 exercises. The entire book compiles from scratch in about 20 minutes. We hope you like it.


The Leek group guide to writing your first paper


I have written guides on reviewing papers, sharing data,  and writing R packages. One thing I haven't touched on until now has been writing papers. Certainly for me, and I think for a lot of students, the hardest transition in graduate school is between taking classes and doing research.

There are several hard parts to this transition, including trying to find a problem, trying to find an advisor, and having a ton of unstructured time. One of the hardest things I've found is knowing (a) when to start writing your first paper and (b) how to do it. So I wrote a guide for students in my group on how to write your first paper.

It might be useful for other folks as well, so I put it up on GitHub. Just like the other guides I've written, this is a very opinionated (read: doesn't apply to everyone) guide. I would also appreciate any feedback/pull requests people have.


Not So Standard Deviations: The Podcast


I'm happy to announce that I've started a brand new podcast called Not So Standard Deviations with Hilary Parker at Etsy. Episode 1 "RCatLadies Origin Story" is available through SoundCloud. In this episode we talk about the origins of RCatLadies, evidence-based data analysis, my new book, and the Python vs. R debate.

You can subscribe to the podcast using the RSS feed from SoundCloud. We'll be getting it up on iTunes hopefully very soon.

Download the audio file.



Interview with COPSS Award Winner John Storey




Editor's Note: We are again pleased to interview the COPSS Presidents' Award winner. The COPSS Award is one of the most prestigious in statistics, sometimes called the Nobel Prize in statistics. This year the award went to John Storey, who also won the Mortimer Spiegelman Award for his outstanding contribution to public health statistics. This interview is a particular pleasure since John was my Ph.D. advisor and has been a major role model and incredibly supportive mentor for me throughout my career. He also did the whole interview in markdown and put it under version control on GitHub, so it is fully reproducible.

SimplyStats: Do you consider yourself to be a statistician, data scientist, machine learner, or something else?

JS: For the most part I consider myself to be a statistician, but I’m also very serious about genetics/genomics, data analysis, and computation. I was trained in statistics and genetics, primarily statistics. I was also exposed to a lot of machine learning during my training since Rob Tibshirani was my PhD advisor. However, I consider my research group to be a data science group. We have the Venn diagram reasonably well covered: experimentalists, programmers, data wranglers, and developers of theory and methods; biologists, computer scientists, and statisticians.

SimplyStats: How did you find out you had won the COPSS Presidents’ Award?

JS: I received a phone call from the chairperson of the awards committee while I was visiting the Department of Statistical Science at Duke University to give a seminar. It was during the seminar reception, and I stepped out into the hallway to take the call. It was really exciting to get the news!

SimplyStats: One of the areas where you have had a big impact is inference in massively parallel problems. How do you feel high-dimensional inference is different from more traditional statistical inference?

JS: My experience is that the most productive way to approach high-dimensional inference problems is to first think about a given problem in the scenario where the parameters of interest are random, and the joint distribution of these parameters is incorporated into the framework. In other words, I first gain an understanding of the problem in a Bayesian framework. Once this is well understood, it is sometimes possible to move in a more empirical and nonparametric direction. However, I have found that I can be most successful if my first results are in this Bayesian framework.

As an example, Theorem 1 from Storey (2003) Annals of Statistics was the first result I obtained in my work on false discovery rates. This paper first appeared as a technical report in early 2001, and the results spawned further work on a point estimation approach to false discovery rates, the local false discovery rate, q-value and its application to genomics, and a unified theoretical framework.

Besides false discovery rates, this approach has been useful in my work on the optimal discovery procedure as well as surrogate variable analysis (in particular, Desai and Storey 2012 for surrogate variable analysis). For high-dimensional inference problems, I have also found it is important to consider whether there are any plausible underlying causal relationships among variables, even if causal inference is not the goal. For example, causal model considerations provided some key guidance in a recent paper of ours on testing for genetic associations in the presence of arbitrary population structure. I think there is a lot of insight to be gained by considering what the appropriate approach to a high-dimensional inference problem is under different causal relationships among the variables.

SimplyStats: Do you have a process when you are tackling a hard problem or working with students on a hard problem?

JS: I like to work on statistics research that is aimed at answering a specific scientific problem (usually in genomics). My process is to try to understand the why in the problem as much as the how. The path to success is often found in the former. I try first to find solutions to research problems by using simple tools and ideas. I like to get my hands dirty with real data as early as possible in the process. I like to incorporate some theory into this process, but I prefer methods that work really well in practice over those that have beautiful theory justifying them without demonstrated success on real-world applications. In terms of what I do day-to-day, listening to music is integral to my process, for both concentration and creative inspiration: typically King Crimson or some variant of metal or ambient – which Simply Statistics co-founder Jeff Leek got to endure enjoy for years during his PhD in my lab.

SimplyStats: You are the founding Director of the Center for Statistics and Machine Learning at Princeton. What parts of the new gig are you most excited about?

JS: Princeton closed its Department of Statistics in the early 1980s. Because of this, the style of statistician and machine learner we have here today is one who’s comfortable being appointed in a field outside of statistics or machine learning. Examples include myself in genomics, Kosuke Imai in political science, Jianqing Fan in finance and economics, and Barbara Engelhardt in computer science. Nevertheless, statistics and machine learning here is strong, albeit too small at the moment (which will be changing soon). This is an interesting place to start, very different from most universities.

What I’m most excited about is that we get to answer the question: “What’s the best way to build a faculty, educate undergraduates, and create a PhD program starting now, focusing on the most important problems of today?”

For those who are interested, we’ll be releasing a public version of our strategic plan within about six months. We’re trying to do something unique and forward-thinking, which will hopefully make Princeton an influential member of the statistics, machine learning, and data science communities.

SimplyStats: You are organizing the Tukey conference at Princeton (to be held September 18, details here). Do you think Tukey’s influence will affect your vision for re-building statistics at Princeton?

JS: Absolutely, Tukey has been and will be a major influence in how we re-build. He made so many important contributions, and his approach was extremely forward-thinking and tied into real-world problems. I strongly encourage everyone to read Tukey’s 1962 paper titled The Future of Data Analysis. Here he is, looking 50 years into the future and foreseeing the rise of data science. This paper has truly amazing insights, including:

For a long time I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt.

All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.

Data analysis is a larger and more varied field than inference, or incisive procedures, or allocation.

By and large, the great innovations in statistics have not had correspondingly great effects upon data analysis. . . . Is it not time to seek out novelty in data analysis?

In this regard, another paper that has been influential in how we are re-building is Leo Breiman’s paper titled Statistical Modeling: The Two Cultures. We’re building something at Princeton that includes both cultures and seamlessly blends them into a bigger-picture community concerned with data-driven scientific discovery and technology development.

SimplyStats: What advice would you give young statisticians getting into the discipline now?

JS: My most general advice is don’t isolate yourself within statistics. Interact with and learn from other fields. Work on problems that are important to practitioners of science and technology development. I recommend that students master both “traditional statistics” and at least one of the following: (1) computational and algorithmic approaches to data analysis, especially those more frequently studied in machine learning or data science; (2) a substantive scientific area where data-driven discovery is extremely important (e.g., social sciences, economics, environmental sciences, genomics, neuroscience, etc.). I also recommend that students consider publishing in scientific journals or computer science conference proceedings, in addition to traditional statistics journals. I agree with a lot of the constructive advice and commentary given on the Simply Statistics blog, such as encouraging students to learn about reproducible research, problem-driven research, software development, improving data analyses in science, and outreach to non-statisticians. These things are very important for the future of statistics.


The Next National Library of Medicine Director Can Help Define the Future of Data Science


The main motivation for starting this blog was to share our enthusiasm about the increased importance of data and data analysis in science, industry, and society in general. Based on recent initiatives, such as BD2K, it is clear that the NIH is also enthusiastic and very much interested in supporting data science. For those who don't know, the National Institutes of Health (NIH) is the largest public funder of biomedical research in the world. This federal agency has an annual budget of about $30 billion.

The NIH has several institutes, each with its own budget and capability to guide funding decisions. Currently, the missions of most of these institutes relate to a specific disease or public health challenge.  Many of them fund research in statistics and computing because these topics are important components of achieving their specific mission. Currently, however, there is no institute directly tasked with supporting data science per se. This is about to change.

The National Library of Medicine (NLM) is one of the few NIH institutes that is not focused on a particular disease or public health challenge. Apart from the important task of maintaining an actual library, it supports, among many other initiatives, indispensable databases such as PubMed, GenBank and GEO. After over 30 years of successful service as NLM director, Dr. Donald Lindberg stepped down this year and, as is customary, an advisory board was formed to advise the NIH on what's next for NLM. One of the main recommendations of the report is the following:

NLM  should be the intellectual and programmatic epicenter for data science at NIH and stimulate its advancement throughout biomedical research and application.

Data science features prominently throughout the report, making it clear that the NIH is very much interested in further supporting this field. The next director can therefore have an enormous influence on the future of data science. So, if you love data, have administrative experience, and a vision about the future of data science as it relates to the medical and related sciences, consider this exciting opportunity.

Here is the ad.





Interview with Sherri Rose and Laura Hatfield



Sherri Rose and Laura Hatfield

Rose/Hatfield © Savannah Bergquist

Laura Hatfield and Sherri Rose are Assistant Professors specializing in biostatistics at Harvard Medical School in the Department of Health Care Policy. Laura received her PhD in Biostatistics from the University of Minnesota and Sherri completed her PhD in Biostatistics at UC Berkeley. They are developing novel statistical methods for health policy problems.

SimplyStats: Do you consider yourselves statisticians, data scientists, machine learners, or something else?

Rose: I’d definitely say a statistician. Even when I'm working on things that fall into the categories of data science or machine learning, there's underlying statistical theory guiding that process, be it for methods development or applications. Basically, there's a statistical foundation to everything I do.

Hatfield: When people ask what I do, I start by saying that I do research in health policy. Then I say I’m a statistician by training and I work with economists and physicians. People have mistaken ideas about what a statistician or professor does, so describing my context and work seems more informative. If I’m at a party, I usually wrap it up in a bow as, “I crunch numbers to study how Obamacare is working.” [laughs]


SimplyStats: What is the Health Policy Data Science Lab? How did you decide to start that?

Hatfield: We wanted to give our trainees a venue to promote their work and get feedback from their peers. And it helps me keep up on the cool projects Sherri and her students are working on.

Rose: This grew out of us starting to jointly mentor trainees. It's been a great way for us to make intellectual contributions to each other’s work through Lab meetings. Laura and I approach statistics from completely different frameworks, but work on related applications, so that's a unique structure for a lab.


SimplyStats: What kinds of problems are your groups working on these days? Are they mostly focused on health policy?

Rose: One of the fun things about working in health policy is that it is quite expansive. Statisticians can have an even bigger impact on science and public health if we take that next step: thinking about the policy implications of our research, and then about who needs to see the work in order to influence the relevant policies. A couple of projects I’m working on that demonstrate this breadth include a machine learning framework for risk adjustment in insurance plan payment and a new estimator for causal effects in a complex epidemiologic study of chronic disease. The first might be considered more obviously health policy, but the second will have important policy implications as well.

Hatfield: When I start an applied collaboration, I’m also thinking, “Where is the methods paper?” Most of my projects use messy observational data, so there is almost always a methods paper. For example, many studies here need to find a control group from an administrative data source. I’ve been keeping track of challenges in this process. One of our Lab students is working with me on a pathological case of a seemingly benign control group selection method gone bad. I love the creativity required in this work; my first 10 analysis ideas may turn out to be infeasible given the data, but that’s what makes this fun!


SimplyStats: What are some particular challenges of working with large health data?

Hatfield: When I first heard about the huge sample sizes, I was excited! Then I learned that data not collected for research purposes...

Rose: This was going to be my answer!

Hatfield: ...are very hard to use for research! In a recent project, I’ve been studying how giving people a tool to look up prices for medical services changes their health care spending. But the data set we have leaves out [painful pause] a lot of variables we’d like to use for control group selection and... a lot of the prices. But as I said, these gaps in the data are begging to be filled by new methods.

Rose: I think the fact that we have similar answers is important. I’ve repeatedly seen “big data” not have a strong signal for the research question, since they weren’t collected for that purpose. It’s easy to get excited about thousands of covariates in an electronic health record, but so much of it is noise, and then you end up with an R² of 10%. It can be difficult enough to generate an effective prediction function, even with innovative tools, let alone try to address causal inference questions. It goes back to basics: what’s the research question and how can we translate that into a statistical problem we can answer given the limitations of the data.

SimplyStats: You both have very strong data science skills but are in academic positions. Do you have any advice for students considering the tradeoff between academia and industry?

Hatfield: I think there is more variance within academia and within industry than between the two.

Rose: Really? That’s surprising to me...

Hatfield: I had stereotypes about academic jobs, but my current job defies those.

Rose: What if a larger component of your research platform included programming tools and R packages? My immediate thought was about computing and its role in academia. Statisticians in genomics have navigated this better than some other areas. It can surely be done, but there are still challenges folding that into an academic career.

Hatfield: I think academia imposes few restrictions on what you can disseminate compared to industry, where there may be more privacy and intellectual property concerns. But I take your point that R packages do not impress most tenure and promotion committees.

Rose: You want to find a good match between how you like spending your time and what’s rewarded. Not all academic jobs are the same and not all industry jobs are alike either. I wrote a more detailed guest post on this topic for Simply Statistics.

Hatfield: I totally agree you should think about how you’d actually spend your time in any job you’re considering, rather than relying on broad ideas about industry versus academia. Do you love writing? Do you love coding? etc.


SimplyStats: You are both adopters of social media as a mechanism of disseminating your work and interacting with the community. What do you think of social media as a scientific communication tool? Do you find it is enhancing your careers?

Hatfield: Sherri is my social media mentor!

Rose: I think social media can be a useful tool for networking, finding and sharing neat articles and news, and putting your research out there to a broader audience. I’ve definitely received speaking invitations and started collaborations because people initially “knew me from Twitter.” It’s become a way to recruit students as well. Prospective students are more likely to “know me” from a guest post or Twitter than traditional academic products, like journal articles.

Hatfield: I’m grateful for our Lab’s new Twitter because it’s a purely academic account. My personal account has been awkwardly transitioning to include professional content; I still tweet silly things there.

Rose: My timeline might have a cat picture or two.

Hatfield: My very favorite thing about academic Twitter is discovering things I wouldn’t have even known to search for, especially packages and tricks in R. For example, that’s how I got converted to tidy data and dplyr.

Rose: I agree. I think it’s a fantastic place to become exposed to work that’s incredibly related to your own but in another field, and you wouldn’t otherwise find it preparing a typical statistics literature review.


SimplyStats: What would you change in the statistics community?

Rose: Mentoring. I was tremendously lucky to receive incredible mentoring as a graduate student and now as a new faculty member. Not everyone gets this, and trainees don’t know where to find guidance. I’ve actively reached out to trainees during conferences and university visits, erring on the side of offering too much unsolicited help, because I feel there’s a need for that. I also have a resources page on my website that I continue to update. I wish I had a more global solution beyond encouraging statisticians to take an active role in mentoring, and not only their own trainees. We shouldn’t lose good people because they didn’t get the support they needed.

Hatfield: I think we could make conferences much better! Being in the same physical space at the same time is very precious. I would like to take better advantage of that at big meetings to do work that requires face time. Talks are not an example of this. Workshops and hackathons and panels and working groups -- these all make better use of face-to-face time. And are a lot more fun!



If you ask different questions you get different answers - one more way science isn't broken, it's just really hard


If you haven't already read the amazing piece by Christie Aschwanden on why Science isn't Broken you should do so immediately. It does an amazing job of capturing the nuance of statistics as applied to real data sets and how that can be misconstrued as science being "broken" without falling for the easy "everything is wrong" meme.

One thing that caught my eye was how the piece highlighted a crowd-sourced data analysis of soccer red cards. The key figure for that analysis is this one:

[Figure from the FiveThirtyEight piece: the effect estimate reported by each analysis team, alongside a description of how each team did its analysis]
I think the figure and the underlying data are fascinating in that they really highlight the human behavioral variation in data analysis. You can even see some data analysis subcultures emerging from the descriptions of how people did the analysis and whether or not they justified their use of covariates.

One subtlety of the figure that I missed on the original reading is that not all of the estimates being reported are measuring the same thing. For example, if some groups adjusted for the country of origin of the referees and some did not, then the estimates for those two groups are measuring different things (the association conditional on country of origin or not, respectively). In this case the estimates may be different, but entirely consistent with each other, since they are just measuring different things.

If you ask two people to do the analysis with only the simple question, "Are referees more likely to give red cards to dark-skinned players?", then you may get different answers based on those two estimates. But the reality is that the answers the analysts are reporting are actually answers to the questions:

  1. Are referees more likely to give red cards to dark-skinned players, holding country of origin fixed?
  2. Are referees more likely to give red cards to dark-skinned players, averaging over country of origin (and everything else)?

The subtlety lies in the fact that changes to covariates in the analysis are actually changing the hypothesis you are studying.
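To make that concrete, here is a small simulated illustration (entirely made-up numbers, not the actual red-card data). Skin tone and referee country are associated, and country also affects the red-card rate, so the adjusted and unadjusted logistic regressions give different coefficients for skin tone, yet each is a sensible answer to its own question.

```r
## Toy simulation: "country" is associated with both skin tone and the
## red-card rate, so adjusting for it (or not) changes the estimand.
set.seed(1)
n       <- 50000
country <- rbinom(n, 1, 0.5)                    # referee country indicator
dark    <- rbinom(n, 1, 0.3 + 0.3 * country)    # skin tone, associated with country
red     <- rbinom(n, 1, plogis(-3 + 0.2 * dark + 0.8 * country))

marginal    <- glm(red ~ dark, family = binomial)
conditional <- glm(red ~ dark + country, family = binomial)

coef(marginal)["dark"]      # association averaging over country (larger)
coef(conditional)["dark"]   # association holding country fixed (near 0.2)
```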

So in fact the conclusions in that figure may all be entirely consistent after you condition on asking the same question. I'd be interested to see the same plot, but only for the groups that conditioned on the same set of covariates, for example. This is just one more reason that science is really hard and why I'm so impressed at how well the FiveThirtyEight piece captured this nuance.




P > 0.05? I can make any p-value statistically significant with adaptive FDR procedures


Everyone knows now that you have to correct for multiple testing when you calculate many p-values; otherwise this can happen:


One of the most popular ways to correct for multiple testing is to estimate or control the false discovery rate. The false discovery rate attempts to quantify the fraction of the discoveries made that are false. If we call all p-values less than some threshold t significant, then, borrowing notation from this great introduction to false discovery rates,

F(t) = #{null p-values ≤ t}    and    S(t) = #{p-values ≤ t}
So F(t) is the (unknown) total number of null hypotheses called significant and S(t) is the total number of hypotheses called significant. The FDR is the expected ratio of these two quantities, which, under certain assumptions, can be approximated by the ratio of the expectations:

FDR(t) = E[ F(t) / S(t) ] ≈ E[F(t)] / E[S(t)]
To get an estimate of the FDR we just need estimates for E[F(t)] and E[S(t)]. The latter is easy to estimate: it is just the total number of rejections (the number of p-values less than t). If you assume the null p-values follow their expected uniform distribution, then E[F(t)] can be approximated by the fraction of null hypotheses, multiplied by the total number of hypotheses, multiplied by t. To do this, we need an estimate of \pi_0, the proportion of null hypotheses. There are a large number of ways to estimate this quantity, but it is almost always estimated using the full distribution of computed p-values in an experiment. The most popular estimator compares the number of p-values greater than some cutoff to the number you would expect if every single hypothesis were null; this ratio is approximately the fraction of null hypotheses.
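In code, a bare-bones version of that estimator (a single fixed cutoff, lambda = 0.5, rather than the smoothed or tuned versions used in practice) would look something like this:

```r
## Bare-bones estimate of pi_0, the proportion of null hypotheses: null
## p-values are uniform, so a fraction (1 - lambda) of them should land above
## lambda; rescaling the observed fraction above lambda estimates pi_0.
estimate_pi0 <- function(p, lambda = 0.5) {
  min(1, mean(p > lambda) / (1 - lambda))
}
```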

Combining the above equation with our estimates for E[F(t)] and E[S(t)], we get:

\hat{FDR}(t) = \hat{\pi}_0 * m * t / S(t)

where m is the total number of hypotheses tested.
The q-value is a multiple testing analog of the p-value and is defined as:

q(p) = min_{t ≥ p} \hat{FDR}(t)
This is, of course, a very loose description, and you can get a more technical one here. But the main thing to notice is that the q-value depends on the estimated proportion of null hypotheses, which in turn depends on the distribution of the observed p-values. The smaller the estimated fraction of null hypotheses, the smaller the FDR estimate and the smaller the q-value. This suggests a way to make any p-value significant by altering its "testing partners". Here is a quick example. Suppose that we have done a test and have a p-value of 0.8. Not super significant. Suppose we perform this test in conjunction with a number of hypotheses that are null, generating a p-value distribution like this:

[Histogram: p-values spread roughly uniformly between 0 and 1, as expected when the other hypotheses are null]
Then you get a q-value greater than 0.99, as you would expect. But if you test that exact same p-value with a ton of other non-null hypotheses that generate tiny p-values, in a distribution that looks like this:

[Histogram: p-values piled up near 0]
Then you get a q-value of 0.0001 for that same p-value of 0.8. The reason is that the estimate of the fraction of null hypotheses goes essentially to zero, which drives down the q-value. You can do this with any p-value: if you make its testing partners have sufficiently low p-values, then the q-value will be as small as you like.
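To see this numerically, here is a minimal sketch of the demonstration. It uses my own stripped-down adaptive q-value calculation rather than the code in the gist linked below, so the exact numbers will differ from the ones quoted above, but the same reversal shows up.

```r
## A simplified adaptive q-value: estimate pi_0 from the p-value distribution,
## then plug it into the FDR estimate at each threshold.
simple_qvalue <- function(p, lambda = 0.5) {
  m   <- length(p)
  pi0 <- min(1, mean(p > lambda) / (1 - lambda))  # crude estimate of the fraction of nulls
  o   <- order(p)
  fdr <- pi0 * m * p[o] / seq_len(m)              # \hat{FDR}(t) = pi0 * m * t / #{p <= t}
  q   <- pmin(1, rev(cummin(rev(fdr))))           # q(p) = min of \hat{FDR}(t) over t >= p
  q[order(o)]                                     # return in the original order
}

set.seed(1)
p_target <- 0.8

## The same p-value of 0.8 with 1,000 null "testing partners" (uniform p-values)
null_partners <- runif(1000)
simple_qvalue(c(p_target, null_partners))[1]    # close to 1

## ...and with 1,000 partners that generate tiny p-values
signal_partners <- rbeta(1000, 1, 1000)
simple_qvalue(c(p_target, signal_partners))[1]  # very small
```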

A couple of things to note:

  • Obviously doing this on purpose to change the significance of a calculated p-value is cheating and shouldn't be done.
  • For correctly calculated p-values on a related set of hypotheses this is actually a sensible property to have - if you have almost all very small p-values and one very large p-value, you are doing a set of tests where almost everything appears to be alternative and you should weight that in some sensible way.
  • This is the reason that sometimes a "multiple testing adjusted" p-value (or q-value) is smaller than the p-value itself.
  • This doesn't affect non-adaptive FDR procedures - but those procedures still depend on the "testing partners" of any p-value through the total number of tests performed. This is why people talk about the so-called "multiple testing burden". But that is a subject for a future post. It is also the reason non-adaptive procedures can be severely underpowered compared to adaptive procedures when the p-values are correct.
  • I've appended the code to generate the histograms and calculate the q-values in this post in the following gist.