

The statistics identity crisis: am I really a data scientist?






Tl;dr: We will host a Google Hangout of our popular JSM session on October 30th from 2-4 PM EST.


I organized a session at JSM 2015 called "The statistics identity crisis: am I really a data scientist?" The session turned out to be pretty popular:

but it turns out not everyone fit in the room:

Thankfully, Steve Pierson at the ASA had the awesome idea to re-run the session for people who couldn't be there. So we will be hosting a Google Hangout with the following talks:

  • 'Am I a Data Scientist?': The Applied Statistics Student's Identity Crisis (Alyssa Frazee, Stripe)
  • How Industry Views Data Science Education in Statistics Departments (Chris Volinsky, AT&T)
  • Evaluating Data Science Contributions in Teaching and Research (Lance Waller, Emory University)
  • Teach Data Science and They Will Come (Jennifer Bryan, The University of British Columbia)

You can watch it on YouTube or Google Plus. Here is the link:

The session will be held October 30th (tomorrow!) from 2-4 PM EST. You can watch it live and discuss the talks using the hashtag #JSM2015, or you can watch later as the video will remain on YouTube.


We need a statistically rigorous and scientifically meaningful definition of replication

Replication and confirmation are indispensable concepts that help define scientific facts. However, the way in which we reach scientific consensus on a given finding is rather complex. Although some press releases try to convince us otherwise, rarely is one publication enough. In fact, most published results go unnoticed and no attempts to replicate them are made. These are not debunked either; they simply get discarded to the dustbin of history. The very few results that garner enough attention for others to spend time and energy on them are assessed by an ad-hoc process involving a community of peers. The assessments are usually a combination of deductive reasoning, direct attempts at replication, and indirect checks obtained by attempting to build on the result in question. This process eventually leads to a result either being accepted by consensus or not. For particularly important cases, an official scientific consensus report may be commissioned by a national academy or an established scientific society. Examples of results that have become part of the scientific consensus in this way include smoking causing lung cancer, HIV causing AIDS, and climate change being caused by humans. In contrast, the published result that vaccines cause autism has been thoroughly debunked by several follow-up studies. In none of these four cases was a simple definition of replication used to confirm or falsify a result. The same is true for most results for which there is consensus. Yet science moves on, and continues to be an incomparable force at improving our quality of life.

Regulatory agencies, such as the FDA, are an exception since they clearly spell out a definition of replication. For example, to approve a drug they may require two independent, adequately powered clinical trials to show statistical significance at some predetermined level. They also require a large enough effect size to justify the cost and potential risks associated with treatment. This is not to say that FDA approval is equivalent to scientific consensus, but they do provide a clear-cut definition of replication.

In response to a growing concern over a reproducibility crisis, projects such as the Open Science Collaboration have begun systematically trying to replicate published results. In a recent post, Jeff described one of their recent papers on estimating the reproducibility of psychological science (they really mean replicability; see note below). This Science paper led to lay press reports with eye-catching headlines such as “only 36% of psychology experiments replicate”. Note that the 36% figure comes from a definition of replication that mimics the definition used by regulatory agencies: results are considered replicated if a p-value < 0.05 was reached in both the original study and the replication. Unfortunately, this definition ignores both effect size and statistical power. If power is not controlled, then the expected proportion of correct findings that replicate can be quite small. For example, if I try to replicate the smoking-causes-lung-cancer result with a sample size of 5, there is a good chance it will not replicate. In his post, Jeff notes that for several of the studies that did not replicate, the 95% confidence intervals intersected. So should intersecting confidence intervals be our definition of replication? This too has a flaw since it favors imprecise studies with very large confidence intervals. If effect size is ignored, we may waste our time trying to replicate studies reporting practically meaningless findings. Defining replication for published studies in general is not as easy as for highly controlled clinical trials. However, one clear improvement from what is currently being done is to consider statistical power and effect sizes.
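
To make the point about power concrete, here is a quick simulation sketch (my own illustration, not an analysis from any of the papers discussed) assuming a modest but real effect of half a standard deviation:

```r
## A quick simulation (not from the studies discussed) of the
## "p < 0.05 in both studies" replication rule when the effect is real.
## Assumes a true difference in means of 0.5 standard deviations.
set.seed(2015)

both_significant <- function(n, effect = 0.5, nsim = 10000) {
  p_orig <- replicate(nsim, t.test(rnorm(n, mean = effect), rnorm(n))$p.value)
  p_rep  <- replicate(nsim, t.test(rnorm(n, mean = effect), rnorm(n))$p.value)
  mean(p_orig < 0.05 & p_rep < 0.05)   # proportion "replicating" under this rule
}

both_significant(n = 5)    # tiny studies: a true effect rarely "replicates"
both_significant(n = 100)  # well-powered studies: it usually does
```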

To further illustrate this, let's consider a very concrete example with real-life consequences. Imagine a loved one has a disease with a high mortality rate and asks for your help in evaluating the scientific evidence on treatments. Four experimental drugs are available, all with promising clinical trials resulting in p-values < 0.05. However, a replication project redoes the experiments and finds that only the drug A and drug B studies replicate (p < 0.05). So which drug do you take? Let's give a bit more information to help you decide. Here are the p-values for both the original and replication trials:

Drug   Original p-value   Replication p-value   Replicated (p < 0.05)?
A      0.0001             0.001                 Yes
B      < 0.000001         0.03                  Yes
C      0.03               0.06                  No
D      < 0.000001         0.10                  No

Which drug would you take now? The information I have provided is based on p-values and is therefore missing a key piece of information: the effect sizes. Below I show the confidence intervals for the four original studies (left) and the four replication studies (right). Note that except for drug B, all confidence intervals intersect. In light of the figure below, which one would you choose?


I would be inclined to go with drug D because it has a large effect size, a small p-value, and the replication experiment's effect estimate fell inside a 95% confidence interval. I would definitely not go with A since it provides marginal benefits, even if the trial found a statistically significant effect and was replicated. So the p-value-based definition of replication is worthless from a practical standpoint.

It seems that before continuing the debate over replication, and certainly before declaring that we are in a reproducibility crisis, we need a statistically rigorous and scientifically meaningful definition of replication.  This definition does not necessarily need to be dichotomous (replicated or not) and it will probably require more than one replication experiment and more than one summary statistic: one for effect size and one for uncertainty. In the meantime, we should be careful not to dismiss the current scientific process, which seems to be working rather well at either ignoring or debunking false positive results while producing useful knowledge and discovery.

Footnote on reproducibility versus replication: As Jeff pointed out, the cited Open Science Collaboration paper is about replication, not reproducibility. A study is considered reproducible if an independent researcher can recreate the tables and figures from the original raw data. Replication is not nearly as simple to define because it involves probability. To replicate the experiment, it has to be performed again, with a different random sample and a new set of measurement errors.


Theranos runs head first into the realities of diagnostic testing

The Wall Street Journal has published a lengthy investigation into the diagnostic testing company Theranos.

The company offers more than 240 tests, ranging from cholesterol to cancer. It claims its technology can work with just a finger prick. Investors have poured more than $400 million into Theranos, valuing it at $9 billion and her majority stake at more than half that. The 31-year-old Ms. Holmes’s bold talk and black turtlenecks draw comparisons to Apple Inc. cofounder Steve Jobs.

If ever there were a warning sign, the comparison to Steve Jobs has got to be it.

But Theranos has struggled behind the scenes to turn the excitement over its technology into reality. At the end of 2014, the lab instrument developed as the linchpin of its strategy handled just a small fraction of the tests then sold to consumers, according to four former employees.

One former senior employee says Theranos was routinely using the device, named Edison after the prolific inventor, for only 15 tests in December 2014. Some employees were leery about the machine’s accuracy, according to the former employees and emails reviewed by The Wall Street Journal.

In a complaint to regulators, one Theranos employee accused the company of failing to report test results that raised questions about the precision of the Edison system. Such a failure could be a violation of federal rules for laboratories, the former employee said.

With these kinds of stories, it's always hard to tell whether there's reality here or it's just a bunch of axe grinding. But one thing that's for sure is that people are talking, and probably not for good reasons.

Minimal R Package Check List

A little while back I had the pleasure of flying in a small Cessna with a friend and for the first time I got to see what happens in the cockpit with a real pilot. One thing I noticed was that basically you don't lift a finger without going through some sort of check list. This starts before you even roll the airplane out of the hangar. It makes sense because flying is a pretty dangerous hobby and you want to prevent problems from occurring when you're in the air.

That experience got me thinking about what might be the minimal check list for building an R package, a somewhat less dangerous hobby. First off, much has changed (for the better) since I started making R packages, and I wanted to have some clean documentation of the process, particularly with using RStudio's tools. So I wiped my installations of both R and RStudio and started from scratch to see what it would take to get someone to build their first R package.

The list is basically a "pre-flight" list---the presumption here is that you actually know the important details of building packages, but need to make sure that your environment is set up correctly so that you don't run into errors or problems. I find this is often a problem for me when teaching students to build packages because I focus on the details of actually making the packages (i.e. DESCRIPTION files, Roxygen, etc.) and forget that, way back when, I actually configured my environment to do this.

Pre-flight Procedures for R Packages

  1. Install most recent version of R
  2. Install most recent version of RStudio
  3. Open RStudio
  4. Install devtools package
  5. Click on Project --> New Project... --> New Directory --> R package
  6. Enter package name
  7. Delete boilerplate code and "hello.R" file
  8. Go to the "man" directory and delete the "hello.Rd" file
  9. In the File browser, click on the package name to go to the top level directory
  10. Click "Build" tab in environment browser
  11. Click "Configure Build Tools..."
  12. Check "Generate documentation with Roxygen"
  13. Check "Build & Reload" when Roxygen Options window opens --> Click OK
  14. Click OK in Project Options window

At this point, you're clear to build your package, which obviously involves writing R code, Roxygen documentation, writing package metadata, and building/checking your package.
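
For those who prefer to run the equivalent pre-flight steps from the R console, here is a rough sketch using devtools (my shorthand for parts of the checklist, not a replacement for it; the package name "mypackage" is just a placeholder):

```r
## Console-side sketch of the pre-flight steps (package name is a placeholder)
install.packages("devtools")   # step 4: install devtools from CRAN
library(devtools)

has_devel()                    # confirms a working build toolchain (e.g. Rtools on Windows)

create("mypackage")            # steps 5-6: generate a package skeleton in ./mypackage
                               # (delete any boilerplate R/hello.R and man/hello.Rd it adds)

## once you have written code and Roxygen comments:
document("mypackage")          # regenerate man/ pages and NAMESPACE via Roxygen
check("mypackage")             # the console equivalent of a full pre-flight test: R CMD check
```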

If I'm missing a step or have too many steps, I'd like to hear about it. But I think this is the minimum number of steps you need to configure your environment for building R packages in RStudio.

UPDATE: I've made some changes to the check list and will be posting future updates/modifications to my GitHub repository.


Profile of Data Scientist Shannon Cebron

The "This is Statistics" campaign has a nice profile of Shannon Cebron, a data scientist working at the Baltimore-based Pegged Software.

What advice would you give to someone thinking of a career in data science?

Take some advanced statistics courses if you want to see what it’s like to be a statistician or data scientist. By that point, you’ll be familiar with enough statistical methods to begin solving real-world problems and understanding the power of statistical science.  I didn’t realize I wanted to be a data scientist until I took more advanced statistics courses, around my third year as an undergraduate math major.


A glass half full interpretation of the replicability of psychological science

tl;dr: 77% of replication effects from the psychology replication study were in (or above) the 95% prediction interval based on the original effect size. This isn't perfect and suggests (a) there is still room for improvement, (b) the scientists who did the replication study are pretty awesome at replicating, (c) we need a better definition of replication that respects uncertainty but (d) the scientific sky isn't falling. We wrote this up in a paper on arxiv; the code is here. 

A week or two ago a paper came out in Science on Estimating the reproducibility of psychological science. The basic idea behind the study was to take a sample of studies that appeared in a particular journal in 2008 and try to replicate each of them. Here I'm using the definition that reproducibility is the ability to recalculate all results given the raw data and code from a study, and replicability is the ability to re-do the study and get a consistent result.

The paper is pretty incredible and the authors did an amazing job of going back to the original sources and trying to be faithful to the original study designs. I have to admit that when I first heard about the study design I was incredibly pessimistic about the results (I suppose grouchy is a natural default state for many statisticians, especially those with sleep deprivation). I mean, 2008 was well before the push toward reproducibility had really taken off (Biostatistics was one of the first journals to adopt a policy on reproducible research, and that didn't happen until 2009). More importantly, the student researchers from those studies had possibly moved on, study populations may have changed, and there could be any number of minor variations in the study design, and so forth. I thought the chances of getting any effects in the same range were probably pretty low.

So when the results were published I was pleasantly surprised. I wasn’t the only one:

But that was definitely not the prevailing impression that the paper left on social and mass media. A lot of the discussion around the paper focused on the idea that only 36% of the studies had a p-value less than 0.05 in both the original and replication study. But many of the sample sizes were small and the effects were modest. So the first question I asked myself was, "Well what would we expect to happen if we replicated these studies?" The original paper measured replicability in several ways and tried hard to calibrate expected coverage of confidence intervals for the measured effects.

Roger, Prasad, and I tried a slightly different approach. We estimated the 95% prediction interval for the replication effect given the original effect size.



72% of the replication effects were within the 95% prediction interval and 2 were above the interval (they showed a stronger signal in the replication than predicted from the original study). This definitely shows that there is still room for improvement in the replication of these studies - we would expect 95% of the effects to fall into the 95% prediction interval. But my opinion, at least, is that 72% (or 77% if you count the 2 above the prediction interval) of studies falling in the prediction interval is (a) not bad and (b) a testament to the authors of the reproducibility paper and their efforts to get the studies right.
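
For concreteness, here is a small sketch of the idea (a simplified version written for this post, not the exact code from our paper; see the arxiv link and code repository above for that). For correlation effects, a standard approach is to work on the Fisher z scale and account for sampling variation in both the original and replication studies; the inputs below are hypothetical:

```r
## Sketch: 95% prediction interval for a replication correlation, given the
## original correlation and both sample sizes. Numbers below are hypothetical.
replication_pi <- function(r_orig, n_orig, n_rep, level = 0.95) {
  z_orig <- atanh(r_orig)                             # Fisher z-transform
  se     <- sqrt(1 / (n_orig - 3) + 1 / (n_rep - 3))  # sd of z_rep - z_orig
  z_crit <- qnorm(1 - (1 - level) / 2)
  tanh(z_orig + c(-1, 1) * z_crit * se)               # back to the correlation scale
}

replication_pi(r_orig = 0.30, n_orig = 40, n_rep = 80)
## A replication estimate inside this interval is consistent with the original
## effect once sampling variation in BOTH studies is taken into account.
```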

An important point here is that replication and reproducibility aren't the same thing. When reproducing a study we expect the numbers and figures to be exactly the same. But a replication involves recollection of data and is subject to variation, and so we don't expect the answer to be exactly the same in the replication. This is of course made more confusing by regression to the mean, publication bias, and the garden of forking paths. Our use of a prediction interval measures both the variation expected in the original study and in the replication. One thing we noticed when re-analyzing the data is how many of the studies had very low sample sizes.

[Figure: sample sizes in the original and replication studies]


Sample sizes were generally bigger in the replications, but often still very low. This makes it more difficult to disentangle what didn't replicate from what is just expected variation for a small sample size study. Whether those small studies should be trusted at all is a separate question, but for the purposes of measuring replication the small sample sizes make the problem more difficult.

One thing I have been thinking about a lot, and that this study drove home, is that if we are measuring replication we need a definition that incorporates uncertainty directly. Suppose that you collect a data set D0 from an original study and D1 from a replication. Then the study replicates if D0 ~ F and D1 ~ F. Informally, if the data are generated from the same distribution in both experiments then the study replicates. To get an estimate, you apply a pipeline to the data set to get an estimate e0 = p(D0). If the study is also reproducible then p() is the same for both studies and p(D0) ~ G and p(D1) ~ G, subject to some conditions on p().

One interesting consequence of this definition is that each complete replication data set represents only a single data point for measuring replication. To measure replication with this definition you either need to make assumptions about the data generating distribution for D0 and D1, or you need to perform a complete replication of a study many times to determine if it replicates. However, it does mean that we can define replication even for studies with a very small number of replicates, since the data generating distribution may be arbitrarily variable in each case.
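
Here is a toy illustration of the definition (my own sketch, not an analysis from the post): both data sets are drawn from the same distribution F, the pipeline p() is just the sample mean, and a single (D0, D1) pair provides only one data point about replication, so judging it requires either assumptions about F or many complete replications:

```r
## Toy version of the definition: D0 ~ F and D1 ~ F, with pipeline p() = sample mean
set.seed(1)
draw_F <- function(n) rnorm(n, mean = 0.3, sd = 1)   # stand-in for the common distribution F

D0 <- draw_F(20)   # original study
D1 <- draw_F(20)   # replication
p  <- mean         # the analysis pipeline p()
p(D0) - p(D1)      # one (D0, D1) pair gives a single data point about replication

## Seeing the expected variation requires many replication pairs (or assumptions about F):
diffs <- replicate(5000, p(draw_F(20)) - p(draw_F(20)))
quantile(diffs, c(0.025, 0.975))  # spread of differences even when the study truly replicates
```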

Regardless of this definition, I was excited that the OSF folks did the study and pulled it off as well as they did, and I was a bit bummed about the most common reaction. I think there is an easy narrative that "science is broken", which I think isn't a positive thing for a number of reasons. I love the way that {reproducibility/replicability/open science/open publication} are becoming more and more common, but I often think we fall into the same trap of wanting to report these results in as clear-cut a way as we do when reporting exaggerations or oversimplifications of scientific discoveries in headlines. I'm excited to see how these kinds of studies look in 10 years when GitHub/open science/pre-prints/etc. are all the standards.


Apple Music's Moment of Truth

Today is the day when Apple, Inc. learns whether its brand new streaming music service, Apple Music, is going to be a major contributor to the bottom line or just another streaming service (JASS?). Apple Music launched 3 months ago and all new users are offered a 3-month free trial. Today, that free trial ends and the big question is how many people will start to pay for their subscription, as opposed to simply canceling it. My guess is that most people (> 50%) will opt to pay, but that's a complete guess. For what it's worth, I'll be paying for my subscription. After adding all this music to my library, I'd hate to see it all go away.

Back on August 18, 2015, consumer market research firm MusicWatch released a study that claimed, among other things, that

Among people who had tried Apple Music, 48 percent reported they are not currently using the service.

This would suggest that almost half of people who had signed up for the free trial period of Apple Music were not interested in using it further and would likely not pay for it once the trial ended. If it were true, it would be a blow to the newly launched service.

But how did MusicWatch arrive at its number? It claimed to have surveyed 5,000 people in its study. Shortly before the survey by MusicWatch was released, Apple claimed that about 11 million people had signed up for their new Apple Music service (because the service had just launched, everyone who had signed up was in the free trial period). Clearly, 5,000 people do not make up the entire population, so we have but a small sample of users.

What was the target that MusicWatch was trying to estimate? It seems that they wanted to know the percentage of all people who had signed up for Apple Music who were still using the service. Can they make inference about the entire population from the sample of 5,000?

If the sample is representative and the individuals are independent, we could use the number 48% as an estimate of the percentage in the population who no longer use the service. The press release from MusicWatch did not indicate any measure of uncertainty, so we don't know how reliable the number is.

Interestingly, soon after the MusicWatch survey was released, Apple released a statement to the publication The Verge, stating that 79% of users who had signed up were still using the service (i.e. only 21% had stopped using it, as opposed to 48% reported by MusicWatch). In other words, Apple just came out and gave us the truth! This was unusual because Apple typically does not make public statements about newly launched products. I just found this amusing because I've never been in a situation where I was trying to estimate a parameter and then someone later just told me what its value was.

If we believe that Apple and MusicWatch were measuring the same thing in their analyses (and it's not clear that they were), then it would suggest that MusicWatch's estimate of the population percentage (48%) was quite far off from the true value (21%). What would explain this large difference?

  1. Random variation. It's true that MusicWatch's survey was a small sample relative to the full population, but with 5,000 people the sample was still big. Furthermore, the analysis was fairly simple (just taking the proportion of users still using the service), so the uncertainty associated with that estimate is unlikely to be that large (see the quick calculation after this list).
  2. Selection bias. Recall that it's not clear how MusicWatch sampled its respondents, but it's possible that the way that they did it led them to capture a set of respondents who were less inclined to use Apple Music. Beyond this, we can't really say more without knowing the details of the survey process.
  3. Respondents are not independent. It's possible that the survey respondents are not independent of each other. This would primarily affect the uncertainty about the estimate, making it larger than we might expect if the respondents were all independent. However, since we do not know what MusicWatch's uncertainty about their estimate was in the first place, it's difficult to tell if dependence between respondents could play a role. Apple's number, of course, has no uncertainty.
  4. Measurement differences. This is the big one, in my opinion. We don't know how either MusicWatch or Apple defined "still using the service". You could imagine a variety of ways to determine whether a person was still using the service. You could ask "Have you used it in the last week?" or perhaps "Did you use it yesterday?" Responses to these questions would be quite different and would likely lead to different overall percentages of usage.
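
To put a rough number on point 1, here is a back-of-the-envelope calculation (my own, under the simplifying assumption that the 5,000 respondents were a simple random sample of independent users):

```r
## Approximate 95% confidence interval for MusicWatch's 48% estimate,
## assuming a simple random sample of n = 5000 independent respondents.
p_hat <- 0.48
n     <- 5000
se    <- sqrt(p_hat * (1 - p_hat) / n)
p_hat + c(-1, 1) * 1.96 * se
## roughly 0.466 to 0.494, so random variation alone cannot explain the gap to Apple's 21%
```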

We Used Data to Improve our HarvardX Courses: New Versions Start Oct 15

You can sign up by following the links here.

Last semester we successfully ran version 2 of my Data Analysis course. To create the second version, the first was split into eight courses. Over 2,000 students successfully completed the first of these, but, as expected, the numbers were lower for the more advanced courses. We wanted to remove any structural problems keeping students from maximizing what they get from our courses, so we studied the assessment question data, which included completion rates and times, and used the findings to make improvements. We also used qualitative data from the discussion boards. The major changes to version 3 are the following:

  • We no longer use R packages that Microsoft Windows users had trouble installing in the first course.
  • All courses are now designed to be completed in 4 weeks.
  • We added new assessment questions.
  • We improved the assessment questions determined to be problematic.
  • We split the two courses that students took the longest to complete into smaller modules. Students now have twice as much time to complete these.
  • We consolidated the case studies into one course.
  • We combined the materials from the statistics courses into a book, which you can download here. The material in the book matches the material taught in class, so you can use it to follow along.

You can enroll in any of the seven courses by following the links below. We will be on the discussion boards starting October 15, and we hope to see you there.

  1. Statistics and R for the Life Sciences starts October 15.
  2. Introduction to Linear Models and Matrix Algebra starts November 15.
  3. Statistical Inference and Modeling for High-throughput Experiments starts December 15.
  4. High-Dimensional Data Analysis starts January 15.
  5. Introduction to Bioconductor: Annotation and Analysis of Genomes and Genomic Assays starts February 15.
  6. High-performance Computing for Reproducible Genomics starts March 15.
  7. Case Studies in Functional Genomics starts April 15.

The landing page for the series continues to be here.


Data Analysis for the Life Sciences - a book completely written in R markdown

The book Data Analysis for the Life Sciences is now available on Leanpub.

Data analysis is now part of practically every research project in the life sciences. In this book we use data and computer code to teach the necessary statistical concepts and programming skills to become a data analyst. Following in the footsteps of Stat Labs, instead of showing theory first and then applying it to toy examples, we start with actual applications and describe the theory as it becomes necessary to solve specific challenges. We use simulations and data analysis examples to teach statistical concepts. The book includes links to computer code that readers can use to program along as they read the book.

It includes the following chapters: Inference, Exploratory Data Analysis, Robust Statistics, Matrix Algebra, Linear Models, Inference for High-Dimensional Data, Statistical Modeling, Distance and Dimension Reduction, Practical Machine Learning, and Batch Effects.

The text was completely written in R markdown and every section contains a link to the document that was used to create that section. This means that you can use knitr to reproduce any section of the book on your own computer. You can also access all these markdown documents directly from GitHub. Please send a pull request if you fix a typo or other mistake! For now we are keeping the R markdown files for the exercises private since they contain the solutions. But you can see the solutions if you take our online course quizzes. If we find that most readers want access to the solutions, we will open them up as well.
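
As a minimal sketch of what reproducing a section looks like (the file name below is a hypothetical placeholder; use the link provided in the section you want to rebuild):

```r
## Re-run all the R code in one section's source and rebuild it as HTML.
## "inference.Rmd" is a placeholder for a markdown file downloaded via the
## link in the corresponding section of the book.
library(knitr)
knit2html("inference.Rmd")   # executes the embedded code and writes inference.html
```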

The material is based on the online courses I have been teaching with Mike Love. As we created the courses, Mike and I wrote R markdown documents for the students and put them on GitHub. We then used Jekyll to create a webpage with html versions of the markdown documents. Jeff then convinced us to publish it on Leanpub. So we wrote a shell script that compiled the entire book into a Leanpub directory, and after countless hours of editing and tinkering we have a 450+ page book with over 200 exercises. The entire book compiles from scratch in about 20 minutes. We hope you like it.


The Leek group guide to writing your first paper

I have written guides on reviewing papers, sharing data,  and writing R packages. One thing I haven't touched on until now has been writing papers. Certainly for me, and I think for a lot of students, the hardest transition in graduate school is between taking classes and doing research.

There are several hard parts to this transition, including trying to find a problem, trying to find an advisor, and having a ton of unstructured time. One of the hardest things I've found is knowing (a) when to start writing your first paper and (b) how to do it. So I wrote a guide for students in my group:

On how to write your first paper. It might be useful for other folks as well, so I put it up on GitHub. Just like with the other guides I've written, this is a very opinionated (read: doesn't apply to everyone) guide. I would also appreciate any feedback/pull requests people have.