Simply Statistics


AAAS S&T Fellows for Big Data and Analytics

Thanks to Steve Pierson of the ASA for letting us know that the AAAS Science and Technology Fellowship program has a new category for "Big Data and Analytics". For those not familiar, AAAS organizes the S&T Fellowship program to get scientists involved in the policy-making process in Washington and at the federal agencies. In general, the requirements for the program are

Applicants must have a PhD or an equivalent doctoral-level degree at the time of application. Individuals with a master's degree in engineering and at least three years of post-degree professional experience also may apply. Some programs require additional experience. Applicants must be U.S. citizens. Federal employees are not eligible for the fellowships.

Further details are on the AAAS web site.

I've met a number of current and former AAAS fellows working on Capitol Hill and at the various agencies and I have to say I've been universally impressed. I personally think getting more scientists into the federal government and involved with the policy-making process is a Good Thing. If you're a statistician looking to have a different kind of impact, this might be for you.


The return of the stat - Computing for Data Analysis & Data Analysis back on Coursera!

It's the return of the stat. Roger and I are going to be re-offering our Coursera courses:

Computing for Data Analysis (starts Sept 23)

Sign up here.

Data Analysis (starts Oct 28)

Sign up here.


Evidence-based Data Analysis: Treading a New Path for Reproducible Research (Part 2)

Last week I posted about how I thought the notion of reproducible research did not go far enough to address the question of whether you could trust that a given data analysis was conducted appropriately. From some of the discussion on the post, it seems some of you thought I believed therefore that reproducibility had no value. That’s definitely not true and I’m hoping I can clarify my thinking in this follow-up post.

Just to summarize a bit from last week, one key problem I find with requiring reproducibility of a data analysis is that it comes only at the most “downstream” part of the research process, the post-publication part. So anything that was done incorrectly has already happened and the damage has been done to the analysis. Having code and data available, importantly, makes it possible to discover these problems, but only after the fact. I think this results in two problems: (1) It may take a while to figure out what exactly the problems are (even with code/data) and how to fix them; and (2) the problems in the analysis may have already caused some sort of harm.

Open Source Science?

For the first problem, I think a reasonable analogy for reproducible research is open source software. There the idea is that source code is available for all computer programs so that we can inspect and modify how a program runs. With open source software “all bugs are shallow”. But the key here is that as long as all programmers have the requisite tools, they can modify the source code on their own, publish their corrected version (if they are fixing a bug), others can review it and accept or modify, and on and on. All programmers are more or less on the same footing, as long as they have the ability to hack the code. With distributed source code management systems like git, people don’t even need permission to modify the source tree. In this environment, the best idea wins.

The analogy with open source software breaks down a bit with scientific research because not all players are on the same footing. Typically, the original investigator is much better equipped to modify the “source code”, in this case the data analysis, and to fix any problems. Some types of analyses may require tremendous resources that are not available to all researchers. Also, it might take a long time for others who were not involved in the research to fully understand what is going on and how to make reasonable modifications. That may involve, for example, learning the science in the first place, or learning how to program a computer for that matter. So I think making changes to a data analysis and having them accepted is a slow process in science, much more so than with open source software. There are definitely things we can do to improve our ability to make rapid changes/updates, but the implementation of those changes is only just getting started.

First Do No Harm

The second problem, that some sort of harm may have already occurred before an analysis can be fully examined, is an important one. As I mentioned in the previous post, merely stating that an analysis is reproducible doesn’t say a whole lot about whether it was done correctly. In order to verify that, someone knowledgeable has to go into the details and muck around to see what is going on. If someone is not available to do this, then we may never know what actually happened. Meanwhile, the science still stands and others may build off of it.

In the Duke saga, one of the most concerning aspects of the whole story was that some of Potti’s research was going to be used to guide therapy in a clinical trial. The fact that a series of flawed data analyses was going to be used as the basis of choosing what cancer treatments people were going to get was very troubling. In particular, one of these flawed analyses reversed the labeling of the cancer and control cases!

To me, it seems that waiting for someone like Keith Baggerly to come along and spend close to 2,000 hours reproducing, inspecting, and understanding a series of analyses is not an efficient system. In particular, when actual human lives may be affected, it would be preferable if the analyses were done right in the first place, without the “statistics police” having to come in and check that everything was done properly.

Evidence-based Data Analysis

What I think the statistical community needs to invest time and energy into is what I call “evidence-based data analysis”. What do I mean by this? Most data analyses are not the simple classroom exercises that we’ve all done involving linear regression or two-sample t-tests. Most of the time, you have to obtain the data, clean that data, remove outliers, impute missing values, transform variables and on and on, even before you fit any sort of model. Then there’s model selection, model fitting, diagnostics, sensitivity analysis, and more. So a data analysis is really a pipeline of operations where the output of one stage becomes the input of another.
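To make the pipeline picture concrete, here is a minimal sketch in code. The stage names and rules (mean imputation, a simple standard-deviation cutoff for outliers) are purely illustrative, not from the post, but the structure is the point: each stage's output is the next stage's input, so a mistake early on propagates through everything downstream.

```python
def impute_missing(values, fill=None):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed) if fill is None else fill
    return [fill if v is None else v for v in values]

def remove_outliers(values, k=3.0):
    """Drop points more than k standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) <= k * sd]

def pipeline(raw):
    # Output of one stage becomes the input of the next.
    return remove_outliers(impute_missing(raw))
```

In a real analysis each stage is far richer, but this is the shape that evidence-based data analysis would operate on: pick a justified method for each stage, stage by stage.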

The basic idea behind evidence-based data analysis is that for each stage of that pipeline, we should be using the best method, justified by appropriate statistical research that provides evidence favoring one method over another. If we cannot reasonably agree on a best method for a given stage in the pipeline, then we have a gap that needs to be filled. So we fill it!

Just to clarify things before moving on too far, here’s a simple example.

Evidence-based Histograms

Consider the following simple histogram.


The histogram was created in R by calling hist(x) on some Normal random deviates (I don’t remember the seed so unfortunately it is not reproducible). Now, we all know that a histogram is a kind of smoother, and with any smoother, the critical parameter is the smoothing parameter or the bandwidth. Here, it’s the size of the bin or the number of bins.

Notice that when I call ‘hist’ I don’t actually specify the number of bins. Why not? Because in R, the default is to use Sturges’ formula for the number of bins. Where does that come from? Well, there is a paper in the Journal of the American Statistical Association in 1926 by H. A. Sturges that justifies why such a formula is reasonable for a histogram (it is a very short paper; those were the days). R provides other choices for the number of bins. For example, David Scott wrote a paper in Biometrika that justified bandwidth/bin size based on an integrated mean squared error criterion.
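Both rules are simple enough to compute directly. Here is a rough sketch (in Python rather than R) of the formulas behind R's `nclass.Sturges` and Scott's bin-width rule; note that R's `hist` passes the suggested number of bins through `pretty()`, so the bin count actually drawn can differ slightly from the formula.

```python
import math

def sturges_bins(n):
    """Sturges (1926): suggested number of bins = ceil(log2(n)) + 1."""
    return math.ceil(math.log2(n)) + 1

def scott_binwidth(values):
    """Scott (1979): bin width h = 3.49 * sd * n^(-1/3),
    derived by minimizing integrated mean squared error."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return 3.49 * sd * n ** (-1 / 3)
```

For 100 observations, Sturges' formula suggests 8 bins; for 1,000 it suggests 11, so the bin count grows very slowly with sample size.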

The point is that R doesn’t just choose the default number of bins willy-nilly; there’s actual research behind that choice and evidence supporting why it’s a good choice. Now, we may not all agree that this default is the best choice at all times, but personally I rarely modify the default number of bins. Usually I just want to get a sense of what the distribution looks like and the default is fine. If there's a problem, transforming the variable is often more productive than modifying the number of bins. What's the best transformation? Well, it turns out there's research on that too.

Evidence-based Reproducible Research

Now why can’t we extend the idea behind the histogram bandwidth to all data analysis? I think we can. For every stage of a given data analysis pipeline, we can have the “best practices” and back up those practices with statistical research. Of course it’s possible that such best practices have not yet been developed. This is common in emerging areas like genomics where the data collection technology is constantly changing. That’s fine, but in more mature areas, I think it’s possible for the community to agree on a series of practices that work, say, 90% of the time.

There are a few advantages to evidence-based reproducible research.

  1. It reduces the “researcher degrees of freedom”. Researchers would be disincentivized from choosing the method that produces the “best” results if there is already a generally agreed upon approach. If a given data analysis required a different approach, the burden would be on the analyst to justify why a deviation from the generally accepted approach was made.
  2. The methodology would be transparent because the approach would have been vetted by the community. I call this "transparent box" analysis, as opposed to black box analysis. The analysis would be transparent so you would know exactly what is going on, but it would be "locked in a box" so that you couldn't tinker with it to game the results.
  3. You would not have the lonely data analyst coming up with their own magical method to analyze the data. If a researcher claimed to have conducted an analysis using an evidence-based pipeline, you could at least have a sense that something reasonable was done. You would still need reproducibility to ensure that the researcher was not misrepresenting him/herself, but now we would have two checks on the analysis, not just one.
  4. Most importantly, evidence-based reproducible research attacks the furthest upstream aspect of the research, which is the analysis itself. It guarantees that generally accepted approaches are used to analyze the data from the very beginning and hopefully prevents problems from occurring rather than letting them propagate through the system.

What can we do to bring evidence-based data analysis practices to all of the sciences? I’ll write about what I think we can do in the next post.


Interview with Ani Eloyan and Betsy Ogburn

Jeff and I interview Ani Eloyan and Betsy Ogburn, two new Assistant Professors in the Department of Biostatistics here.

Jeff and I talk to Ani and Betsy about their research interests and finally answer the burning question: "What is the future of statistics?"


Statistics meme: Sad p-value bear

Sad p-value bear wishes you had a bigger sample size.

I was just at a conference where the idea for a sad p-value bear meme came up (in the spirit of Biostatistics Ryan Gosling). This should not be considered an endorsement of p-values or p-value hacking.


Did Faulty Software Shut Down the NASDAQ?

This past Thursday, the NASDAQ stock exchange shut down for just over 3 hours due to some technical problems. It's still not clear what the problem was because NASDAQ officials are being tight-lipped. NASDAQ has had a bad run of problems recently, the most visible being the botched Facebook initial public offering.

Stock trading these days is a highly technical business involving complex algorithms and multiple exchanges spread across the country. Poorly coded software or just plain old bugs have the potential to take down an entire exchange and paralyze parts of the financial system for hours.

Mary Jo White, the Chairman of the SEC is apparently getting involved.

Thursday evening, Ms. White said in a statement that the paralysis at the Nasdaq was “serious and should reinforce our collective commitment to addressing technological vulnerabilities of exchanges and other market participants.”

She said she would push ahead with recently proposed rules that would add testing requirements and safeguards for trading software. So far, those rules have faced resistance from the exchange companies. Ms. White said that she would “shortly convene a meeting of the leaders of the exchanges and other major market participants to accelerate ongoing efforts to further strengthen our markets.”

Having testing requirements for trading software is an interesting idea. It's easy to see why the industry would be against it. Trading is a fast-moving business and my guess is software is updated/modified constantly to improve performance or to give people an edge. If you had to get approval or run a bunch of tests every time you wanted to deploy something, you'd quickly fall behind the curve.

But is there an issue of safety here? If a small bug in the computer code on which the exchange relies can take down the entire system for hours, isn't that a problem of "financial safety"? Other problems, like the notorious "flash crash" of 2010 where the Dow Jones Industrial Average dropped 700 points in minutes, have the potential to affect regular people, not just hedge fund traders.

It's not unprecedented to subject computer code to higher scrutiny. Code that flies airplanes or runs air-traffic control systems is all tested and reviewed rigorously before being put into production and I think most people would consider that reasonable. Are financial markets the next area? What about scientific research?


Stratifying PISA scores by poverty rates suggests imitating Finland is not necessarily the way to go for US schools

For the past several years a steady stream of articles and opinion pieces have been praising the virtues of Finnish schools and exhorting the US to imitate this system. One data point supporting this view comes from the most recent PISA scores (2009), in which Finland outscored the US 536 to 500. Several people have pointed out that this is an apples (huge and diverse) to oranges (small and homogeneous) comparison. One of the many differences that makes the comparison complicated is that Finland has fewer students living in poverty (3%) than the US (20%). This post defending US public school teachers makes this point with data. Here I show these data in graphical form. The plot on the left shows PISA scores versus the percent of students living in poverty for several countries. There is a pattern suggesting that higher poverty rates are associated with lower PISA scores. In the plot on the right, US schools are stratified by % poverty (orange points). The regression line is the same. Some countries are added (purple) for comparative purposes (the post does not provide their poverty rates). Note that US schools with poverty rates comparable to Finland's (below 10%) outperform Finland, and schools in the 10-24% range aren't far behind. So why should these schools change what they are doing? Schools with poverty rates above 25% are another story. Clearly the US has lots of work to do in trying to improve performance in these schools, but is it safe to assume that Finland's system would work for these student populations?


Note that I scraped data from this post and not the original source.


If you are near DC/Baltimore, come see Jeff talk about Coursera

I'll be speaking at the Data Science Maryland meetup. The title of my presentation is "Teaching Data Science to the Masses". The talk is at 6pm on Thursday, Sept. 19th. More info here.


Chris Lane, U.S. tourism boycotts, and large relative risks on small probabilities

Chris Lane was tragically killed (link via Leah J.) in a shooting in Duncan, Oklahoma. According to the reports, it was apparently a random and completely senseless act of violence. It is horrifying to think that those kids were just looking around for someone to kill because they were bored.

Gun violence in the U.S. is way too common and I'm happy about efforts to reduce the chance of this type of event. But I noticed this quote in the above linked CNN article from the former deputy prime minister of Australia, Tim Fischer:

People thinking of going to the USA for business or tourist trips should think carefully about it given the statistical fact you are 15 times more likely to be shot dead in the USA than in Australia per capita per million people.

The CNN article suggests he is calling for a boycott of U.S. tourism. I'm guessing he got his data from a table like this. According to the table, the total firearm-related deaths per one million people are 10.6 in Australia and 103 in the U.S., so the ratio is something like 10 times. If you restrict to homicides, the rates are 1.3 per million for Australia and 36 per million for the U.S. Here the ratio is almost 28 times.

So the question is, should you boycott the U.S. if you are an Australian tourist? Well, the percentage of people killed in firearm homicides is 0.0036% in the U.S. and 0.00013% for Australia. So it is incredibly unlikely that you will be killed by a firearm in either country. The issue here is that with small probabilities, you can get huge relative risks, even when both outcomes are very unlikely in an absolute sense. The Chris Lane killing is tragic and horrifying, but I'm not sure a tourism boycott for the purposes of safety is justified.
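The arithmetic behind that point is worth writing out. Using the homicide rates quoted above (per one million people), the relative risk is large even though both absolute probabilities are tiny:

```python
# Firearm homicide rates quoted above, per one million people.
us_rate = 36.0
aus_rate = 1.3

# The relative risk sounds alarming...
relative_risk = us_rate / aus_rate  # roughly 28-fold

# ...but the absolute risks are minuscule in both countries.
us_probability = us_rate / 1_000_000   # 0.0036%
aus_probability = aus_rate / 1_000_000  # 0.00013%

print(f"relative risk: {relative_risk:.1f}x")
print(f"absolute risk, US: {us_probability:.4%}, Australia: {aus_probability:.5%}")
```

A 28-fold relative risk on a 0.00013% baseline still leaves an absolute risk of only 0.0036%, which is the crux of the argument against the boycott.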


Treading a New Path for Reproducible Research: Part 1

Discussions about reproducibility in scientific research have been on the rise lately, including on this blog. There are many underlying trends that have produced this increased interest in reproducibility: larger and larger studies being harder to replicate independently, cheaper data collection technologies/methods producing larger datasets, cheaper computing power allowing for more sophisticated analyses (even for small datasets), and the rise of general computational science (for every “X” we now have “Computational X”).

For those that haven’t been following, here’s a brief review of what I mean when I say “reproducibility”. For the most part in science, we focus on what I and some others call “replication”. The purpose of replication is to address the validity of a scientific claim. If I conduct a study and conclude that “X is related to Y”, then others may be encouraged to replicate my study--with independent investigators, data collection, instruments, methods, and analysis--in order to determine whether my claim of “X is related to Y” is in fact true. If many scientists replicate the study and come to the same conclusion, then there’s evidence in favor of the claim’s validity. If other scientists cannot replicate the same finding, then one might conclude that the original claim was false. In either case, this is how science has always worked and how it will continue to work.

Reproducibility, on the other hand, focuses on the validity of the data analysis. In the past, when datasets were small and the analyses were fairly straightforward, the idea of being able to reproduce a data analysis was perhaps not that interesting. But now, with computational science, where data analyses can be extraordinarily complicated, there’s great interest in whether certain data analyses can in fact be reproduced. By this I mean is it possible to take someone’s dataset and come to the same numerical/graphical/whatever output that they came to. While this seems theoretically trivial, in practice it’s very complicated because a given data analysis, which typically will involve a long pipeline of analytic operations, may be difficult to keep track of without proper organization, training, or software.

What Problem Does Reproducibility Solve?

In my opinion, reproducibility cannot really address the validity of a scientific claim as well as replication. Of course, if a given analysis is not reproducible, that may call into question any conclusions drawn from the analysis. However, if an analysis is reproducible, that says practically nothing about the validity of the conclusion or of the analysis itself.

In fact, there are numerous examples in the literature of analyses that were reproducible but just wrong. Perhaps the most nefarious recent example is the Potti scandal at Duke. Given the amount of effort (somewhere close to 2,000 hours) Keith Baggerly and his colleagues had to put into figuring out what Potti and others did, I think it’s reasonable to say that their work was not reproducible. But in the end, Baggerly was able to reproduce some of the results--this was how he was able to figure out that the analyses were incorrect. If the Potti analysis had not been reproducible from the start, it would have been impossible for Baggerly to come up with the laundry list of errors that they made.

The Reinhart-Rogoff kerfuffle is another example of analysis that ultimately was reproducible but nevertheless questionable. While Herndon did have to do a little reverse engineering to figure out the original analysis, it was nowhere near the years-long effort of Baggerly and colleagues. However, it was Reinhart-Rogoff’s unconventional weighting scheme (fully reproducible, mind you) that drew all of the attention and strongly influenced the analysis.

I think the key question we want to answer when seeing the results of any data analysis is “Can I trust this analysis?” It’s not possible to go into every data analysis and check everything, even if all the data and code were available. In most cases, we want to have a sense that the analysis was done appropriately (if not optimally). I would argue that requiring that analyses be reproducible does not address this key question.

With reproducibility you get a number of important benefits: transparency, data and code for others to analyze, and an increased rate of transfer of knowledge. These are all very important things. Data sharing in particular may be important independent of the need to reproduce a study if others want to aggregate datasets or do meta-analyses. But reproducibility does not guarantee validity or correctness of the analysis.

Prevention vs. Medication

One key problem with the notion of reproducibility is the point in the research process at which we can apply it as an intervention. Reproducibility plays a role only in the most downstream aspect of the research process--post-publication. Only after a paper is published (and after any questionable analyses have been conducted) can we check to see if an analysis was reproducible or conducted in error.


At this point it may be difficult to correct any mistakes if they are identified. Grad students have graduated, postdocs have left, people have moved on. In the Potti case, letters to the journal editors were ignored. While it may be better to check the research process at the end rather than to never check it, intervening at the post-publication phase is arguably the most expensive place to do it. At this phase of the research process, you are merely “medicating” the problem, to draw an analogy with chronic diseases. But fundamental data analytic damage may have already been done.

This medication aspect of reproducibility reminds me of a famous quotation from R. A. Fisher:

To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.

Reproducibility allows for the statistician to conduct the post mortem of a data analysis. But wouldn’t it have been better to have prevented the analysis from dying in the first place?

Moving Upstream

There has already been much discussion of changing the role of reproducibility in the publication/dissemination process. What if a paper had to be deemed reproducible before it was published? The question here is who will reproduce the analysis? We can't trust the authors to do it so we have to get an independent third party. What about peer reviewers? I would argue that this is a pretty big burden to place on a peer reviewer who is already working for free. How about one of the Editors? Well, at the journal Biostatistics, that’s exactly what we do. However, our policy is voluntary and only plays a role after a paper has been accepted through the usual peer review process. At any rate, from a business perspective, most journal owners will be reluctant to implement any policy that might reduce the number of submissions to the journal.

What Then?

To summarize, I believe reproducibility of computational research is very important, primarily to increase transparency and to improve knowledge sharing. However, I don’t think reproducibility in and of itself addresses the fundamental question of “Can I trust this analysis?”. Furthermore, reproducibility plays a role at the most downstream part of the research process (post-publication) where it is costliest to fix any mistakes that may be discovered. Ultimately, we need to think beyond reproducibility and to consider developing ways to ensure the quality of data analysis from the start.

How can we address the key problem concerning the validity of a data analysis? I’ll talk about what I think we should do in Part 2 of this post.