What can we learn from data analysis failures?

Roger Peng
2018-04-23

Back in February, I gave a talk at the Walter and Eliza Hall Institute of Medical Research in Melbourne titled “Lessons in Disaster: What Can We Learn from Data Analysis Failures?” This talk was quite different from the talks I usually give on computing or environmental health, and I’m guessing it probably showed. It was nevertheless a fun talk and I got to talk about space-related stuff. If you want to hear some discussion of the development of this talk, you can listen to Episode 53 of Not So Standard Deviations.

It’s difficult to have a discussion about data analysis without some mention of John Tukey. In particular, his paper “The Future of Data Analysis”, published in the Annals of Mathematical Statistics in 1962, weighs heavily. In fact, it weighs so heavily that the paper required its own table of contents! One paragraph near the end of Tukey’s massive paper has always struck me, in that his description of how we should teach data analysis is relatively simple, but we seem unable to implement it.

We would teach [data analysis] like biochemistry, with emphasis on what we have learned…with relegation of all question of detailed methods to the “laboratory work”. All study of detailed proofs…or comparisons of ways of presentation would belong in “the laboratory” rather than “in class”.

My interest is in taking this statement rather broadly and asking how often we actually do this when it comes to data analysis.

Another statement that has fascinated me comes from Daryl Pregibon, who wrote in a 1991 National Research Council report titled The Future of Statistical Software,

Throughout American or even global industry, there is much advocacy of statistical process control and of understanding processes. Statisticians have a process they espouse but do not know anything about. It is the process of putting together many tiny pieces, the process called data analysis, and is not really understood.

The “putting together many tiny pieces” aspect of data analysis is really key. My guess is that Pregibon was referring to putting together many statistical tools and making all the little decisions about data that one always makes. However, often those little “pieces” are in fact people, and getting all of those people to fit together can be an equally challenging and equally critical aspect of success.

Learning about how data analyses succeed or fail (but more importantly, fail) is extremely challenging without actually going through the process yourself. I don’t think I ever learned about it except through first-hand experience, which took place over the course of years. There are a few reasons for this that I have observed over time.

I want to use one case study to think about what kinds of generalizable knowledge we can obtain from data analysis failures. The one I describe below is special because it had serious implications and large parts of it played out in public. While we likely will never know all of the details, we know enough to have a meaningful discussion.

The Duke Saga

The “Duke Saga” has been a tough nut for me to crack for many years now. While it’s fascinating because of the sheer number of problems that occurred, I’ve always struggled to identify exactly what went wrong; in other words, given what I know now, what intervention would I have taken to prevent a similar episode from happening in the future? I’ve long felt that the lessons people take away from this saga are not the correct ones, in the sense that applying them to future work would not prevent a similar failure.

First some background. Note that this is a highly abbreviated timeline:

- 2006: Anil Potti and colleagues at Duke publish work in Nature Medicine claiming that gene expression signatures derived from cell lines can predict which chemotherapies a patient’s tumor will respond to.
- 2007: Keith Baggerly and Kevin Coombes at MD Anderson try to reproduce the results and uncover serious problems, including mislabeled samples and off-by-one indexing errors. Duke begins clinical trials that assign patients to chemotherapy based on the signatures.
- 2009: Baggerly and Coombes publish their forensic reanalysis in the Annals of Applied Statistics. Duke suspends the trials and conducts an internal review, after which the trials are restarted.
- 2010: The Cancer Letter reports that Potti falsely claimed to have been a Rhodes Scholar. The trials are terminated and Potti resigns from Duke.
- 2011–2015: Multiple papers are retracted, the Institute of Medicine issues a report on omics-based tests, and lawsuits brought by patients and their families are settled.

I’ve obviously left out a lot of detail; if you want to hear more about this, you can hear it from Keith Baggerly himself in this nice lecture. However, I just wanted to give a sketch of what happened over what is now a more than 10-year period.

In my opinion, the details of the Duke Saga were salacious, but it was difficult to draw any conclusions about what actually went wrong and what approach should be taken to prevent something like this from happening again. Most people were just speculating about what could have happened, and the people who really would know the details weren’t talking very much. Here’s how I would summarize the basic points that most people seemed to take away from the publicly available information about the saga:

- Genomic data analyses are complicated and “hard to do”, so some mistakes are inevitable.
- The problems were largely honest data management errors (swapped labels, off-by-one indexing mistakes) compounded by a lack of statistical training in the lab.
- If the analyses had been reproducible, with data and code openly available, the errors would have been caught much sooner or prevented entirely.

In January 2015, The Cancer Letter published a memo written by Bradford Perez, who in 2008 was a medical student trainee in the Potti lab. He saw what was going on in the lab and recognized its shoddiness. Problems that Baggerly and Coombes had to essentially reverse engineer from the outside, Perez saw first hand and immediately recognized as serious. In fact, in 2008 he wrote a memo to the leadership of his institution describing some of those problems:

“Fifty-nine cell line samples with mRNA expression data…were split in half to designate sensitive and resistant phenotypes. Then in developing the model, only those samples which fit the model best in cross validation were included. Over half of the original samples were removed…. This was an incredibly biased approach which does little more than give the appearance of a successful cross validation.” [emphasis added]

He further wrote,

“At this point, I believe that the situation is serious enough that all further analysis should be stopped to evaluate what is known about each predictor and it should be reconsidered which are appropriate to continue using and under what circumstances…. I would argue that at this point nothing…should be taken for granted. All claims of predictor validations should be independently and blindly performed.”

The memo was ignored by the leadership. Nothing was stopped and nothing was changed at the time. Perez eventually took his name off a series of papers and left the lab.
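
To see concretely why the selection procedure Perez describes “does little more than give the appearance of a successful cross validation,” here is a minimal simulation sketch in Python. It is emphatically not the Duke analysis or the Duke data: the data below are pure noise, and the sample size, classifier, and feature count are hypothetical stand-ins. The point is only the mechanism: if you repeatedly discard the samples that cross-validation misclassifies, the apparent cross-validated accuracy on the retained subset climbs well above chance even though there is no signal to find.

```python
# Toy illustration of sample selection bias in cross-validation
# (hypothetical data and numbers, not the Duke analysis).
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(0)

n_samples, n_features = 60, 50                 # stand-ins for ~59 cell lines with expression features
X = rng.normal(size=(n_samples, n_features))   # random "expression" data: no real signal
y = np.repeat([0, 1], n_samples // 2)          # arbitrary "resistant" / "sensitive" labels

model = NearestCentroid()

# Honest cross-validation using all samples: accuracy hovers near chance (~0.5).
pred = cross_val_predict(model, X, y, cv=5)
print(f"All {n_samples} samples: CV accuracy = {(pred == y).mean():.2f}")

# Biased procedure: discard the samples that cross-validation misclassified,
# re-run cross-validation on the retained subset, and repeat.
X_kept, y_kept = X, y
for round_number in range(1, 6):
    keep = pred == y_kept
    X_kept, y_kept = X_kept[keep], y_kept[keep]
    if np.bincount(y_kept, minlength=2).min() < 3:
        break  # too few samples left in one of the groups to cross-validate
    pred = cross_val_predict(model, X_kept, y_kept, cv=3)
    print(f"Round {round_number}: kept {len(y_kept)} samples, "
          f"CV accuracy = {(pred == y_kept).mean():.2f}")
```

The improving numbers in this sketch are driven entirely by the sample selection; the data contain no signal at all. That is exactly why a “validation” produced this way can look impressive while saying nothing about whether a predictor would work on new patients.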

Lessons Learned

This memo is critical in my opinion because it fundamentally changes the narrative about what went wrong in this entire saga. Yes, genomic analyses are “hard to do”, but clearly there was expertise in the lab to recognize that difficulty and to recognize when statistical methods were being incorrectly applied. The problem was not a lack of training, nor was it simply the result of a few honest data management mistakes here and there. The problem was a breakdown in communication and a total lack of trust between investigators and members of the data analytic team. Perez clearly felt uncomfortable raising these issues in the lab and wrote the memo knowing that he had “much to lose”. He thought the problem was that statistical methods were being misapplied, but the real problem was that he did not feel comfortable discussing it openly. A breakdown in the relationship between an analyst and an investigator is a serious data analytic problem.

It’s possible for me to imagine an alternate scenario where a data analyst like Perez sees a problem with the way models are being developed or applied, mentions this to the principal investigator and has a detailed discussion, perhaps seeks outside expertise (e.g. from a statistician), and then modifies the procedure to fix the problem. It’s easy for me to imagine this because it happens pretty much every day. No data analysis is perfect from start to finish. Changes and course corrections are constantly made along the way. When I analyze data and run into problems that can be traced to data collection, I will raise this with the PI. When I give results to other investigators, sometimes the results don’t seem right to them and they come to me and seek clarification. If it’s a mistake on my part, I’ll fix it and send them updated results.

When the relationships between an analyst and various members of the investigator team are strong and there is substantial trust between them, honest mistakes are just minor bumps in the road that can be uncovered, discussed, and fixed. When there is a breakdown in those relationships, the exact same mistakes are covered up, denied, and buried. A breakdown in the relationships between analysts and other investigators on the team generally cannot be fixed with a better statistical method, or a reproducible workflow, or open source software. Recognizing that this is the problem is difficult because often there is no easy solution.

I think the data analytic lesson learned from the Duke Saga is that data analysts need to be allowed to say “stop”. But also, the ability to do so depends critically on the relationships between the analyst and members of the investigator team. If an analyst feels uncomfortable raising analytic issues with other members, then arguably all analyses done by the team are at risk. No amount of statistical expertise or tooling can fix this fundamental human problem.