Defining success in data analysis has eluded me for quite some time now. About two years ago I tried to explore this question in my Dean’s Lecture, but ultimately I think I missed the mark. In that talk I tried to identify standards (I called them “aesthetics”) by which we could universally evaluate the quality of a data analysis and tried to make an analogy with music theory. It was a fun talk, in part because I got to play the end of Charles Ives’ Second Symphony.
Statisticians, in my experience, do not discuss this topic very much. Perhaps that’s because it’s so obvious that everyone has an (unspoken) understanding of it, or because everyone has a slightly different understanding of it, or because no one understands it at all. Whatever the reason, in my close to twenty years as a statistician, I don’t think I’ve had many in-depth conversations with anyone about what makes a data analysis successful. The most that I’ve ever discussed this topic is on Not So Standard Deviations with Hilary Parker, where it is a frequent topic of conversation. Recently, Hilary gave a talk related to this topic (slides here), and so I was inspired to write something.
I think I’ve come around to the following definition of data analysis success:
A data analysis is successful if the audience to which it is presented accepts the results.
There are a number of things to unpack here, so I will walk through them. Two notions in particular are key: acceptance and the audience.
The first idea is the notion of acceptance. It’s tempting to confuse this with belief, but they are two different concepts that need to be kept separate (although that can be difficult at times). Acceptance of an analysis involves the analysis itself—the data and the methods applied to it, along with the narrative told to explain the results. Belief in the results depends on the analysis itself as well as many other things outside the analysis, including previous analyses, existing literature, and the state of the science (in purely Bayesian terms, your prior). A responsible audience can accept an analysis without necessarily believing its principal claims, but these two concepts are likely to be correlated.
For example, suppose a team at your company designs an experiment to collect data to determine if lowering the price of a widget will have an effect on profits for your widget-making company. During the data collection process, there was a problem which resulted in some of the data being missing in a potentially informative way. The data are then handed to you. You do your best to account for the missingness and the resulting uncertainty, perhaps through multiple imputation or other adjustment methods. At the end of the day, you show me the analysis and conclude that lowering the price of a widget will increase profits 3-fold. I may accept that you did the analysis correctly and trust that you did your best to account for the problems encountered during collection using state-of-the-art methods. But I may disagree with the conclusion, in part because of the problems introduced with the missing data (not your fault), but also in part because we had previously lowered prices on another product that we sell and there was no corresponding increase in profits. Given the immense cost of doing the experiment, I might ultimately decide that we should abandon trying to modify the price of widgets and leave things where they are (at least for now). The analysis was a success.
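To make this concrete, the “account for the missingness” step might look something like the following sketch in R, using multiple imputation via the mice package. The data, variable names, and missingness mechanism below are entirely made up for illustration; nothing in the widget scenario specifies them.

```r
library(mice)

set.seed(2018)
## Made-up widget data in which profit is more likely to be missing when the
## price was high, i.e., the missingness is potentially informative
widget_sales <- data.frame(
  price = runif(200, 8, 12),
  units = rpois(200, 50)
)
widget_sales$profit <- 5 * widget_sales$units - 40 * widget_sales$price +
  rnorm(200, sd = 10)
widget_sales$profit[rbinom(200, 1, plogis(widget_sales$price - 11)) == 1] <- NA

## Create m = 20 completed datasets, fit the model of interest to each one,
## and pool the estimates using Rubin's rules
imp  <- mice(widget_sales, m = 20, printFlag = FALSE)
fits <- with(imp, lm(profit ~ price + units))
summary(pool(fits))
```

The pooled standard errors carry the extra uncertainty introduced by the imputation, which is part of what the audience is being asked to accept.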
This simple example illustrates two things. First, acceptance of the analysis depends primarily on the details of the analysis and my willingness to trust what the analyst has done. Were the missing data accounted for? Was the uncertainty properly presented? Can I reason about the data and understand how the data influence the results? Second, my belief in the results depends in part on things outside the analysis, things that are primarily outside the analyst’s control. In this case, these are the presence of missing data during collection and a totally separate experience lowering prices for a different product. How I weigh these external things, in the presence of your analysis, is a personal preference.
In scientific contexts it is tempting to think about validity. Here, a data analysis is successful if the claims made are true. If I analyze data on smoking habits and mortality rates and conclude that smoking causes lung cancer, then my analysis is successful if that claim is true. This definition has the advantage that it removes the subjective element of acceptance, which depends on the audience to which an analysis is presented. But validity is an awfully high bar to meet for any given analysis. In this smoking example, initial analyses of smoking and mortality data could not be deemed successful or not until decades after they were done. Most scientific conclusions require multiple replications occurring over many years by independent investigators and analysts before the community believes or concludes that they are true. Leaving data analysts in limbo for such a long time seems impractical and, frankly, unfair. And ultimately, I don’t think we want to penalize data analysts for making conclusions that turn out to be false, as long as we believe they are doing good work. Whether those claims turn out to be true or not may depend on things outside their control.
A related standard for analyses is essentially a notion of intrinsic validity. Rather than wait until we can validate a claim made by an analysis (perhaps decades down the road), we can evaluate an analysis by whether the correct or best approach was taken and the correct methods were applied. But there are at least two problems with this approach. First, in many scenarios it is not possible to know what the best method is, or what the best combination of methods to apply is, which would suggest that in many analyses we are uncertain of success. This seems rather unsatisfying and ultimately impractical. Imagine hiring a data analyst and saying to them, “In the vast majority of analyses that you do, we will not know if you are successful or not.” Second, even in the ideal scenarios, where we know what is correct or best, intrinsic validity is necessary but far from sufficient. This is because the context in which an analysis is performed is critical to understanding what is appropriate. If the analyst is unaware of that context, they may make critical mistakes, both from an analytical and an interpretative perspective. However, those same mistakes might be innocuous in a different context. It all depends, but the analyst needs to know the difference.
One story that comes to mind comes from the election victory of George W. Bush over Al Gore in the 2000 United States presidential election. That election hinged on votes counted in the state of Florida, where Bush and Gore were very close. Ultimately, lawsuits were filed and a trial was set to determine exactly how the vote counting should proceed. Statisticians were called to testify for both Bush and Gore. The statistician called to testify for the Gore team was Nicolas Hengartner, formerly of Yale University (he was my undergraduate advisor when I was there). Hengartner presented a thorough analysis of the data that was given to him by the Gore team and concluded that there were differences in how the votes were being counted across Florida and that some ballots were undercounted. However, on cross-examination, the lawyer for Bush was able to catch Hengartner in a “gotcha” moment that ultimately had to do with the manner in which the data were collected, of which Hengartner had been unaware. Was the analysis a success? It’s difficult to say without having been directly involved. Nobody challenged the methodology that Hengartner used in the analysis, which was by all accounts a very simple analysis. Therefore, one could argue that it had intrinsic validity. However, one could also argue that he should have known about the issue with how the data were collected (and perhaps the broader context) and incorporated that into his analysis and presentation to the court. Hengartner’s analysis was only one piece in a collection of evidence presented, and so it’s difficult to say what role it played in the ultimate outcome.
All data analyses have an audience, even if that audience is you. Ultimately, the audience may accept the results of an analysis or they may fail to accept it, in which case more analyses may need to be done. The fact that an analyst’s success may depend on a person different from the analyst may strike some as an uncomfortable feature. However, I think this is the reality of all data analyses. Success depends on human beings, unfortunately, and this is something analysts must be prepared to deal with. Recognizing that human nature plays a key role in determining the success of data analysis explains a number of key aspects of what we might consider to be good or bad analyses.
Data analysis is supposed to be about the data, right? Just the facts? And for the most part it is, up until the point you need to communicate your findings to an audience. The problem is that in any data analysis that would be meaningful to others, there are simply too many results to present, and so choices must be made. Depending on who the audience is, or what it is composed of, you will need to tune your presentation in order to get them to accept the analysis. How is this done? Here are two extremes.
In the worst case scenario, it is done through trickery. Graphs with messed-up axes, tables that obscure key data: we all know the horror stories. A sophisticated audience might detect this kind of trickery and reject the analysis, but maybe not. That said, let’s assume we are pure of heart. How does one organize a presentation to be successful? We all know the other horror story, which is the data dump. Here, the analyst presents everything they have done and essentially shifts the burden of interpretation onto the audience. Rarely is this desired. In some cases the audience will just want the data to do their own analyses, but then there’s no need for the analyst to waste their time doing any analysis.
Ultimately, the analyst must choose what to present, and this can cause problems. The choices must be made to fit the analyst’s narrative of “what is going on with the data”. They will choose to include some plots and not others and some tables and not others. These choices are directed by a narrative and an interpretation of the data. When an audience is upset by a data analysis, and they are being honest, they are usually upset with the chosen narrative, not with the facts per se. They will be upset with the combination of data that the analyst chose to include and the data that the analyst chose to exclude. Why didn’t you include that data? Why is this narrative so focused on this or that aspect?
At one extreme, one might think that a data analyst could easily be replaced by a machine: for various types of data and various types of questions, there should be a deterministic approach to analysis that does not change. Presumably, this could be coded up into a computer program, with the data fed into the program every time and a result presented at the end. How is it that every data analysis is so different that a human being is needed to craft a solution? How can the words “creativity” and “data analysis” even appear in the same sentence?
Well, it’s not true that every analysis is literally different. Many power calculations, for example, are identical. However, exactly how those power calculations are used can vary quite a bit from project to project. Even the very same calculation for the same study design can be interpreted differently in different projects. The same is true for other kinds of analyses, like regression modeling or fancier approaches. The reason creativity is needed in data analysis has fundamentally to do with things that we might traditionally think of as “outside” the data.
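To make the “identical calculation, different interpretation” point concrete, here is the kind of stock power calculation that might be reused verbatim across very different projects (the numbers are invented):

```r
## Sample size per group needed to detect a standardized difference of 0.5
## with 80% power at the 5% significance level
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
```

The output is the same whether it supports a small pilot study or an expensive confirmatory experiment; what differs is the weight the audience gives it and what they decide to do next.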
The audience is a key factor that is “outside the data” and influences how we analyze the data and present the results. One useful approach is to think about what final products need to be produced and then work backwards from there. For example, if the “audience” is another algorithm or procedure, then the exact nature of the output may not be important as long as it can be appropriately fed into the next part of the pipeline. In particular, interpretability may not weigh that heavily because no person will be looking at the output of this step. However, if a person will be looking at the results, then you may want to focus on a modeling approach that lets that person reason about the data and understand how the data inform the results. For example, you might want to make more plots of the data, or show detailed tables if the dataset is not that large.
In one extreme case, if the audience is another data analyst, you may want to do a relatively “light” analysis, but then prepare the data in such a way that it can be easily distributed to others to do their own analysis. This could be in the form of an R package or a CSV file or something else. Other analysts may not care about your fancy visualizations or models; they’d rather have the data for themselves and make their own results.
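As a sketch of what that kind of hand-off might look like (the file names and variables are hypothetical, and a CSV plus a codebook is only one of many reasonable formats):

```r
## A "light" analysis product: ship the cleaned data and a codebook rather
## than finished figures, so other analysts can draw their own conclusions
cleaned <- data.frame(
  id       = 1:3,
  group    = c("control", "treated", "treated"),
  response = c(1.2, 3.4, 2.8)  # made-up measurements
)

write.csv(cleaned, "cleaned_data.csv", row.names = FALSE)
write.csv(
  data.frame(
    variable    = names(cleaned),
    description = c("subject identifier", "treatment arm", "primary outcome")
  ),
  "codebook.csv", row.names = FALSE
)
```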
Creativity is needed in part because a data analyst must make a reasonable assessment of the audience’s needs, background, and preferences for receiving data analytic results. If the analyst has access to the audience, the analyst should ask questions about how best to present results. Otherwise, reasonable assumptions must be made or contingencies (e.g. backup slides, appendices) can be prepared for the presentation itself.
Many times I’ve had the experience of giving the same presentation to two different audiences. One audience loves it while the other hates it. How can that be if the analyses and presentation were exactly the same in both cases? The truth is that an analysis can be accepted or rejected by different audiences depending on who they are and what their expectations are. A common scenario involves giving a presentation to “insiders” who are keenly familiar with the context and the standard practices in the field. Taking that presentation verbatim to an “outside” audience that is less familiar will often result in failure because they will not understand what is going on. If that outside audience expects a certain set of procedures to be applied to the data, they may demand that you do the same, and refuse to accept the analysis until you do so.
I vividly remember one experience presenting an analysis of air pollution and health data that I had done. In practice talks with my own group everything had gone well and I thought things were reasonably complete. When giving the same talk to an outside group, they refused to accept what I’d done (or even interpret the results) until I had also run a separate analysis using a different kind of spline model. It wasn’t an unreasonable idea, so I did the separate analysis and at a later event with the same group I presented both analyses side by side. They were not wild about the conclusions, but the debate no longer centered on the analyses themselves and instead focused on other scientific aspects. In retrospect, I give them credit for accepting the analyses even if they did not necessarily believe the conclusion.
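For the curious, a rough sketch of what “the same analysis with two different spline models” could look like for daily air pollution and mortality data is below. Everything in it (the simulated data, the variable names, the particular spline choices) is invented for illustration; I am not claiming these were the models actually fit.

```r
library(splines)  # ns(): natural cubic splines
library(mgcv)     # gam(): penalized regression splines

set.seed(1)
## Made-up daily time series standing in for the real pollution and health data
n <- 730
daily <- data.frame(
  time = 1:n,
  temp = 20 + 10 * sin(2 * pi * (1:n) / 365) + rnorm(n),
  pm10 = 30 + 10 * sin(2 * pi * (1:n) / 365 + 1) + rnorm(n, sd = 5)
)
daily$deaths <- rpois(n, exp(4 + 0.0005 * daily$pm10))

## Analysis 1: Poisson regression adjusting for time and temperature with
## natural splines whose degrees of freedom are fixed in advance
fit_ns <- glm(deaths ~ pm10 + ns(time, df = 7) + ns(temp, df = 3),
              family = poisson, data = daily)

## Analysis 2: the same model with penalized splines whose smoothness is
## estimated from the data
fit_ps <- gam(deaths ~ pm10 + s(time) + s(temp),
              family = poisson, data = daily)

## Report the PM10 coefficient from both fits side by side
c(natural = coef(fit_ns)["pm10"], penalized = coef(fit_ps)["pm10"])
```

Running and presenting both models does not change the data at all, but it can change whether the audience accepts the analysis.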
I think my proposed definition of a successful data analysis is challenging (and perhaps unsettling) because it suggests that data analysts are responsible for things outside the data. In particular, they need to understand the context in which the data are collected and the audience to which results will be presented. I also think that’s why it took me so long to come around to it. But I think this definition explains much more clearly why it is so difficult to be a good data analyst. When we consider data analysis using traditional criteria developed by statisticians, we struggle to explain why some people are better data analysts than others and why some analyses are better than others. However, when we consider that data analysts have to juggle a variety of factors, both internal and external to the data, in order to achieve success, we see more clearly why this is such a difficult job and why good people are hard to come by.
Another implication of this definition of success is that human nature plays a big role and that much of successful data analysis is essentially a successful negotiation of human relations. Good communication with an audience can often play a much bigger role in success than whether you used a linear or a quadratic model. Trust between an analyst and an audience is critical when the analyst must make choices about what to present and what to omit. Admitting that human nature plays a role in data analysis success is difficult because humans are highly subjective, inconsistent, and hard to quantify. However, I think doing so gives us a better understanding of how to judge the quality of data analyses and how to improve them in the future.