The success of a data analysis depends critically on the audience. But why? A lot has to do with whether the audience trusts the analysis as well as the person presenting the analysis. Almost all presentations are incomplete because, for any analysis of reasonable size, some details must be omitted for the sake of clarity. A good presentation will have a structured narrative that guides the presenter in choosing what should be included and what should be omitted. However, audiences will vary in their acceptance of that narrative and will often want to know if other details exist.
Consider the following scenario:
A person is analyzing some data and is trying to determine if two features, call them X and Y, are related to each other. After looking at the data for some time, they come to you and declare that the Pearson correlation between X and Y is 0.85 and therefore conclude that X and Y are related.
The question then is, do you trust this analysis?
Given the painfully brief presentation of the scenario, I would imagine that most people experienced in data analysis would say something along the lines of “No”, or at least “Not yet”. So, why would we not trust this analysis?
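To make the point concrete, here is a minimal sketch of one reason a bare correlation coefficient is hard to trust on its own: two very different datasets can produce roughly the same Pearson correlation. The data below are made up purely for illustration and are not part of the original scenario.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Dataset 1: a genuine, roughly linear relationship between X and Y.
x1 = rng.normal(size=n)
y1 = x1 + rng.normal(scale=0.6, size=n)

# Dataset 2: essentially no relationship, except for one extreme point
# that single-handedly drags the correlation upward.
x2 = rng.normal(size=n)
y2 = rng.normal(size=n)
x2[0], y2[0] = 20.0, 20.0

r1 = np.corrcoef(x1, y1)[0, 1]
r2 = np.corrcoef(x2, y2)[0, 1]
print(f"Dataset 1 (real relationship): r = {r1:.2f}")
print(f"Dataset 2 (one outlier):       r = {r2:.2f}")
# Both coefficients land in roughly the same range (around 0.8), yet only
# a scatterplot or other follow-up checks would reveal the difference.
```

A number like 0.85, reported with no plots and no context, is consistent with both situations, which is exactly why the questions below come up.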
There are many questions that one might ask before placing any trust in the results of this analysis. Here are just a few:
The above questions about the presentation and statistical methodology are all reasonable and would likely come up in this scenario. In fact, there would likely be even more questions asked before one could be assured that the analysis was trustworthy, but this is just a smattering.
I think it’s reasonable to think that a good analyst would have concrete answers to all of these questions even though they were omitted from the presentation.
One might think of other things to do, but the items listed above are in direct response to the questions asked before.
My “analysis of variance” representation of a data analysis is roughly

Data Analysis = A + B + C

Here we have

- A = the things that were done and presented,
- B = the things that were done but not presented, and
- C = the things that were not done.
We can only observe A and B and need to speculate about C. I most trust an analysis when I believe that the C component is relatively small and essentially orthogonal to the other components of the equation (A and B). In other words, were one to actually do the things in the “Not Done” bucket, those things would have no influence on the overall results of the analysis. There should be nothing surprising or unexpected in the C component.
No matter what data is being analyzed, and no matter who is doing the analysis, the presentation of an analysis must be limited, usually because of time. Choices must be made to present a selection of what was actually done, therefore leaving a large number of items in the “Done but not Presented” component. An analogy might be writing slides for a presentation: often a few slides are left in the back of the slide deck, not presented but easily retrieved should a question come up. The material in those slides was important enough to warrant making a slide, but not important enough to make it into the presentation. In any substantial data analysis, the number of “slides” presented as the results is relatively small while the number of “slides” held in reserve is potentially huge.
Another large part of a data analysis concerns who is presenting. This person may or may not have a track record of producing good analyses and the background of the presenter may or may not be known to the audience. My response to the presentation of an analysis tends to differ based on who is presenting and my confidence in their ability to execute a good analysis. Ultimately, I think my approach to reviewing an analysis comes down to this:
One of the implications of this process is that two different presenters could make the exact same presentation and my response to them would be different. This is perhaps an unfortunate reality and opens the door to introducing all kinds of inappropriate biases. However, my understanding of the presenters’ abilities will affect how much I ask about B and C.
At the end of the day, I think an analysis is trustworthy when my understanding of A and B is such that I have reasonable confidence that C is orthogonal. In other words, there’s little else that can be done with the data that will have a meaningful impact on the results.
As an analyst, it might be useful to think about which things will fall into components A, B, and C. In particular, how one thinks about the three components will likely depend on the audience to which the presentation is being made. In fact, the “presentation” may range from sending a simple email to delivering a class lecture or a keynote talk. The manner in which you present the results of an analysis is part of the analysis and will play a large role in determining its success. If you are unfamiliar with the audience, or believe they are unfamiliar with you, you may need to place more elements in component A (the presentation), and perhaps talk a little faster. But if you already have a long-term relationship with the audience, a quick summary (with lots of things placed into component B) may be enough.
One of the ways to divide up the things that go into A, B, and C is to develop a good understanding of the audience. If the audience enjoys looking at scatterplots and making inquiries about individual data points, then you’re going to want that kind of detailed understanding of the data, and you may want to put that kind of information up front in part A. If the audience prefers a higher-level perspective, you can reserve that information for part B.
Considering the audience is useful because it can often drive you to do analyses that perhaps you hadn’t thought to do at first. For example, if your boss always wants to see a sensitivity analysis, then it might be wise to do that and put the results in part B, even if you don’t think it’s critically necessary or if it’s tedious to present. On occasion, you might find that the sensitivity analysis in fact sheds light on an unforeseen aspect of the data. It would be nice if there were a “global list of things to do in every analysis”, but there isn’t, and even if there were, it would likely be too long to complete for any specific analysis. So one way to optimize your approach is to consider the audience and what they might want to see, and to merge that with what you think is needed for the analysis.
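As an illustration of the kind of material that might sit in part B, here is a minimal sketch of one simple form of sensitivity analysis for the correlation scenario above. The leave-one-out approach and the simulated data are my own choices for illustration, not something prescribed here.

```python
import numpy as np

def leave_one_out_correlations(x, y):
    """Pearson correlation recomputed with each observation removed."""
    x, y = np.asarray(x), np.asarray(y)
    return np.array([
        np.corrcoef(np.delete(x, i), np.delete(y, i))[0, 1]
        for i in range(len(x))
    ])

# Illustrative data with a genuine linear relationship.
rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = x + rng.normal(scale=0.6, size=50)

r_all = np.corrcoef(x, y)[0, 1]
r_loo = leave_one_out_correlations(x, y)
print(f"Full-data r: {r_all:.2f}")
print(f"Leave-one-out range: {r_loo.min():.2f} to {r_loo.max():.2f}")
# A narrow range suggests no single point is carrying the result;
# a wide range suggests the headline correlation is fragile.
```

A check like this may never make it into the presentation itself, but having it in reserve is exactly what keeps it out of the “Not Done” component.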
If you are the audience, then considering the audience’s needs is a relatively simple task. But often the audience will be separate (thesis committee, journal reviewers/editors, conference attendees) and you will have to make your best effort at guessing. If you have direct access to the audience, then a simpler approach would be to just ask them. But this is a potentially time-consuming task (depending on how long it takes for them to respond) and may not be feasible in the time frame allowed for the analysis.
It’s entirely possible to trust an analysis but not believe the final conclusions. In particular, if this is the first analysis of its kind that you are seeing, there’s almost no reason to believe that the conclusions are true until you’ve seen other independent analyses. An initial analysis may only have limited preliminary data, and you may need to make a decision to invest in collecting more data. Until then, there may be no way to know whether the conclusions are true or not. But the analysis may still be trustworthy in the sense that everything that should have been done was done.
Looking back at the original “presentation” given at the top, one might ask “So, is X correlated with Y?”. Maybe, and there seems to be evidence that it is. However, whether I ultimately believe the result will depend on factors outside the analysis.
You can hear more from me and the JHU Data Science Lab by subscribing to our weekly newsletter Monday Morning Data Science.