Simply Statistics: Relationships in Data Analysis

I recently finished reading Steve Coll’s book Directorate S, which is a chronicle of the U.S. war in Afghanistan post 9-11. It’s a good book, and one line stuck out for me as I thought it had relevance for data analysis. In one chapter, Coll writes about Lieutenant Colonel John Loftis, who helped run a training program for U.S. military officials who were preparing to go serve in Afghanistan. In reference to Afghan society, he says, “Everything over there is about relationships.” At the time, Afghanistan had few independent institutions and accomplishing certain tasks depended on knowing certain people and having a good relationship with them.

I find data analysis to be immature as an independent field. It uses many tools–mathematics, statistics, computer science–that are mature and well-studied. But the act of analyzing data is not particularly well-studied. And like any immature organization (or nation), much of data analysis still has to do with human relationships. I think this is an often ignored aspect of data analysis because people hold out hope that we can build the tools and technology to the point where we do not need to rely on relationships. Eventually, we will find the approaches that are universally correct and so there will be little need for discussion.

Human relationships are unstable, unpredictable, and inconsistent. Algorithms and statistical tools are predictable and in some cases, optimal. But for whatever reason, we have not yet been able to completely characterize all of the elements that make a successful data analysis in a “machine readable” format. We haven’t developed the “institutions” of data analysis that can operate without needing the involvement of specific individuals. Therefore, because we have not yet figured out a perfect model for human behavior, data analysis will have to be done by humans for just a bit longer.

In my experience, there are a few key relationships that need to be managed in any data analysis and I discuss them below.

Data Analyst and Patron

At the end of the day, someone has to pay for data analysis, and this person is the patron. This person might have gotten a grant, or signed a customer, or simply identified a need and the resources for doing the analysis. The key thing here is that the patron provides the resources and determines the tools available for analysis. Typically, the resources we are concerned with are time available to the analyst. The Patron, through the allocation of resources, controls the scope of the analysis. If the patron needs the analysis tomorrow, the analysis is going to be different than if they need it in a month.

A bad relationship here can lead to mismatched expectations between the patron and the analyst. Often the patron thinks the analysis should take less time than it really does. Conversely, the analyst may be led to believe that the patron is deliberately allocating fewer resources to the data analysis because of other priorities. None of this is good, and the relationship between the two must be robust enough in order to straighten out any disagreements or confusion.

Data Analyst and Subject Matter Expert

This relationship is critical because the data analyst must learn the context around the data they are analyzing. The subject matter expert can provide that context and can ask the questions that are relevant to the area that the data inform. The expert is also needed to help interpret the results and to potentially place them in a broader context, allowing the analyst to assess the practical significance (as opposed to statistical significance) of the results. Finally, the expert will have a sense of the potential impact of any results from the analysis.

A bad relationship between the analyst and the expert can often lead to

Irrelevant analysis. Lack of communication between the expert and the analyst may lead the analyst to go down a road that is not of interest to the audience, no matter how correct the analysis is. In my experience, this outcome is most common when the analyst does not have any relationship with a subject matter expert.
Mistakes. An analyst’s misunderstanding of some of the data or the data’s context may lead to analysis that is relevant but incorrect. Analysts must be comfortable clarifying details of the data with the expert in order to avoid mistakes.
Biased interpretation. The point here is not that a bad relationship leads to bias, but rather a bad relationship can lead the expert to not trust the analyst and their analysis, leading the expert to rely more strongly on their preconceived notions. A strong relationship between the expert and the analyst could lead to the expert being more open to evidence that contradicts their hypotheses, which can be critical to reducing hidden biases.

Data Analyst and Audience

The data analyst needs to find some way to assess the needs and capabilities of the audience, because there is always an audience. There will likely be many different ways to present the results of an analysis and it is the analyst’s job to figure what might be the best way to make the results acceptable to the audience. Important factors may include how much time the audience has to see the presentation, how mathematically inclined/trained they are, whether they have any background in the subject matter, what “language” they speak, or whether the audience will need to make a clear decision based on your analysis. Similar to the subject matter expert, if the analyst has a bad relationship with the audience, the audience is less likely to trust the analysis and to accept its results. In the worst case, the audience will reject the analysis without seriously considering its merits.

Analysts often have to present the same analysis to multiple audiences, and they should be prepared to shift and rearrange the analysis to suit those multiple audiences. Perhaps a trivial, but real, example of this is when I go to give talks at places where I know a person there has developed a method that is related to my talk, I’ll make sure to apply their method to my data and compare it to other approaches. Sometimes their method is genuinely better than other approaches, but most of the time it performs comparably to other approaches (alas, that is the nature of most statistical research!). Nevetheless, it’s a simple thing to do and usually doesn’t require a lot of extra time, but it can go a long way to establishing trust with the audience. This is just one example of how consideration of the audience can change the underlying analysis itself. It’s not just a matter of how the results are presented.

Implications

It’s tempting to think that the quality of a data analysis only depends on the data and the modeling applied to it. We’re trained to think that if the data are consistent with a set of assumptions, then there is an optimal approach that can be taken to estimate a given parameter (or posterior distribution). But in my experience, that is far from the reality. Often, the quality of an analysis can be driven by the relationships between the analyst and the various people that have a stake in the results. In the worst case scenario, a breakdown in relationships can lead to serious failure.

I think most people who analyze data would say that data analysis is easiest when the patron, subject matter expert, the analyst, and the audience are all the same person. The reason is because the relationships are all completely understood and easy to manage. Communication is simple when it only has to jump from one neuron to another. Doing data analysis “for yourself” is often quick, highly iterative, and easy to make progress. Collaborating with others on data analysis can be slow, rife with miscommunication, and frustrating. One common scenario is where the patron, expert, and the audience are the same person. If there is a good relationship between this person and the analyst, then the data analysis process here can work very smoothly and productively.

Combining different roles into the same person can sometimes lead to conflicts:

Patron is combined with the audience. If the audience is paying for the work, then they may demand results that confirm their pre-existing biases, regardless of what evidence the data may bring.
Subject matter expert and the analyst are the same person. If this person has strong hypotheses about the science, they may be tempted to drive the data analysis in a particular direction that does not contradict those hypotheses. The ultimate audience may object to these analyses if they see contradictory evidence being ignored.
Analyst and audience are the same. This could lead to a situation where the audience is “too knowledgeable” about the analysis to see the forest for the trees. Important aspects of the data may go unnoticed because the analyst is too deep in the weeds. Furthermore, there is potential value and forcing an analyst to translate their findings for a fresh audience in order to ensure that the narrative is clear and that the evidence as strong as they believe.

Separating out roles in to different people can also lead to problems. In particular, if the patron, expert, analyst, and audience are all separate, then the relationships between all four of those people must in some way be managed. In the worst case, there are 6 pairs of relationships that must be on good terms. It may however be possible for the analyst to manage the “information flow” between the various roles, so that the relationships between the various other roles are kept separate for most of the time. However, this is not always possible or good for the analysis, and managing the various people in these roles is often the most difficult aspect of being a data analyst.