The Role of Resources in Data Analysis

Roger Peng
2018-06-18

When learning about data analysis in school, you don’t hear much about the role that resources—time, money, and technology—play in the development of an analysis. This is a conversation that is often had “in the hallway” when talking to senior faculty or mentors. But the available resources do play a significant role in determining what can be done with a given question and dataset. It’s tempting to think that the situation is binary—either you have sufficient resources to do the “right” analysis, or you simply don’t do the analysis. But in the real world there are quite a few shades of gray in between those two endpoints. There are many situations in data analysis where the optimal approach is not feasible, but it is nevertheless important to do some sort of analysis. Thus, a critical skill for a data analyst to master is the ability to reconcile the ideal analysis with the resources at hand while still producing something useful.

All analyses must deal with constraints on time and technology, and those constraints often shape the plan for what can be done. For example, the complexity of the statistical model being used may be constrained by the computing power available to the analyst, the ability to purchase more computing power, and the time available to run complex Markov chain Monte Carlo simulations. The analysis that is needed tomorrow will be different from the analysis that is needed next week, yet the only thing different between the two is the time available to do the work.

The key resources of time, money, and technology have different effects on how a data analysis is ultimately completed:

Approximations

Perhaps the oldest tool that statisticians have in their toolbox for dealing with resource constraints is approximation. Often it is straightforward to write down the exact or ideal solution to a problem, but the computational burden makes it difficult to compute that solution. For example, many Bayesian computations require calculating complex high-dimensional integrals that were intractable before the invention of the digital computer. For complex non-linear solutions, a classic trick is to use a linear approximation and perhaps combine it with an assumption of asymptotic normality.
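
One familiar instance of this trick is the delta method: to get a standard error for a non-linear function of a sample mean, linearize the function and lean on the asymptotic normality of the mean. The sketch below (in Python, with simulated data and exp() chosen arbitrarily as the non-linear function) compares the linear approximation to the computation-heavy bootstrap alternative; none of the specifics are drawn from the examples in this post.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=500)   # simulated data

# Non-linear quantity of interest: theta = exp(mean of x)
xbar = x.mean()
theta_hat = np.exp(xbar)

# Delta method: linearize exp() around xbar and invoke asymptotic normality
# of the sample mean, so se(theta_hat) ~= exp(xbar) * sd(x) / sqrt(n)
se_delta = np.exp(xbar) * x.std(ddof=1) / np.sqrt(len(x))

# The computation-heavy alternative: a nonparametric bootstrap
boot = np.array([np.exp(rng.choice(x, size=len(x), replace=True).mean())
                 for _ in range(2000)])
se_boot = boot.std(ddof=1)

print(se_delta, se_boot)   # the two standard errors should be close
```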

In most cases where computation was intractable, statisticians either resorted to (asymptotic) approximations, substituting difficult calculations with (sometimes dubious) assumptions, or chose different methods. A key point is that the harsh reality of the real world’s resource constraints forced a different approach to analyzing data. While it might be unsatisfying to use a sub-optimal approach, it might be equally unsatisfying to not analyze the data at all.

As computing power has grown over the last century, we have been slowly replacing those old assumptions with computation. There is no need for asymptotic Normality if we can compute a less restrictive solution with a powerful computer. A simple example of this is the two-sample permutation test, which is as powerful as a standard t-test but makes no distributional assumptions. The problem, of course, is that those old assumptions die hard, and even today it can be cumbersome to code up a solution when a formula is right at hand.
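
As a rough sketch of what that computation looks like, here is a two-sample permutation test on the difference in means in Python, compared against the formula-based t-test. The simulated data, group sizes, and number of permutations are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=30)   # simulated group 1
y = rng.normal(0.5, 1.0, size=30)   # simulated group 2

# Formula-based answer: the classical two-sample t-test
t_pvalue = stats.ttest_ind(x, y).pvalue

# Computation-based answer: permutation test on the difference in means,
# which makes no distributional assumptions
observed = x.mean() - y.mean()
pooled = np.concatenate([x, y])
n_perm = 10_000
count = 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)
    diff = perm[:len(x)].mean() - perm[len(x):].mean()
    if abs(diff) >= abs(observed):
        count += 1
perm_pvalue = (count + 1) / (n_perm + 1)

print(t_pvalue, perm_pvalue)   # the two p-values are typically very close
```

The permutation test trades a few seconds of computing for the distributional assumption baked into the t-test, which is exactly the trade described above.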

Cheaper Hierarchical Modeling

One example from my own work involves hierarchical modeling of air pollution and health time series data. In the early 2000s, we were looking at national data on mortality and air pollution in the U.S. We had daily data on mortality and pollution (and many other covariates) in 100 major U.S. cities covering a time span of about 14 years. In order to make efficient use of this huge dataset, the goal was to employ a hierarchical model to estimate both a “national” association between air pollution and mortality and city-specific estimates that borrowed strength across cities. It was a familiar approach that worked perfectly well on smaller datasets. The “right” approach would have been to use a Poisson likelihood for each of the cities (to model the mortality count data) and then have Normal random effects for the intercept and air pollution slopes.

But at the time, we didn’t have a computer that could actually compute the estimate from the model (or in our case, the posterior distributions). So the “right” model was not an option. What we ended up doing was using a Normal approximation for the Poisson likelihood, justified by the fairly large samples that we had, which allowed for a Normal-Normal two-stage model that could be computed without having to load all the data into memory (in the simplest case it could be done in closed form). To this day, this is the standard approach to modeling multi-site time series data of air pollution and health because it is fast, cheap, and easy to understand.
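
A minimal sketch of this kind of two-stage approach is below, written in Python with hypothetical inputs (a list of per-city outcome vectors and design matrices, with the pollution term assumed to sit in a fixed column). The function names, the method-of-moments estimate of the between-city variance, and the data layout are illustrative assumptions, not the exact model we used.

```python
import numpy as np
import statsmodels.api as sm

def stage_one(city_data):
    """Fit a separate Poisson regression in each city and keep only the
    pollution coefficient and its variance (the Normal approximation to
    each city-specific likelihood)."""
    estimates, variances = [], []
    for y, X in city_data:                    # hypothetical layout: X[:, 1] is pollution
        fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
        estimates.append(fit.params[1])       # city-specific pollution slope
        variances.append(fit.bse[1] ** 2)     # its sampling variance
    return np.array(estimates), np.array(variances)

def stage_two(b, v):
    """Combine city-specific slopes with a Normal-Normal model: a simple
    method-of-moments estimate of the between-city variance gives a
    closed-form national estimate plus shrunken city-specific estimates."""
    tau2 = max(0.0, np.var(b, ddof=1) - np.mean(v))   # between-city variance
    w = 1.0 / (v + tau2)                              # precision weights
    national = np.sum(w * b) / np.sum(w)              # pooled "national" slope
    if tau2 > 0:
        shrunk = (b / v + national / tau2) / (1 / v + 1 / tau2)
    else:
        shrunk = np.full_like(b, national)
    return national, shrunk
```

Because each city’s data is touched only once in the first stage, and the second stage works with just one estimate and one variance per city, nothing close to the full dataset ever needs to sit in memory at the same time.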

Trustworthiness

Ultimately, those resource constraints can affect how trustworthy the analysis is. In a trustworthy analysis, what is presented as the analysis is often backed up by many facts and details that are not presented. These other analyses have been done, but the analyst has decided (likely based on a certain narrative of the data) that they do not meet the threshold for presentation. That said, should anyone ask for those details, they are readily available. With greater resources, the sum total of all of the things that can be done is greater, thus giving us hope that the things left undone are orthogonal to what was done.

However, with fewer resources, there are at least two consequences. First, it is likely that fewer things can be done with the data: fewer checks on the data, checks on model assumptions, checks of convergence, model validations, etc. This increases the number of undone things and makes it more likely that they will have an impact on the final (presented) results. Second, certain kinds of analysis may require more time or computing power than is available. In order to present any analysis at all, we may need to resort to approximations or “cheaper” methodology. These approaches are not necessarily incorrect, but they may produce results that are noisier or otherwise sub-optimal. That said, the various other parties involved in the analysis, such as the audience or the patron, may prefer having any analysis done, regardless of optimality, over having no analysis. Sometimes the question itself is still vague or a bit rough, and so it’s okay if the analysis that goes with it is equally “quick and dirty”. Nevertheless, analysts have to draw the line between what is a reasonable analysis and what is not, given the available resources.

Although resource constraints can impair the trustworthiness of an analysis, sometimes the use of approximations to deal with resource constraints can produce benefits. In the example above regarding the air pollution and mortality modeling, the approximation that we used made fitting the models to the data very fast. In this case, the cheapness of the computation allows the analyst to cycle through many different models to examine the robustness of the findings to various confounding factors and to conduct important sensitivity analyses. If each model took days to compute, you might just settle for a single model fit. In other words, it’s possible that resource constraints could produce an analysis that, while approximate, is actually more trustworthy than the optimal analysis.

The Analyst’s Job

The data analyst’s job is to manage the resources available for analysis and produce the best analysis possible subject to the existing constraints. The availability of resources may not be solely up to the analyst, but the job is nevertheless to recognize what is available, determine whether the resources are sufficient for completing a reasonable analysis, and if not, then request more from those who can provide them. I’ve seen many data analyses go astray as a result of a mismatch in the understanding of the resources available versus the resources required.

A good data analyst can minimize the chance of a gross mismatch and will continuously evaluate the resource needs of an analysis as it moves forward. If there appears to be a large deviation between what was expected and the reality of the analysis, then the analyst must communicate with others involved (the patron or perhaps a subject matter expert) to either obtain more resources or modify the data analytic plan. Negotiating additional resources or a modified analytic plan requires the analyst to have a good relationship with the various parties involved.

You can hear more from me and the JHU Data Science Lab by subscribing to our weekly newsletter Monday Morning Data Science.