Simply Statistics: Interview with Abhi Datta

Editor’s note: This is the next in our series of interviews with early career statisticians and data scientists. Today we are talking to Abhi Datta about his work in large scale spatial analysis and his interest in soccer! Follow him on Twitter at [@datta_science](https://twitter.com/datta_science). If you have recommendations of an (early career) person in academia or industry you would like to see promoted, reach out to Jeff (@jtleek) on Twitter!

SS: Do you consider yourself a statistician, biostatistician, data scientist, or something else?

AD: That is a difficult question for me, as I enjoy working on theory, methods and data analysis and have co-authored diverse papers, ranging from theoretical expositions to papers centered primarily on a complex data analysis. My research interests also span a wide range of areas. A lot of my work on spatial statistics is driven by applications in environmental health and air pollution. Another significant area of my research is developing Bayesian models for epidemiological applications using survey data.

I would say what I enjoy most is developing statistical methodology motivated by a complex application where current methods fall short, applying the method for analysis of the motivating data, and trying to see if it is possible to establish some guarantees about the method through a combination of theoretical studies and empirical experiments that will help to generalize applicability of the method for other datasets. Of course, not all projects involve all the steps, but that is my ideal workflow. Not sure what that classifies me as.

SS: How did you get into statistics? What was your path to ending up at Hopkins?

AD: I was born and grew up in Kolkata, India. I had the option of going for engineering, medical or statistics undergrad. I chose statistics, persuaded by my appreciation for mathematics and the reputation of the statistics program at the Indian Statistical Institute (ISI), Kolkata. I completed my undergrad (BStat) and Master’s (MStat) in Statistics at ISI and I’m thankful I made that choice, as those 5 years at ISI played a pivotal role in my life. Besides getting rigorous training in the foundations of statistics, most importantly, I met my wife Dr. Debashree Ray at ISI.

After my Master’s, I had a brief stint in the finance industry, working for 2 years at Morgan Stanley (in Mumbai and then in New York City) before I joined the PhD program at the Division of Biostatistics at University of Minnesota (UMN) in 2012, where Debashree was pursuing her PhD in Biostatistics. I had initially planned to work in Statistical Genetics as I had done a research project in that area in my Master’s. However, I explored other research areas in my first year and ended up working on spatial statistics under the supervision of my advisor Dr. Sudipto Banerjee, and on high-dimensional data with my co-advisor Dr. Hui Zou from the Department of Statistics at Minnesota. I graduated from Minnesota in 2016 and joined Hopkins Biostat as an Assistant Professor in the Fall of 2016.

SS: You work on large scale spatio-temporal modeling - how do you speed up computations for the bootstrap when the data are very large?

AD: A main computational roadblock in spatio-temporal statistics is working with very big covariance matrices that strain the memory and computing resources typically available in personal computers. Previously, I have developed nearest neighbor Gaussian Processes (NNGP), a Bayesian hierarchical model for inference in massive geospatial datasets. One issue with hierarchical Bayesian models is their reliance on long sequential MCMC runs. Bootstrap, unlike MCMC, can be implemented in an embarrassingly parallel fashion. However, for geospatial data, all observations are correlated across space, which prohibits direct resampling for bootstrap.

In a recent work with my student Arkajyoti Saha, we proposed a semi-parametric bootstrap for inference on large spatial covariance matrices. We use sparse Cholesky factors of spatial covariance matrices to approximately decorrelate the data before resampling for bootstrap. Arkajyoti has implemented this in an R-package BRISC: Bootstrap for rapid inference on spatial covariances. BRISC is extremely fast and at the time of publication, to my knowledge, it was the only R-package that offered inference on all the spatial covariance parameters without using MCMC. The package can also be used simply for super-fast estimation and prediction in geo-statistics.
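Editor’s note: the decorrelate-then-resample idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the BRISC implementation: it uses a small dense Cholesky factor where BRISC uses sparse factors, it treats the covariance parameters as known and re-estimates only the regression coefficients, and the function names (`exp_cov`, `gls_beta`, `decorrelated_bootstrap`) and the simulated data are hypothetical.

```python
import numpy as np

def exp_cov(coords, sigma2, phi, tau2):
    """Exponential spatial covariance plus a nugget (an assumed model)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * d) + tau2 * np.eye(len(coords))

def gls_beta(X, y, L):
    """Generalized least squares via whitening with the Cholesky factor L."""
    Xw = np.linalg.solve(L, X)
    yw = np.linalg.solve(L, y)
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return beta

def decorrelated_bootstrap(X, y, cov, n_boot=200, seed=0):
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(cov)            # dense here; sparse in practice
    beta_hat = gls_beta(X, y, L)
    # Whitened residuals are approximately uncorrelated, so they can be resampled.
    resid = np.linalg.solve(L, y - X @ beta_hat)
    resid = resid - resid.mean()
    boot = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):                # replicates are independent of each other
        e_star = rng.choice(resid, size=len(y), replace=True)
        y_star = X @ beta_hat + L @ e_star  # re-correlate the resampled residuals
        boot[b] = gls_beta(X, y_star, L)
    return beta_hat, boot

# Simulated example (all values are made up for illustration).
rng = np.random.default_rng(1)
n = 150
coords = rng.uniform(size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
cov = exp_cov(coords, sigma2=1.0, phi=3.0, tau2=0.1)
y = X @ np.array([1.0, 2.0]) + np.linalg.cholesky(cov) @ rng.normal(size=n)

beta_hat, boot = decorrelated_bootstrap(X, y, cov)
ci = np.percentile(boot, [2.5, 97.5], axis=0)  # percentile intervals per coefficient
```

Because each pass through the loop depends only on the shared residuals and factor, the replicates could be distributed across cores, which is the contrast with sequential MCMC drawn in the answer above.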

SS: You have a cool paper on mapping local and global trait variation in plant distributions. How did you get involved in that collaboration? Does your modeling have implications for people studying the impacts of climate change?

AD: In my final year of PhD at UMN, I was awarded the Inter-Disciplinary Doctoral Fellowship – a fantastic initiative by the graduate school at UMN providing research and travel funding, and office space to work with an inter-disciplinary team of researchers on a collaborative project. In my IDF, mentored by Dr. Arindam Banerjee and Dr. Peter Reich, I worked with a group of climate modelers, ecologists and computer scientists from several institutions on a project whose eventual goal is to improve carbon projections from climate models.

The paper you mention was aimed at improving the global characterization of plant traits (measurements). This is important as plant trait values are critical inputs to climate models. Even the largest plant trait database, TRY, offers poor geographical coverage, with little or no data across many large geographical regions. We used the fast NNGP approach I had been developing in my PhD to spatially gap-fill the plant trait data to create a global map of important plant traits with proper uncertainty quantification. The collaboration was a great learning experience for me on how to conduct a complex data analysis, and how to communicate with scientists.

Currently, we are looking at ways to incorporate the uncertainty quantified trait values as inputs to Earth System Models (ESMs) – the land component of climate models. We hope that replacing single trait values with entire trait distributions as inputs to these models will help to better propagate the uncertainty and improve the final model projections.

SS: What project has you most excited at the moment?

AD: There are two. I have been working with Dr. Scott Zeger on a project led by Dr. Agbessi Amouzou in the Department of International Health at Hopkins, aiming to estimate the cause-specific mortality fractions (CSMF) of child mortality in Mozambique using family questionnaire data (verbal autopsy). Verbal autopsies are often used as a surrogate for full autopsy in many countries, and there exists software that uses these questionnaire data to predict a cause for every death. However, such software is usually trained on some standard training data and yields inaccurate predictions in a local context. This problem is a special case of transfer learning, where a model trained using data representing a standard population offers poor predictive accuracy when specific populations are of interest. We have developed a general approach for transfer learning of classifiers that uses the predictions from the verbal autopsy software and limited full autopsy data from the local population to provide improved estimates of cause-specific mortality fractions. The approach is very general, offers a parsimonious model-based solution to transfer learning, and can be used in any other classification-based application.
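Editor’s note: to make the calibration idea concrete, here is a deliberately simplified, non-Bayesian sketch in Python. It is not the method from the project described above. It assumes the classifier's errors can be summarized by a misclassification matrix estimated from a small locally labeled sample, and then recovers population cause fractions from the classifier's predictions on unlabeled deaths with a standard EM update for mixture weights. The function names, the three causes and all numbers are hypothetical.

```python
import numpy as np

def calibrate_csmf(pred_unlabeled, pred_labeled, true_labeled, n_causes, n_iter=500):
    """Estimate cause fractions p from classifier output, given a small labeled set."""
    # M[i, j] approximates P(classifier predicts j | true cause is i),
    # estimated from local labeled data with a small pseudo-count for smoothing.
    M = np.full((n_causes, n_causes), 0.5)
    for t, q in zip(true_labeled, pred_labeled):
        M[t, q] += 1
    M /= M.sum(axis=1, keepdims=True)

    counts = np.bincount(pred_unlabeled, minlength=n_causes)
    p = np.full(n_causes, 1.0 / n_causes)
    for _ in range(n_iter):
        joint = p[:, None] * M                           # P(true i, predicted j)
        post = joint / joint.sum(axis=0, keepdims=True)  # P(true i | predicted j)
        p = post @ counts / counts.sum()                 # EM update of the fractions
    return p

def simulate_preds(rng, truth, M):
    """Draw a predicted cause for each death from row M[true cause]."""
    return np.array([rng.choice(len(M), p=M[t]) for t in truth])

# Simulated example: a classifier that systematically confuses causes locally.
rng = np.random.default_rng(0)
p_true = np.array([0.6, 0.3, 0.1])
M_true = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.7, 0.2],
                   [0.2, 0.1, 0.7]])

truth_unlabeled = rng.choice(3, size=5000, p=p_true)
pred_unlabeled = simulate_preds(rng, truth_unlabeled, M_true)
true_labeled = np.repeat(np.arange(3), 300)   # limited local "full autopsy" data
pred_labeled = simulate_preds(rng, true_labeled, M_true)

p_raw = np.bincount(pred_unlabeled, minlength=3) / len(pred_unlabeled)
p_cal = calibrate_csmf(pred_unlabeled, pred_labeled, true_labeled, n_causes=3)
err_raw = np.abs(p_raw - p_true).sum()
err_cal = np.abs(p_cal - p_true).sum()
```

Comparing `err_raw` with `err_cal` shows how much of the classifier's systematic bias the local labeled data can correct in this toy setting.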

The second project involves creating high-resolution space-time maps of particulate matter (PM2.5) in Baltimore. Currently a network of low-cost air pollution monitors is being deployed in Baltimore that promises to offer air pollution measurements at a much higher geospatial resolution than what is provided by EPA’s sparse regulatory monitoring network. I was awarded a Bloomberg American Health Initiative Spark award for working with Dr. Kirsten Koehler in the Department of Environmental Health and Engineering to combine the low-cost network data, the sparse EPA data and other land-use covariates to create uncertainty-quantified maps of PM2.5 at an unprecedented spatial resolution. We have just started analyzing the first two months of data and I’m really looking forward to helping create the end-product and understanding how PM2.5 levels vary across the different neighborhoods in Baltimore.

SS: You have an interest in soccer, and spatio-temporal models have played an increasing role in soccer analytics. Have you thought about using your statistics skills to study soccer, or do you try to avoid mixing professional work and being a fan?

AD: Yes, I’m an avid soccer fan. I travelled to Brazil in 2014 and Russia in 2018 to watch live games at the World Cups. It also unfortunately means that I set my alarm earlier on weekends than on weekdays, as the European league games start pretty early in US time.

However, until recently, I’ve been largely ignorant of applications of spatio-temporal statistics in soccer analytics. I just finished teaching a Spatial Statistics course and one of the students presented fascinating work he had done on predicting players’ scoring abilities using spatial statistics. I certainly plan to read more literature on this and maybe one day I can contribute. Till then I remain a fan.