

The bright future of applied statistics

In 2013, the Committee of Presidents of Statistical Societies (COPSS) celebrates its 50th anniversary. As part of its celebration, COPSS will publish a book, with contributions from past recipients of its awards, titled “Past, Present and Future of Statistical Science”. Below is my contribution, titled The bright future of applied statistics.

When I was asked to contribute to this issue, titled Past, Present, and Future of Statistical Science, I contemplated my career while deciding what to write about. One aspect that stood out was how much I benefited from the right circumstances. I came to one clear conclusion: it is a great time to be an applied statistician. I decided to describe the aspects of my career that I have thoroughly enjoyed in the past and present, and to explain why this has led me to believe that the future is bright for applied statisticians.

I became an applied statistician while working with David Brillinger on my PhD thesis. When searching for an advisor I visited several professors and asked them about their interests. David asked me what I liked and all I came up with was "I don't know. Music?", to which he responded "That's what we will work on". Apart from the necessary theorems to get a PhD from the Statistics Department at Berkeley, my thesis summarized my collaborative work with researchers at the Center for New Music and Audio Technology. The work
involved separating and parameterizing the harmonic and non-harmonic components of musical sound signals [5]. The sounds had been digitized into data. The work was indeed fun, but I also had my first glimpse into the incredible potential of statistics in a world becoming more and more data-driven.

Despite having expertise only in music, and a thesis that required a CD player to hear the data, fitted models, and residuals (http://www.biostat.jhsph.edu/~ririzarr/Demo/index.html), I was hired by the Department of Biostatistics at Johns Hopkins School of Public Health. Later I realized what was probably obvious to the School’s leadership: that regardless of the subject matter of my thesis, my time series expertise could be applied to several public health applications [1, 2, 8]. The public health and biomedical challenges surrounding me were simply too hard to resist, and my new department knew this. It was inevitable that I would quickly turn into an applied biostatistician.

Since the day I arrived at Hopkins 15 years ago, Scott Zeger, the department chair, has fostered and encouraged faculty to leverage their statistical expertise to make a difference and to have an immediate impact in science. At that time, we were in the midst of a measurement revolution that was transforming several scientific fields into data-driven ones. Being located in a school of public health and next to a medical school, we were surrounded by collaborators working in such fields, including environmental science, neuroscience, cancer biology, genetics, and molecular biology. Much of my work was motivated by collaborations with biologists who, for the first time, were collecting large amounts of data. Biology was changing from a data-poor discipline to a data-intensive one.

A specific example came from the measurement of gene expression. Gene expression is the process by which DNA, the blueprint for life, is copied into RNA, the template for the synthesis of proteins, the building blocks of life. Before microarrays were invented in the 1990s, the analysis of gene expression data amounted to spotting black dots on a piece of paper (see Figure 1A below). With microarrays, this suddenly changed to sifting through tens of thousands of numbers (see Figure 1B). Biologists went from using their eyes to categorize results to having thousands (and now millions) of measurements per sample to analyze. Furthermore, unlike genomic DNA, which is static, gene expression is a dynamic quantity: different tissues express different genes at different levels and at different times. The complexity was exacerbated by unpolished technologies that made measurements much noisier than anticipated. This complexity and level of variability made statistical thinking an important aspect of the analysis. The biologists who used to say, "if I need statistics, the experiment went wrong," were now seeking out our help. The results of these collaborations have led to, among other things, the development of breast cancer recurrence gene expression assays, making it possible to identify patients at risk of distant recurrence following surgery [9].



Figure 1: Illustration of gene expression data before and after microarrays.

When biologists at Hopkins first came to our department for help with their microarray data, Scott put them in touch with me because I had experience with (what were then) large datasets (digitized music signals are represented by 44,100 points per second). The more I learned about the scientific problems and the more data I explored, the more motivated I became. The potential for statisticians to have an impact in this nascent field was clear, and my department was encouraging me to take the plunge. This institutional encouragement and support was crucial, as successfully working in this field made it harder to publish in mainstream statistical journals, an accomplishment that had traditionally been heavily weighted in the promotion process. The message was clear: having an immediate impact on specific scientific fields would be rewarded as much as mathematically rigorous methods with general applicability.

As with my thesis applications, it was clear that to solve some of the challenges posed by microarray data I would have to learn all about the technology. For this I organized a sabbatical with Terry Speed’s group in Melbourne, where they helped me accomplish this goal. During this visit I reaffirmed my preference for attacking applied problems with simple statistical methods, as opposed to overcomplicated ones or newly developed techniques. Learning that finding clever ways of putting the existing statistical toolbox to work was good enough for an accomplished statistician like Terry gave me the necessary confidence to continue working this way. More than a decade later, this continues to be my approach to applied statistics, and it has been instrumental in some of my current collaborative work. In particular, it led to important new biological discoveries made together with Andy Feinberg’s lab [7].

During my sabbatical we developed preliminary solutions that improved precision and aided in the removal of systematic biases in microarray data [6]. I was aware that hundreds, if not thousands, of other scientists were facing the same problematic data and searching for solutions, so I was also thinking hard about ways to share whatever solutions I developed with others. During this time I received an email from Robert Gentleman asking if I was interested in joining a new software project for the delivery of statistical methods for genomics data. This new collaboration eventually became the Bioconductor project (http://www.bioconductor.org), which to this day continues to grow its user and developer base [4]. Bioconductor was the perfect vehicle for having the impact that my department had encouraged me to seek. With Ben Bolstad and others, we wrote an R package that has been downloaded tens of thousands of times [3]. Without the availability of software, the statistical method would not have received nearly as much attention. This lesson served me well throughout my career, as developing software packages has greatly helped disseminate my statistical ideas. The fact that my department and school rewarded software publications provided important support.
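One idea associated with that line of normalization work is quantile normalization: force every array to share the same empirical distribution, so that technical differences between arrays cannot masquerade as biology. Below is a minimal pure-Python sketch of the idea, not the actual Bioconductor implementation; the toy numbers are invented, and ties are broken naively by position.

```python
def quantile_normalize(arrays):
    """Quantile normalization: replace each measurement by the mean,
    across arrays, of the values holding the same rank, so that every
    array ends up with the same empirical distribution."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    ref = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # indices sorted by value
        ranks = [0] * n
        for rank, i in enumerate(order):
            ranks[i] = rank
        normalized.append([ref[ranks[i]] for i in range(n)])
    return normalized

# Three hypothetical arrays, each measuring the same four genes.
arrays = [[5.0, 2.0, 3.0, 4.0],
          [4.0, 1.0, 4.0, 2.0],
          [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(arrays)
# Every normalized array now has identical sorted values, while each
# gene keeps its within-array rank.
```

After this step, a gene that looks high on one array and low on another does so because of its rank among its neighbors, not because one array happened to be globally brighter than the other.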

The impact statisticians have had in genomics is just one example of our field’s accomplishments in the 21st century. In academia, the number of statisticians becoming leaders in fields like environmental sciences, human genetics, genomics, and social sciences continues to grow. Outside of academia, sabermetrics has become a standard approach in several sports (not just baseball) and inspired the Hollywood movie Moneyball. A PhD statistician led the team that won the million-dollar Netflix Prize (http://www.netflixprize.com/). Nate Silver (http://mashable.com/2012/11/07/nate-silver-wins/) proved the pundits wrong by once again using statistical models to predict election results almost perfectly. R has become a widely used programming language. It is no surprise that the number of statistics majors at Harvard has more than quadrupled since 2000 (http://nesterko.com/visuals/statconcpred2012-with-dm/) and that statistics MOOCs are among the most popular (http://edudemic.com/2012/12/the-11-most-popular-open-online-courses/).

The unprecedented advance in digital technology during the second half of the 20th century has produced a measurement revolution that is transforming science. Scientific fields that traditionally relied upon simple data analysis techniques have been turned on their heads by these technologies. Advances such as these have also brought about a shift from hypothesis-driven to discovery-driven research. However, interpreting information extracted from these massive and complex datasets requires sophisticated statistical skills, as one can easily be fooled by patterns that arise by chance. This has greatly elevated the importance of our discipline in biomedical research.
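The ease with which chance patterns fool us is worth demonstrating. In the illustrative simulation below, every number is pure noise: thousands of fake “genes” are correlated with a fake outcome measured on only ten samples, yet the strongest observed correlation is dramatic.

```python
import random

random.seed(1)

def corr(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

n_samples, n_genes = 10, 5000
outcome = [random.gauss(0, 1) for _ in range(n_samples)]  # pure noise
genes = [[random.gauss(0, 1) for _ in range(n_samples)]
         for _ in range(n_genes)]                          # also pure noise

best = max(abs(corr(g, outcome)) for g in genes)
# With 5000 noise features and only 10 samples, the strongest chance
# correlation is typically above 0.8 despite there being no signal at all.
```

Guarding against exactly this kind of multiplicity is a large part of what statisticians brought to the new data-rich sciences.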

I think that the data revolution is just getting started. Datasets are currently being, or have already been, collected that contain, hidden in their complexity, important truths waiting to be discovered. These discoveries will increase the scientific understanding of our world. Statisticians should be excited and ready to play an important role in the new scientific renaissance driven by the measurement revolution.


[1]   NE Crone, L Hao, J Hart, D Boatman, RP Lesser, R Irizarry, and
B Gordon. Electrocorticographic gamma activity during word production
in spoken and sign language. Neurology, 57(11):2045–2053, 2001.

[2]   Janet A DiPietro, Rafael A Irizarry, Melissa Hawkins, Kathleen A
Costigan, and Eva K Pressman. Cross-correlation of fetal cardiac and
somatic activity as an indicator of antenatal neural development. American
journal of obstetrics and gynecology, 185(6):1421–1428, 2001.

[3]   Laurent Gautier, Leslie Cope, Benjamin M Bolstad, and Rafael A
Irizarry. affy—analysis of Affymetrix GeneChip data at the probe level.
Bioinformatics, 20(3):307–315, 2004.

[4]   Robert C Gentleman, Vincent J Carey, Douglas M Bates, Ben Bolstad,
Marcel Dettling, Sandrine Dudoit, Byron Ellis, Laurent Gautier, Yongchao
Ge, Jeff Gentry, et al. Bioconductor: open software development for
computational biology and bioinformatics. Genome biology, 5(10):R80, 2004.

[5]   Rafael A Irizarry. Local harmonic estimation in musical sound signals.
Journal of the American Statistical Association, 96(454):357–367, 2001.

[6]   Rafael A Irizarry, Bridget Hobbs, Francois Collin, Yasmin D
Beazer-Barclay, Kristen J Antonellis, Uwe Scherf, and Terence P Speed.
Exploration, normalization, and summaries of high density oligonucleotide
array probe level data. Biostatistics, 4(2):249–264, 2003.

[7]   Rafael A Irizarry, Christine Ladd-Acosta, Bo Wen, Zhijin Wu, Carolina
Montano, Patrick Onyango, Hengmi Cui, Kevin Gabo, Michael Rongione,
Maree Webster, et al. The human colon cancer methylome shows similar
hypo- and hypermethylation at conserved tissue-specific CpG island shores.
Nature Genetics, 41(2):178–186, 2009.

[8]   Rafael A Irizarry, Clarke Tankersley, Robert Frank, and Susan
Flanders. Assessing homeostasis through circadian patterns. Biometrics,
57(4):1228–1237, 2001.

[9]   Laura J van’t Veer, Hongyue Dai, Marc J Van De Vijver, Yudong D
He, Augustinus AM Hart, Mao Mao, Hans L Peterse, Karin van der Kooy,
Matthew J Marton, Anke T Witteveen, et al. Gene expression profiling
predicts clinical outcome of breast cancer. Nature, 415(6871):530–536, 2002.


Motivating statistical projects

It seems like half the battle in statistics is identifying an important, unsolved problem. In math this is easy: they have a list. So why is it harder for statistics? Since I have to think up projects to work on for my research group, for classes I teach, and for exams we give, I have spent some time thinking about the ways research problems in statistics arise.

I borrowed a page out of Roger’s book and made a little diagram to illustrate my ideas (actually I can’t even claim credit, it was Roger’s idea to make the diagram). The diagram shows the rough relationship of science, data, applied statistics, and theoretical statistics. Science produces data (although there are other sources), the data are analyzed using applied statistical methods, and theoretical statistics concerns the math behind statistical methods. The dotted line indicates that theoretical statistics ostensibly generalizes applied statistical methods so they can be applied in other disciplines. I do think that this type of generalization is becoming harder and harder as theoretical statistics becomes farther and farther removed from the underlying science.

Based on this diagram I see three major sources for statistical problems: 

  1. Theoretical statistical problems. One component of statistics is developing the mathematical and foundational theory that proves we are doing sensible things. This type of problem is often inspired by popular methods that exist or are being developed but lack mathematical detail. Not surprisingly, much of the work in this area is motivated by what is mathematically possible or convenient, rather than by concrete questions that are of concern to the scientific community. This work is important, but the current distance between theoretical statistics and science suggests that the impact will be limited primarily to the theoretical statistics community. 
  2. Applied statistics motivated by convenient sources of data. The best example of this type of problem is the set of analyses in Freakonomics. Since data, big and small, are now abundant, anyone with a laptop and an internet connection can download the Google n-gram data, a microarray dataset from GEO, data about their city, or really data about anything, and perform an applied analysis. These analyses may not be straightforward for computational or statistical reasons and may even require the development of new methods. These problems are often very interesting and clever, and so are often the types of analyses you hear about in newspaper articles about “Big Data”. But they may often be misleading or incorrect, since the underlying questions are not necessarily well founded in scientific questions. 
  3. Applied statistics problems motivated by scientific problems. The final category of statistics problems are those that are motivated by concrete scientific questions. The new sources of big data don’t necessarily make these problems any easier. They still start with a specific question for which the data may not be convenient, and the math is often intractable. But the potential impact of solving a concrete scientific problem is huge, especially if many people who are generating data have a similar problem. Some examples of problems like this are: can we tell if one batch of beer is better than another? How are quantitative characteristics inherited from parent to child? Which treatment is better when some people are censored? How do we estimate variance when we don’t know the distribution of the data? How do we know which variable is important when we have millions of them?
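The beer question is, famously, the one that led W. S. Gosset, writing as “Student” while working at Guinness, to the t-test. As a reminder of how simple the classic toolbox can be, here is the pooled two-sample t statistic in plain Python; the batch scores are invented for illustration.

```python
import statistics

def two_sample_t(x, y):
    """Pooled (equal-variance) two-sample t statistic: the classic
    answer to whether one batch is better than another."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x) +
                  (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    se = (pooled_var * (1 / nx + 1 / ny)) ** 0.5
    return (statistics.mean(x) - statistics.mean(y)) / se

# Hypothetical quality scores for two batches.
batch_a = [4.1, 3.9, 4.3, 4.0, 4.2]
batch_b = [3.6, 3.8, 3.5, 3.9, 3.7]
t = two_sample_t(batch_a, batch_b)
# A |t| well above 2 (here with 8 degrees of freedom) suggests the
# difference between batches is not just noise.
```

The hard part of category 3 problems is rarely the formula; it is knowing which question to ask and whether the assumptions behind a tool this simple actually hold for the data at hand.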

So this leads back to the question: what are the biggest open problems in statistics? I would define these as the “high potential impact” problems from category 3. To answer this question, I think we need to ask ourselves: what are the most common problems people are trying to solve with data but can’t with what is available right now? Roger nailed this when he talked about the role of statisticians in the science club.

Here are a few ideas that could potentially turn into high-impact statistical problems, maybe our readers can think of better ones?

  1. How do we credential students taking online courses at a huge scale?
  2. How do we communicate risk about personalized medicine (or anything else) to a general population without statistical training? 
  3. Can you use social media as a preventative health tool?
  4. Can we perform randomized trials to improve public policy?
Image Credits: The Science Logo is the old logo for the USU College of Science, the R is the logo for the R statistical programming language, the data image is a screenshot of Gapminder, and the theoretical statistics image comes from the Wikipedia page on the law of large numbers.

Edit: I just noticed this paper, which seems to support some of the discussion above. On the other hand, I think just saying more equations = fewer citations falls into category 2 and doesn’t get at the heart of the problem. 

Follow up on "Statistics and the Science Club"

I agree with Roger’s latest post: “we need to expand the tent of statistics and include people who are using their statistical training to lead the new science”. I am perhaps a bit more worried than Roger. Specifically, I worry that talented go-getters interested in leading science via data analysis will achieve this without engaging our research community. 

A quantitatively trained person (an engineer, computer scientist, physicist, etc.) with strong computing skills (knowing Python, C, and shell scripting) who reads, for example, “Elements of Statistical Learning” and learns R is well on their way. Eventually, many of these users of statistics will become developers, and if we don’t keep up, what do they need from us? Our already-written books may be enough. In fact, in genomics, I know several people like this who are already developing novel statistical methods. I want these researchers to be part of our academic departments. Otherwise, I fear we will not be in touch with the problems and data that lead to, quoting Roger, “the most exciting developments of our lifetime.”