Sunday data/statistics link roundup (6/16/13 - Father's day edition!)

  1. Datapalooza! I'm wondering where my invite is? I do health data stuff, pick me, pick me! Actually it does sound like a pretty good idea - in general giving a bunch of smart people access to interesting data and real science problems can produce some cool results (link via Dan S.)
  2. This report on precision medicine from the Manhattan Institute is related to my post this week on personalized medicine. I like the idea that we should be focusing on developing new ideas for adaptive trials (my buddy Michael is all over that stuff). I did think that it was a little pie-in-the-sky, with plenty of buzzwords like Bayesian causal networks and pattern recognition. I think these ideas are certainly applicable, but the report, I think, overstates the current level of applicability of these methods. We need more funding and way more research to support this area before we should automatically adopt it - big data can be used to confuse when methods aren't well understood (link via Rafa via Marginal Revolution).
  3. rOpenSci wins a grant from the Sloan Foundation! Psyched to see this kind of innovative open software development get the support it deserves. My favorite rOpenSci package is rFigshare, what's yours?
  4. A k-means approach to detecting what will be trending on Twitter. It always gets me so pumped up to see the creative ways that methods that have been around forever can be adapted to solve real, interesting problems.
  5. Finally, I thought this link was very appropriate for father's day. I couldn't agree more that the best kind of learning happens when you are just so into something that you forget you are learning. Happy father's day everyone!
Posted in Uncategorized | 1 Comment

The vast majority of statistical analysis is not performed by statisticians

Whether you know it or not, everything you do produces data - from the websites you read to the rate at which your heart beats. Until pretty recently, most of the data you produced wasn’t collected; it floated off unmeasured. The only data that were collected were painstakingly gathered by scientists one number at a time in small experiments with a few people. This laborious process meant that data were expensive and time-consuming to collect. Yet many of the most amazing scientific discoveries over the last two centuries were squeezed from just a few data points. But over the last two decades, the unit price of data has dramatically dropped. New technologies touching every aspect of our lives, from our money, to our health, to our social interactions, have made data collection cheap and easy (see e.g. Camp Williams).

To give you an idea of how steep the drop in the price of data has been, in 1967 Stanley Milgram did an experiment to determine the number of degrees of separation between two people in the U.S. In his experiment he sent 296 letters to people in Omaha, Nebraska and Wichita, Kansas. The goal was to get the letters to a specific person in Boston, Massachusetts. The trick was people had to send the letters to someone they knew, and they then sent it to someone they knew and so on. At the end of the experiment, only 64 letters made it to the individual in Boston. On average, the letters had gone through 6 people to get there. This is where the idea of “6-degrees of Kevin Bacon” comes from, and it was based on 64 data points. A 2007 study updated that number to “7 degrees of Kevin Bacon”. The study was based on 30 billion instant messaging conversations collected over the course of a month or two with the same amount of effort.

Once data started getting cheaper to collect, it got cheaper fast. Take another example, the human genome. The genome is the unique DNA code in every one of your cells. It consists of a set of 3 billion letters that is unique to you. By many measures, the race to be the first group to collect all 3 billion letters from a single person kicked off the data revolution in biology. The project was completed in 2000 after a decade of work and $3 billion to collect the 3 billion letters in the first human genome. This project was actually a stunning success; most people thought it would be much more expensive. But just over a decade later, new technology means that we can now collect all 3 billion letters from a person’s genome for about $10,000 in about a week.

 As the price of data dropped so dramatically over the last two decades, the division of labor between analysts and everyone else became less and less clear. Data became so cheap that it couldn’t be confined to just a few highly trained people. So raw data started to trickle out in a number of different ways. It started with maps of temperatures across the U.S. in newspapers and quickly ramped up to information on how many friends you had on Facebook, the price of tickets on 50 airlines for the same flight, or measurements of your blood pressure, good cholesterol, and bad cholesterol at every doctor’s visit. Arguments about politics started focusing on the results of opinion polls and who was asking the questions. The doctor stopped telling you what to do and started presenting you with options and the risks that went along with each.

That is when statisticians stopped being the primary data analysts. At some point, the trickle of data about you, your friends, and the world started impacting every component of your life. Now almost every decision you make is based on data you have about the world around you. Let’s take something simple, like where you are going to eat tonight. You might just pick the nearest restaurant to your house. But you could also ask your friends on Facebook where you should eat, or read reviews on Yelp, or check out menus on the restaurants' websites. All of these are pieces of data that are collected and presented for you to "analyze".

This revolution demands a new way of thinking about statistics. It has precipitated explosive growth in data visualization - the most accessible form of data analysis. It has encouraged explosive growth in MOOCs like the ones Roger, Brian and I taught. It has created open data initiatives in government. It has also encouraged more accessible data analysis platforms in the form of startups like StatWing that make it easier for non-statisticians to analyze data.

What does this mean for statistics as a discipline? Well it is great news in that we have a lot more people to train. It also really drives home the importance of statistical literacy. But it also means we need to adapt our thinking about what it means to teach and perform statistics. We need to focus increasingly on interpretation and critique and away from formulas and memorization (think English composition versus grammar). We also need to realize that the most impactful statistical methods will not be used by statisticians, which means we need more foolproofing, more time automating, and more time creating software. The potential payout is huge for realizing that the tide has turned and most people who analyze data aren't statisticians.


False discovery rate regression (cc NSA's PRISM)

There is an idea I have been thinking about for a while now. It re-emerged at the top of my list after seeing this really awesome post on using metadata to identify "conspirators" in the American revolution. My first thought was: but how do you know that you aren't just making lots of false discoveries?

Hypothesis testing and significance analysis were originally developed to make decisions for single hypotheses. In many modern applications, it is more common to test hundreds or thousands of hypotheses. In the standard multiple testing framework, you perform a hypothesis test for each of the "features" you are studying (these are typically genes or voxels in high-dimensional problems in biology, but can be other things as well). Then the following outcomes are possible:

              Call Null True    Call Null False    Total
Null True     True Negatives    False Positives    True Nulls
Null False    False Negatives   True Positives     False Nulls
              No Decisions      Rejections

The reason for "No Decisions" is that the way hypothesis testing is set up, one should technically never accept the null hypothesis. The number of rejections is the total number of times you claim that a particular feature shows a signal of interest.

A very common measure of embarrassment in multiple hypothesis testing scenarios is the false discovery rate defined as:

 FDR = E\left[\frac{\# \text{ of False Positives}}{\# \text{ of Rejections}}\right].

There are some niceties that have to be dealt with here, like the fact that the number of rejections may be equal to zero, inspiring things like the positive false discovery rate, which has some nice Bayesian interpretations.
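To make the table and the FDR definition concrete, here is a minimal simulated sketch. The 80/20 split between nulls and alternatives, the Beta(1,50) alternative, and the 0.05 cutoff are all assumptions chosen for illustration. It computes the false discovery proportion for one experiment, using the convention that the proportion is 0 when nothing is rejected. The FDR is the average of that quantity over repeated experiments.

```r
## Simulate one multiple-testing experiment and tabulate the outcomes
set.seed(2013)
ntest <- 1000
isNull <- rbinom(ntest, size = 1, prob = 0.8) == 1   # TRUE when the null is true

## Null p-values are U(0,1); alternative p-values are Beta(1,50) (an assumption)
p <- ifelse(isNull, runif(ntest), rbeta(ntest, 1, 50))

## Reject whenever the p-value is below a fixed cutoff
reject <- p < 0.05
V <- sum(reject & isNull)     # number of false positives
R <- sum(reject)              # number of rejections

## False discovery proportion for this experiment; set to 0 when R = 0
fdp <- if (R > 0) V / R else 0

## The 2 x 2 table of outcomes, laid out like the table above
print(table(NullTrue = isNull, Rejected = reject))
print(fdp)
```

Averaging `fdp` over many simulated experiments would approximate the FDR for this cutoff.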

The way the process usually works is that a test statistic is calculated for each hypothesis test, where a larger statistic means more significant, and then operations are performed on these ordered statistics. The two most common operations are: (1) pick a cutoff along the ordered list of p-values, call everything less than this threshold significant, and estimate the FDR for that cutoff; and (2) pick an acceptable FDR level and find an algorithm to pick the threshold that controls the FDR, where control is usually defined by saying something like the algorithm produces E[FDP] \leq \alpha , where FDP is the false discovery proportion and \alpha is the chosen FDR level.
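The two operations can be sketched in a few lines of R. This is only an illustration under assumptions of my choosing: simulated p-values with 800 nulls and 200 alternatives drawn from Beta(1,200), a cutoff of 0.01, and a tuning parameter of 0.5 for estimating the null proportion. Operation (1) uses a plug-in estimate in the style of Storey's estimator, and operation (2) uses the Benjamini-Hochberg adjustment available through `p.adjust` in base R.

```r
## Simulated p-values: 800 nulls from U(0,1), 200 alternatives from Beta(1,200)
set.seed(42)
ntest <- 1000
p <- c(runif(800), rbeta(200, 1, 200))

## (1) Fix a cutoff and estimate the FDR at that cutoff.
## The null proportion is estimated from the p-values above lambda,
## then FDR-hat = pi0hat * cutoff * ntest / (number of rejections).
cutoff <- 0.01
lambda <- 0.5
pi0hat <- min(1, mean(p > lambda) / (1 - lambda))
nReject <- sum(p <= cutoff)
fdrHat <- pi0hat * cutoff * ntest / max(nReject, 1)

## (2) Fix an FDR level and let Benjamini-Hochberg pick the threshold
alpha <- 0.05
bhReject <- p.adjust(p, method = "BH") <= alpha

print(c(estimatedFDR = fdrHat, bhRejections = sum(bhReject)))
```

Both operations work on the ordered p-values only, which is exactly the nested rejection region idea: nothing about a test other than its p-value enters the decision.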

Regardless of the approach these methods usually make an assumption that the rejection regions should be nested. In other words, if you call statistic $T_k$ significant and $T_j > T_k$ then your method should also call statistic $T_j$ significant. In the absence of extra information, this is a very reasonable assumption.

But in many situations you might have additional information you would like to use in the decision about whether to reject the null hypothesis for test $j$.

Example 1 A common example is gene-set analysis. Here you have a group of hypotheses that you have tested individually and you want to say something about the level of noise in the group. In this case, you might want to know something about the level of noise if you call the whole set interesting.

Example 2 Suppose you are a mysterious government agency and you want to identify potential terrorists. You observe some metadata on people and you want to predict who is a terrorist - say using betweenness centrality. You could calculate a P-value for each individual, say using a randomization test. Then estimate your FDR based on predictions using the metadata.

Example 3 You are monitoring a system over time where observations are random. Say for example whether there is an outbreak of a particular disease in a particular region at a given time. So, is the rate of disease higher than background. How can you estimate the rate at which you make false claims?
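For Example 3, one simple way to get a P-value for each time point is a one-sided Poisson test against the background rate. The sketch below uses simulated weekly counts, an assumed known background rate of 10 cases per week, and a made-up outbreak in weeks 30 to 34; none of these numbers come from real data.

```r
## Hypothetical weekly case counts for one region (simulated, not real data)
set.seed(7)
nweeks <- 52
background <- 10                         # assumed known background rate per week
outbreak <- seq_len(nweeks) %in% 30:34   # weeks with an elevated rate
cases <- rpois(nweeks, lambda = ifelse(outbreak, 2 * background, background))

## One-sided p-value for each week: P(X >= observed) under the background rate.
## ppois(q, lower.tail = FALSE) gives P(X > q), so we pass cases - 1.
pValues <- ppois(cases - 1, lambda = background, lower.tail = FALSE)

## Weeks flagged at a fixed cutoff
flagged <- which(pValues < 0.01)
print(flagged)
```

The question in the post is then how often the flagged weeks are false claims, and whether that rate changes over time.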

For now I'm going to focus on the estimation scenario but you could imagine using these estimates to try to develop controlling procedures as well.

In each of these cases you have a scenario where you are interested in something like:

 E\left[\frac{V}{R} | X=x\right] = fdr(x)

where V is the number of false positives, R is the number of rejections, and fdr(x) is the covariate-specific false discovery rate. Returning to our examples you could imagine:

Example 1

 E\left[\frac{V}{R} | GS = k\right] =\beta_0 + \sum_{\ell=1}^K\beta_{\ell} 1(GS=\ell)

Example 2

 E\left[\frac{V}{R} | Person , Age\right] =\beta_0 + \gamma Age + \sum_{\ell=1}^K\beta_{\ell}1(Person = \ell)

Example 3

 E\left[\frac{V}{R} | Time \right] =\beta_0 + \sum_{\ell =1}^{K} s_{\ell}(Time)

where in the last case, we have parameterized the relationship between FDR and time with a flexible model like cubic splines.

The hard problem is fitting the regression models in Examples 1-3. Here I propose a basic estimator of the FDR regression model and leave it to others to be smart about it. Let's focus on P-values because they are the easiest to deal with. Suppose that we calculate the random variables Y_i = 1(P_i > \lambda) . Then:

 E[Y_i] = Prob(P_i > \lambda) = (1-\lambda)*\pi_0 + (1-G(\lambda))*(1-\pi_0)

where $G(\lambda)$ is the cumulative distribution function of the P-values under the alternative hypothesis, evaluated at $\lambda$. This may be a mixture distribution. If we assume reasonably powered tests and that $\lambda$ is large enough, then G(\lambda) \approx 1 . So

 E[Y_i] \approx (1-\lambda) \pi_0
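A quick numerical check of this approximation, under assumed values of my choosing (a null proportion of 0.7, a Beta(1,50) alternative, and \lambda = 0.8): the mean of the indicators should land near (1-\lambda)\pi_0 = 0.14, and dividing by (1-\lambda) should recover something near \pi_0.

```r
## Check E[Y_i] ~ (1 - lambda) * pi0 on simulated p-values
set.seed(101)
ntest <- 100000
pi0 <- 0.7          # assumed true null proportion
lambda <- 0.8

isNull <- rbinom(ntest, size = 1, prob = pi0) == 1
p <- ifelse(isNull, runif(ntest), rbeta(ntest, 1, 50))

## Indicator that the p-value exceeds lambda
y <- p > lambda

print(mean(y))                  # should be close to (1 - lambda) * pi0
print(mean(y) / (1 - lambda))   # should be close to pi0
```

The approximation works here because almost no alternative P-values exceed 0.8 under the Beta(1,50) assumption, so nearly all of the indicators that equal one come from nulls.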

One obvious choice is then to try to model

 E[Y_i | X = x] \approx (1-\lambda) \pi_0(x)

We could, for example use the model:

 logit(E[Y_i | X = x]) = f(x)

where f(x) is a linear model or spline, etc. Then we get the fitted values and calculate:

\hat{\pi}_0(x) = \hat{E}[Y_i | X=x] /(1-\lambda)

Here is a little simulated example where the goal is to estimate the probability of being a false positive as a smooth function of time.


## Load libraries

library(splines)
## Define the number of tests
set.seed(1345)
ntest <- 1000

## Set up the time vector and the probability of being null
tme <- seq(-2,2,length=ntest)
pi0 <- pnorm(tme)

## Calculate a random variable indicating whether to draw
## the p-values from the null or alternative
nullI <- rbinom(ntest,prob=pi0,size=1) > 0

## Sample the null P-values from U(0,1) and the alternatives
## from a beta distribution

pValues <- rep(NA,ntest)
pValues[nullI] <- runif(sum(nullI))
pValues[!nullI] <- rbeta(sum(!nullI),1,50)

## Set lambda and fit the logit model for E[Y | time]

lambda <- 0.8
y <- pValues > lambda
glm1 <- glm(y ~ ns(tme,df=3), family=binomial)

## Get the estimated pi0 values
pi0hat <- fitted(glm1)/(1-lambda)

## Plot the real versus fitted probabilities

plot(pi0,pi0hat,col="blue",type="l",lwd=3,xlab="Real pi0",ylab="Fitted pi0")
abline(c(0,1),col="grey",lwd=3)

The result is this plot:
Figure: Real versus estimated false discovery rate when calling all tests significant.

This estimate is obviously not guaranteed to estimate the FDR well. The operating characteristics need to be evaluated both theoretically and empirically, and the other examples need to be fleshed out. But isn't the idea of FDR regression cool?
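As a first step toward the empirical evaluation, here is a rough sketch that repeats the simulation a handful of times and records how far the fitted curve is from the truth. The helper name `simulateOnce`, the 20 replicates, the logistic fit, and the truncation of the estimate at 1 are my assumptions for illustration, not a worked-out evaluation.

```r
library(splines)

## Hypothetical helper: simulate one data set as in the example above and
## return the mean absolute error of the fitted pi0(time) curve
simulateOnce <- function(ntest = 1000, lambda = 0.8) {
  tme <- seq(-2, 2, length.out = ntest)
  pi0 <- pnorm(tme)
  nullI <- rbinom(ntest, size = 1, prob = pi0) > 0

  ## Null p-values from U(0,1), alternatives from Beta(1,50)
  pValues <- rep(NA, ntest)
  pValues[nullI] <- runif(sum(nullI))
  pValues[!nullI] <- rbeta(sum(!nullI), 1, 50)

  ## Logistic regression of the indicator on a natural spline in time
  y <- as.numeric(pValues > lambda)
  fit <- glm(y ~ ns(tme, df = 3), family = binomial)

  ## Truncate at 1 because pi0 is a probability
  pi0hat <- pmin(fitted(fit) / (1 - lambda), 1)
  mean(abs(pi0hat - pi0))
}

set.seed(2024)
errs <- replicate(20, simulateOnce())
print(summary(errs))
```

A serious evaluation would also vary \lambda, the spline degrees of freedom, and the alternative distribution, since the approximation G(\lambda) \approx 1 depends on all of them.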


Personalized medicine is primarily a population-health intervention

There has been a lot of discussion of personalized medicine, individualized health, and precision medicine in the news and in the medical research community. Despite this recent attention, it is clear that healthcare has always been personalized to some extent. For example, men are rarely pregnant and heart attacks occur more often among older patients. In these cases, easily collected variables such as sex and age can be used to predict health outcomes and therefore to "personalize" healthcare for those individuals.

So why the recent excitement around personalized medicine? The reason is that it is increasingly cheap and easy to collect more precise measurements about patients that might be able to predict their health outcomes. An example that has recently been in the news is the measurement of mutations in the BRCA genes. Angelina Jolie made the decision to undergo a prophylactic double mastectomy based on her family history of breast cancer and measurements of mutations in her BRCA genes. Based on these measurements, previous studies had suggested she might have a lifetime risk as high as 80% of developing breast cancer.

This kind of scenario will become increasingly common as newer and more accurate genomic screening and predictive tests are used in medical practice. When I read these stories there are two points I think of that sometimes get obscured by the obviously fraught emotional, physical, and economic considerations involved with making decisions on the basis of new measurement technologies:

  1. In individualized health/personalized medicine the "treatment" is information about risk. In some cases treatment will be personalized based on assays. But in many other cases, we still do not (and likely will not) have perfect predictors of therapeutic response. In those cases, the healthcare will be "personalized" in the sense that the patient will get more precise estimates of their likelihood of survival, recurrence etc. This means that patients and physicians will increasingly need to think about/make decisions with/act on information about risks. But communicating and acting on risk is a notoriously challenging problem; personalized medicine will dramatically raise the importance of understanding uncertainty.
  2. Individualized health/personalized medicine is a population-level treatment. Assuming that the 80% lifetime risk estimate was correct for Angelina Jolie, it still means there is a 1 in 5 chance she was never going to develop breast cancer. If that had been her case, then the surgery was unnecessary. So while her decision was based on personal information, there is still uncertainty in that decision for her. So the "personal" decision may not always be the "best" decision for any specific individual. It may, however, be the best thing to do for everyone in a population with the same characteristics.
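The population-level reading of the 80% figure can be illustrated with a tiny simulation. The population of 10,000 people who all share exactly the same 80% lifetime risk is hypothetical, and the sketch assumes the risk estimate is correct.

```r
## Hypothetical population of 10,000 people who all share an 80% lifetime risk
set.seed(5)
npeople <- 10000
risk <- 0.8

## TRUE for each person who would go on to develop the disease
develops <- rbinom(npeople, size = 1, prob = risk) == 1

## Fraction for whom a preventive intervention would have been unnecessary;
## this should be close to 1 - risk
print(mean(!develops))
```

Nobody in this population can be told in advance which group they fall into, which is why the decision rule is evaluated over the population rather than for one person.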

Why not have a "future of the field" session at a conference with only young speakers?

I'm in the process of trying to get together a couple of sessions to submit to ENAR 2014. I'm pretty psyched about the topics and am looking forward to hosting the conference in Baltimore. It is pretty awesome to have one of the bigger stats conferences on our home turf and we are going to try to be well represented at the conference.

While putting the sessions together I've been thinking about what my favorite characteristics of sessions at stats conferences are. Alyssa has a few suggestions for speakers which I'm completely in agreement with, but I'm talking about whole sessions. Since statistics is often concerned primarily with precision/accuracy the talks tend to be a little bit technical and sometimes dry. Even on topics I really am excited about, people try not to exaggerate. I think overall this is a great quality, but I'd prefer to be entertained at a conference. I realized that one of my favorite kinds of sessions is the "future of statistics" session.

My only problem is that future of the field talks are always given by luminaries who have a lot of experience. This isn't surprising, since (1) they are famous and their names are a big draw, (2) they have made lots of interesting/unique contributions, and (3) they are established so they don't have to worry about being a little imprecise.

But I'd love to see a "future of the field" session with only people who are students/postdocs/first year assistant professors. These are the people who will really be the future of the field and are often more on top of new trends. It would be so cool to see four or five of the most creative young people in the field making bold predictions about where we will go as a discipline. Then you could have one senior person discuss the talks and give some perspective on how realistic the visions would be in light of past experience.

Tell me that wouldn't be an awesome conference session.

 


Sunday data/statistics link roundup (6/2/13)

  1. Awesome, a GUI for d3 graphs. Via John M.
  2. Tom L. on why statistics matter, especially at the Census!
  3. I've been spending the last several weeks house hunting like crazy, so the idea of data on schools is high on my mind right now. So this link to data on geography of school attendance in DC seemed particularly interesting (via Rafa).
  4. A student dramatically reduces the cost of the self-driving car. The big technological breakthrough? Sampling! (via Marginal Revolution).

What statistics should do about big data: problem forward not solution backward

There has been a lot of discussion among statisticians about big data and what statistics should do to get involved. Recently Steve M. and Larry W. took up the same issue on their blog. I have been thinking about this for a while, since I work in genomics, which almost always comes with "big data". It is also one area of big data where statistics and statisticians have played a huge role.

A question that naturally arises is, "why have statisticians been so successful in genomics?" I think a major reason is the phrase I borrowed from Brian C. (who may have borrowed it from Ron B.)

problem forward, not solution backward

One of the reasons that "big data" is even a term is that data are less expensive than they were a few years ago. One example is the dramatic drop in the price of DNA-sequencing. But there are many, many more examples. The quantified self movement and Fitbits, Google Books, social network data from Twitter, etc. are all areas where data that cost us a huge amount to collect 10 years ago can now be collected and stored very cheaply.

As statisticians we look for generalizable principles; I would say that you have to zoom pretty far out to generalize from social networks to genomics but here are two:

  1. The data can't be easily analyzed in an R session on a simple laptop (say low Gigs to Terabytes)
  2. The data are generally quirky and messy (unstructured text, json files with lots of missing data, fastq files with quality metrics, etc.)

So how does one end up at the "leading edge" of big data? By being willing to deal with the schlep and work out the nitty-gritty of how you apply even standard methods to data sets where taking the mean takes hours. Or taking the time to learn all the kinks that are specific to, say, how one processes a microarray, and then taking the time to fix them. This is why statisticians were so successful in genomics: they focused on the practical problems, and this gave them access to data no one else had or could use properly.

Doing these things requires a lot of effort that isn't elegant. It also isn't "statistics" by the definition that only mathematical methodology is statistics. Steve alludes to this in his post when he says:

Frankly I am a little disappointed that there does not seem to be any really compelling new idea (e.g. as in neural nets or the kernel embedding idea that drove machine learning).

I think this is a view shared by many statisticians. That since there isn't a new elegant theory yet, there aren't "new ideas" in big data. That focus is solution backward. We want an elegant theory that we can then apply to specific problems if they happen to come up.

The alternative is problem forward. The fact that we can collect data so cheaply means we can measure and study things we never could before. Computer scientists, physicists, genome biologists, and others are leading in big data precisely because they aren't thinking about the statistical solution. They are thinking about solving an important scientific problem and are willing to deal with all the dirty details to get there. This allows them to work on data sets and problems that haven't been considered by other people.

In genomics, this has happened before. In that case, the invention of microarrays revolutionized the field and statisticians jumped on board, working closely with scientists, handling the dirty details, and building software so others could too. As a discipline if we want to be part of the "big data" revolution I think we need to focus on the scientific problems and let methodology come second. That requires a rethinking of what it means to be statistics. Things like parallel computing, data munging, reproducibility, and software development have to be accepted as equally important to methods development.

The good news is that there is plenty of room for statisticians to bring our unique skills in dealing with uncertainty to these new problems; but we will only get a seat at the table if we are willing to deal with the mess that comes with doing real science.

I'll close by listing a few things I'd love to see:

  1. A Bioconductor-like project for social network data. Tyler M. and Ali S. have a paper that would make for an awesome package for this project. 
  2. Statistical pre-processing for fMRI and other brain imaging data. Keep an eye on our smart group for that.
  3. Data visualization for translational applications, dealing with all the niceties of human-data interfaces. See healthvis or the stuff Miriah Meyer is doing.
  4. Most importantly, starting with specific, unsolved scientific problems. Seeking novel ways to collect cheap data, and analyzing them, even with known and straightforward statistical methods to deepen our understanding about ourselves or the universe.

Sunday data/statistics link roundup (5/19/2013)

  1. This is a ridiculously good post on 20th versus 21st century problems and the rise of the importance of empirical science. I particularly like the discussion of what it means to be a "solved" problem and how that has changed.
  2. A discussion in Science about the (arguably) most important statistics among academics, the impact factor and h-index. This comes on the heels of the San Francisco Declaration on Research Assessment. I like the idea that we should focus on evaluating science for its own merit rather than focusing on summaries like impact factor. But I worry that the "gaming" people are worried about with quantitative numbers like IF will be replaced with "politicking" if it becomes too qualitative. (via Rafa)
  3. A write-up about a survey in Britain that suggests people don't believe statistics (surprise!). I think this is symptomatic of a bigger issue which is being raised over and over. In the era when scientific problems don't have deterministic solutions, how do we determine if a problem has been solved? There is no good answer for this yet and it threatens to undermine a major fraction of the scientific enterprise going forward.
  4. Businesses are confusing data analysis and big data. This is so important and true. Big data infrastructure is often critical for creating/running data products. But discovering new ideas from data often happens on much smaller data sets with good intuition and interactive data analysis.
  5. Really interesting article about how the baseball card numbering system matters and how changing it can upset collectors (via Chris V.).

When does replication reveal fraud?

Here's a little thought experiment for your weekend pleasure. Consider the following:

Joe Scientist decides to conduct a study (call it Study A) to test the hypothesis that a parameter D > 0 vs. the null hypothesis that D = 0. He designs a study, collects some data, conducts an appropriate statistical analysis and concludes that D > 0. This result is published in the Journal of Awesome Results along with all the details of how the study was done.

Jane Scientist finds Joe's study very interesting and tries to replicate his findings. She conducts a study (call it Study B) that is similar to Study A but completely independent of it (and does not communicate with Joe). In her analysis she does not find strong evidence that D > 0 and concludes that she cannot rule out the possibility that D = 0. She publishes her findings in the Journal of Null Results along with all the details.

From these two studies, which of the following conclusions can we make?

  1. Study A is obviously a fraud. If the truth were that D > 0, then Jane should have concluded that D > 0 in her independent replication.
  2. Study B is obviously a fraud. If Study A were conducted properly, then Jane should have reached the same conclusion.
  3. Neither Study A nor Study B was a fraud, but the result for Study A was a Type I error, i.e. a false positive.
  4. Neither Study A nor Study B was a fraud, but the result for Study B was a Type II error, i.e. a false negative.

I realize that there are a number of subtle details concerning why things might happen but I've purposely left them out. My question is, based on the information that you actually have about the two studies, what would you consider to be the most likely case? What further information would you like to know beyond what was given here?
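One way to get a feel for options 3 and 4 is to simulate how often two honest, independent studies disagree. Everything specific below is an assumption I am adding, not part of the thought experiment: both studies use a one-sided, one-sample t-test with 20 observations, unit standard deviation, a 0.05 significance level, and, when the effect is real, D = 0.4.

```r
## Assumed design for both studies (not specified in the thought experiment)
set.seed(99)
n <- 20
alpha <- 0.05

## Estimated probability that a single honest study concludes D > 0
rejectRate <- function(D, nsim = 5000) {
  mean(replicate(nsim, {
    x <- rnorm(n, mean = D, sd = 1)
    t.test(x, alternative = "greater")$p.value < alpha
  }))
}

typeI <- rejectRate(0)      # false positive rate when D = 0
power <- rejectRate(0.4)    # power when D = 0.4

## Chance of "Study A positive, Study B not" with no fraud, under each truth
print(c(nullTrue = typeI * (1 - typeI), effectReal = power * (1 - power)))
```

Under these assumed numbers the pattern "A finds it, B does not" is several times more likely when the effect is real but the studies are modestly powered than when D = 0, so the sample sizes and power of the two studies are exactly the kind of further information worth asking for.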


The bright future of applied statistics

In 2013, the Committee of Presidents of Statistical Societies (COPSS) celebrates its 50th Anniversary. As part of its celebration, COPSS will publish a book, with contributions from past recipients of its awards, titled "Past, Present and Future of Statistical Science". Below is my contribution titled The bright future of applied statistics.

When I was asked to contribute to this issue, titled Past, Present, and Future of Statistical Science, I contemplated my career while deciding what to write about. One aspect that stood out was how much I benefited from the right circumstances. I came to one clear conclusion: it is a great time to be an applied statistician. I decided to describe the aspects of my career that I have thoroughly enjoyed in the past and present and explain why this has led me to believe that the future is bright for applied statisticians.

I became an applied statistician while working with David Brillinger on my PhD thesis. When searching for an advisor I visited several professors and asked them about their interests. David asked me what I liked and all I came up with was "I don't know. Music?", to which he responded "That's what we will work on". Apart from the necessary theorems to get a PhD from the Statistics Department at Berkeley, my thesis summarized my collaborative work with researchers at the Center for New Music and Audio Technology. The work involved separating and parameterizing the harmonic and non-harmonic components of musical sound signals [5]. The sounds had been digitized into data. The work was indeed fun, but I also had my first glimpse into the incredible potential of statistics in a world becoming more and more data-driven.

Despite having expertise only in music, and a thesis that required a CD player to hear the data, fitted models and residuals (http://www.biostat.jhsph.edu/~ririzarr/Demo/index.html), I was hired by the Department of Biostatistics at Johns Hopkins School of Public Health. Later I realized what was probably obvious to the School’s leadership: that regardless of the subject matter of my thesis, my time series expertise could be applied to several public health applications [821]. The public health and biomedical challenges surrounding me were simply too hard to resist and my new department knew this. It was inevitable that I would quickly turn into an applied Biostatistician.

Since the day that I arrived at Hopkins 15 years ago, Scott Zeger, the department chair, fostered and encouraged faculty to leverage their statistical expertise to make a difference and to have an immediate impact in science. At that time, we were in the midst of a measurement revolution that was transforming several scientific fields into data-driven ones. By being located in a School of Public Health and next to a medical school, we were surrounded by collaborators working in such fields. These included environmental science, neuroscience, cancer biology, genetics, and molecular biology. Much of my work was motivated by collaborations with biologists who, for the first time, were collecting large amounts of data. Biology was changing from a data-poor discipline to a data-intensive one.

A specific example came from the measurement of gene expression. Gene expression is the process where DNA, the blueprint for life, is copied into RNA, the templates for the synthesis of proteins, the building blocks for life. Before microarrays were invented in the 1990s, the analysis of gene expression data amounted to spotting black dots on a piece of paper (see Figure 1A below). With microarrays, this suddenly changed to sifting through tens of thousands of numbers (see Figure 1B). Biologists went from using their eyes to categorize results to having thousands (and now millions) of measurements per sample to analyze. Furthermore, unlike genomic DNA, which is static, gene expression is a dynamic quantity: different tissues express different genes at different levels and at different times. The complexity was exacerbated by unpolished technologies that made measurements much noisier than anticipated. This complexity and level of variability made statistical thinking an important aspect of the analysis. The biologists who used to say, "if I need statistics, the experiment went wrong" were now seeking out our help. The results of these collaborations have led to, among other things, the development of breast cancer recurrence gene expression assays making it possible to identify patients at risk of distant recurrence following surgery [9].

Figure 1: Illustration of gene expression data before and after microarrays.

When biologists at Hopkins first came to our department for help with their microarray data, Scott put them in touch with me because I had experience with (what was then) large datasets (digitized music signals are represented by 44,100 points per second). The more I learned about the scientific problems and the more data I explored, the more motivated I became. The potential for statisticians to have an impact in this nascent field was clear, and my department was encouraging me to take the plunge. This institutional encouragement and support was crucial, as successfully working in this field made it harder to publish in the mainstream statistical journals, an accomplishment that had traditionally been heavily weighted in the promotion process. The message was clear: having an immediate impact on specific scientific fields would be rewarded as much as mathematically rigorous methods with general applicability.

As with my thesis applications, it was clear that to solve some of the challenges posed by microarray data I would have to learn all about the technology. For this I organized a sabbatical with Terry Speed's group in Melbourne where they helped me accomplish this goal. During this visit I reaffirmed my preference for attacking applied problems with simple statistical methods, as opposed to overcomplicated ones or developing new techniques. Learning that deciphering clever ways of putting the existing statistical toolbox to work was good enough for an accomplished statistician like Terry gave me the necessary confidence to continue working this way. More than a decade later this continues to be my approach to applied statistics. This approach has been instrumental for some of my current collaborative work. In particular, it led to important new biological discoveries made together with Andy Feinberg’s lab [7].

During my sabbatical we developed preliminary solutions that improved precision and aided in the removal of systematic biases for microarray data [6]. I was aware that hundreds, if not thousands, of other scientists were facing the same problematic data and were searching for solutions. Therefore I was also thinking hard about ways in which I could share whatever solutions I developed with others. During this time I received an email from Robert Gentleman asking if I was interested in joining a new software project for the delivery of statistical methods for genomics data. This new collaboration eventually became the Bioconductor project (http://www.bioconductor.org), which to this day continues to grow its user and developer base [4]. Bioconductor was the perfect vehicle for having the impact that my department had encouraged me to seek. With Ben Bolstad and others we wrote an R package that has been downloaded tens of thousands of times [3]. Without the availability of software, the statistical method would not have received nearly as much attention. This lesson served me well throughout my career, as developing software packages has greatly helped disseminate my statistical ideas. The fact that my department and school rewarded software publications provided important support.

The impact statisticians have had in genomics is just one example of our field's accomplishments in the 21st century. In academia, the number of statisticians becoming leaders in fields like environmental sciences, human genetics, genomics, and social sciences continues to grow. Outside of academia, Sabermetrics has become a standard approach in several sports (not just baseball) and inspired the Hollywood movie Moneyball. A PhD statistician led the team that won the Netflix million dollar prize (http://www.netflixprize.com/). Nate Silver (http://mashable.com/2012/11/07/nate-silver-wins/) proved the pundits wrong by once again using statistical models to predict election results almost perfectly. R has become a widely used programming language. It is no surprise that Statistics majors at Harvard have more than quadrupled since 2000 (http://nesterko.com/visuals/statconcpred2012-with-dm/) and that statistics MOOCs are among the most popular (http://edudemic.com/2012/12/the-11-most-popular-open-online-courses/).

The unprecedented advance in digital technology during the second half of the 20th century has produced a measurement revolution that is transforming science. Scientific fields that have traditionally relied upon simple data analysis techniques have been turned on their heads by these technologies. Furthermore, advances such as these have brought about a shift from hypothesis-driven to discovery-driven research. However, interpreting information extracted from these massive and complex datasets requires sophisticated statistical skills, as one can easily be fooled by patterns that arise by chance. This has greatly elevated the importance of our discipline in biomedical research.
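
To make the point about chance patterns concrete, here is a minimal sketch in Python. It is an illustration only, not code from any of the analyses described here, and every number in it (10,000 "genes", 5 samples per group, the 5% cutoff) is an assumption chosen for the example. It generates pure noise for two groups, runs an ordinary two-sample t-test on each simulated gene, and counts how many look "significant" even though no real difference exists.

```python
import random
import statistics


def t_statistic(a, b):
    """Two-sample t-statistic with pooled variance (equal group sizes assumed)."""
    n = len(a)
    pooled_var = (statistics.variance(a) + statistics.variance(b)) / 2
    return (statistics.mean(a) - statistics.mean(b)) / (pooled_var * 2 / n) ** 0.5


def count_chance_hits(n_genes=10_000, n_per_group=5, cutoff=2.306, seed=1):
    """Count pure-noise 'genes' whose |t| exceeds the cutoff.

    cutoff=2.306 is the two-sided 5% critical value of a t distribution
    with 8 degrees of freedom (5 + 5 - 2), matching the default group size.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_genes):
        # Both groups come from the same distribution: there is no true signal.
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if abs(t_statistic(a, b)) > cutoff:
            hits += 1
    return hits


hits = count_chance_hits()
print(f"{hits} of 10000 pure-noise genes pass the 5% cutoff")
```

Roughly 5% of the simulated genes should pass the cutoff, which is several hundred apparent "discoveries" in data that contain no signal at all. Guarding against this kind of mistake is one reason statistical thinking matters so much with high-throughput measurements.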

I think that the data revolution is just getting started. Datasets are currently being, or have already been, collected that contain, hidden in their complexity, important truths waiting to be discovered. These discoveries will increase the scientific understanding of our world. Statisticians should be excited and ready to play an important role in the new scientific renaissance driven by the measurement revolution.

Bibliography

[1]   NE Crone, L Hao, J Hart, D Boatman, RP Lesser, R Irizarry, and
B Gordon. Electrocorticographic gamma activity during word production
in spoken and sign language. Neurology, 57(11):2045–2053, 2001.

[2]   Janet A DiPietro, Rafael A Irizarry, Melissa Hawkins, Kathleen A
Costigan, and Eva K Pressman. Cross-correlation of fetal cardiac and
somatic activity as an indicator of antenatal neural development. American
journal of obstetrics and gynecology, 185(6):1421–1428, 2001.

[3]   Laurent Gautier, Leslie Cope, Benjamin M Bolstad, and Rafael A
Irizarry. affy: analysis of Affymetrix GeneChip data at the probe level.
Bioinformatics, 20(3):307–315, 2004.

[4]   Robert C Gentleman, Vincent J Carey, Douglas M Bates, Ben Bolstad,
Marcel Dettling, Sandrine Dudoit, Byron Ellis, Laurent Gautier, Yongchao
Ge, Jeff Gentry, et al. Bioconductor: open software development for
computational biology and bioinformatics. Genome biology, 5(10):R80, 2004.

[5]   Rafael A Irizarry. Local harmonic estimation in musical sound signals.
Journal of the American Statistical Association, 96(454):357–367, 2001.

[6]   Rafael A Irizarry, Bridget Hobbs, Francois Collin, Yasmin D
Beazer-Barclay, Kristen J Antonellis, Uwe Scherf, and Terence P Speed.
Exploration, normalization, and summaries of high density oligonucleotide
array probe level data. Biostatistics, 4(2):249–264, 2003.

[7]   Rafael A Irizarry, Christine Ladd-Acosta, Bo Wen, Zhijin Wu, Carolina
Montano, Patrick Onyango, Hengmi Cui, Kevin Gabo, Michael Rongione,
Maree Webster, et al. The human colon cancer methylome shows similar
hypo- and hypermethylation at conserved tissue-specific CpG island shores.
Nature genetics, 41(2):178–186, 2009.

[8]   Rafael A Irizarry, Clarke Tankersley, Robert Frank, and Susan
Flanders. Assessing homeostasis through circadian patterns. Biometrics,
57(4):1228–1237, 2001.

[9]   Laura J van’t Veer, Hongyue Dai, Marc J Van De Vijver, Yudong D
He, Augustinus AM Hart, Mao Mao, Hans L Peterse, Karin van der Kooy,
Matthew J Marton, Anke T Witteveen, et al. Gene expression profiling
predicts clinical outcome of breast cancer. Nature, 415(6871):530–536, 2002.
