
09
Apr

A blessing of dimensionality often observed in high-dimensional data sets

Tidy data sets have one observation per row and one variable per column.  Using this definition, big data sets can be either:

  1. Wide - a wide data set has a large number of measurements per observation, but fewer observations. This type of data set is typical in neuroimaging, genomics, and other biomedical applications.
  2. Tall - a tall data set has a large number of observations, but fewer measurements. This is the typical setting in a large clinical trial or in a basic social network analysis.

The curse of dimensionality tells us that estimating some quantities gets harder as the number of dimensions of a data set increases - as the data gets taller or wider. An example of this was nicely illustrated by my student Prasad (although it looks like his quota may be up on RStudio).

For wide data sets there is also a blessing of dimensionality. The basic reason for the blessing of dimensionality is that:

No matter how many new measurements you take on a small set of observations, the number of observations and all of their characteristics are fixed.

As an example, suppose that we make measurements on 10 people. We start out by making one measurement (blood pressure), then another (height), then another (hair color) and we keep going and going until we have one million measurements on those same 10 people. The blessing occurs because the measurements on those 10 people will all be related to each other. If 5 of the people are women and 5 are men, then any measurement that has a relationship with sex will be highly correlated with any other measurement that has a relationship with sex. So by knowing one small bit of information, you can learn a lot about many of the different measurements.
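
To make this concrete, here is a minimal R sketch (my own toy example, not from the post): two measurements that each depend on sex end up correlated with each other, even though they were generated independently given sex.

## Toy illustration: 10 people, 5 women and 5 men, and two measurements
## that each have a relationship with sex. Knowing one tells you a lot
## about the other because the underlying people are the same.
set.seed(1)
sex <- rep(c(0, 1), each = 5)        # 5 women, 5 men
meas1 <- 2 * sex + rnorm(10)         # a measurement related to sex
meas2 <- -3 * sex + rnorm(10)        # another measurement related to sex
cor(meas1, meas2)                    # strongly (negatively) correlated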

This blessing of dimensionality is the key idea behind many of the statistical approaches to wide data sets whether it is stated explicitly or not. I thought I'd make a very short list of some of these ideas:

1. Idea: De-convolving mixed observations from high-dimensional data. 

How the blessing plays a role: The measurements for each observation are assumed to be a mixture of values measured from different observation types. The proportion of each observation type is assumed to be fixed across measurements, so you can take advantage of the multiple measurements to estimate the mixing percentage and perform the deconvolution. (Wenyi Wang came and gave an excellent seminar on this idea at JHU a couple of days ago, which inspired this post).
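
A minimal sketch of the general idea (my own toy version, not the specific method from the seminar): because the mixing proportion is shared across all measurements, many measurements pin it down, and with known pure profiles it can be estimated by regression.

## Toy deconvolution: each mixed sample is alpha*pure1 + (1-alpha)*pure2 + noise.
set.seed(1)
G <- 5000                                   # number of measurements (e.g. genes)
pure1 <- rnorm(G, mean = 8, sd = 2)         # known profile for observation type 1
pure2 <- rnorm(G, mean = 6, sd = 2)         # known profile for observation type 2
alpha <- 0.3                                # true (unknown) mixing proportion
mixed <- alpha * pure1 + (1 - alpha) * pure2 + rnorm(G, sd = 0.5)
fit <- lm(mixed ~ 0 + pure1 + pure2)        # regression without an intercept
coef(fit)                                   # estimates are close to 0.3 and 0.7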

2. Idea: The two groups model for false discovery rates.

How the blessing plays a role: The models assume that a hypothesis test is performed for each observation and that the probability any observation is drawn from the null, along with the null and alternative distributions, is common across observations. If the null is assumed known, then it is possible to use the known null distribution to estimate the common probability that an observation is drawn from the null.
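
Here is a minimal sketch of that estimation step on simulated p-values (a Storey-style estimator; my own illustration, not any single paper's implementation).

## If null p-values are uniform, p-values above 0.5 come almost entirely from
## the null, so the proportion of nulls (pi0) can be estimated from the upper tail.
set.seed(1)
m <- 10000
pi0_true <- 0.8
p_null <- runif(round(m * pi0_true))              # nulls: uniform p-values
p_alt  <- rbeta(m - round(m * pi0_true), 1, 20)   # alternatives: pile up near 0
p <- c(p_null, p_alt)
pi0_hat <- mean(p > 0.5) / 0.5                    # rescaled upper-tail fraction
pi0_hat                                           # close to the true 0.8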

 

3. Idea: Empirical Bayes variance shrinkage for linear models

How the blessing plays a role:  A linear model is fit for each observation and the means and variances of the log ratios calculated from the model are assumed to follow a common distribution across observations. The method estimates the hyper-parameters of these common distributions and uses them to adjust any individual measurement's estimates.
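
The limma package is a well-known implementation of this idea; here is a minimal sketch on simulated data (the dimensions and design below are made up for illustration).

## Fit a linear model per gene, then shrink the gene-wise variances toward
## a common prior estimated from all genes (empirical Bayes moderation).
library(limma)
set.seed(1)
expr <- matrix(rnorm(1000 * 6), nrow = 1000)            # 1000 genes, 6 samples
group <- factor(rep(c("control", "treated"), each = 3))
design <- model.matrix(~ group)
fit <- eBayes(lmFit(expr, design))
topTable(fit, coef = 2)               # moderated t-statistics for the group effect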

 

4. Idea: Surrogate variable analysis

How the blessing plays a role:  Each observation is assumed to be influenced by a single variable of interest (a primary variable) and multiple unmeasured confounders. Since the observations are fixed, the values of the unmeasured confounders are the same for each measurement and a supervised PCA can be used to estimate surrogates for the confounders. (see my JHU job talk for more on the blessing)
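
The sva package implements this idea; below is a minimal sketch on simulated data with one hidden batch variable (the data and variable names are made up for illustration).

## Estimate surrogate variables for an unmeasured confounder so they can be
## included as covariates in downstream models.
library(sva)
set.seed(1)
batch <- rep(c(-1, 1), 10)                        # unmeasured confounder (20 samples)
edata <- matrix(rnorm(1000 * 20), nrow = 1000) +
  outer(rnorm(1000), batch)                       # features influenced by the confounder
primary <- factor(rep(c("A", "B"), each = 10))    # the primary variable of interest
mod  <- model.matrix(~ primary)
mod0 <- model.matrix(~ 1, data = data.frame(primary))
svobj <- sva(edata, mod, mod0)                    # estimate the surrogate variables
svobj$n.sv                                        # number of surrogates detected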

 

The blessing of dimensionality I'm describing here is related to the idea that Andrew Gelman refers to in this 2004 post. Basically, since an increasingly large number of measurements is made on the same observations, there is an inherent structure to those observations. If you take advantage of that structure, then as the dimensionality of your problem increases you actually get better estimates of the structure in your high-dimensional data - a nice blessing!

09
Apr

How to Get Ahead in Academia

This video on how to make it in academia was produced over 10 years ago by Steven Goodman for the ENAR Junior Researchers Workshop. Now the whole world can benefit from its wisdom.

The movie features current and former JHU Biostatistics faculty, including Francesca Dominici, Giovanni Parmigiani, Scott Zeger, and Tom Louis. You don't want to miss Scott Zeger's secret formula for getting promoted!

02
Apr

Why You Need to Study Statistics

The American Statistical Association is continuing its campaign to get you to study statistics, if you haven't already. I have to agree with them that being a statistician is a pretty good job. Their latest video highlights a wide range of statisticians working in industry, government, and academia. You can check it out here:

26
Mar

Teaser trailer for the Genomic Data Science Specialization on Coursera

 

We have been hard at work in the studio putting together our next specialization to launch on Coursera. It will be called the "Genomic Data Science Specialization" and includes a spectacular lineup of instructors: Steven Salzberg, Ela Pertea, James Taylor, Liliana Florea, Kasper Hansen, and me. The specialization will cover command line tools, statistics, Galaxy, Bioconductor, and Python. There will be a capstone course at the end of the sequence featuring an in-depth genomic analysis. If you are a grad student, postdoc, or principal investigator in a group that does genomics, this specialization is for you. If you are looking to transition into one of the hottest areas of research with the new precision medicine initiative, this is for you. Get pumped and share the teaser trailer with your friends!

24
Mar

Introduction to Bioconductor HarvardX MOOC starts this Monday March 30

Bioconductor is one of the most widely used open source toolkits for biological high-throughput data. In this four week course, co-taught with Vince Carey and Mike Love, we will introduce you to Bioconductor's general infrastructure and then focus on two specific technologies: next generation sequencing and microarrays. The lectures and assessments will be annotated in case you want to focus on only one of these two technologies, although if you plan to be a bioinformatician we recommend you learn both.

Topics covered include:

  • A short introduction to molecular biology and measurement technology
  • An overview of how to leverage the platform and genome annotation packages and experimental archives
  • GenomicRanges: the infrastructure for storing, manipulating and analyzing next generation sequencing data (see the short example after this list)
  • Parallel computing and cloud concepts
  • Normalization, preprocessing and bias correction
  • Statistical inference in practice: including hierarchical models and gene set enrichment analysis
  • Building statistical analysis pipelines of genome-scale assays, including the creation of reproducible reports
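
As a taste of the GenomicRanges item above, here is a minimal example with toy coordinates (my own illustration, not course material).

## Represent a few genomic intervals and query overlaps.
library(GenomicRanges)
gr <- GRanges(seqnames = c("chr1", "chr1", "chr2"),
              ranges   = IRanges(start = c(100, 500, 200), width = 50),
              strand   = c("+", "-", "+"))
gr                                                       # three genomic intervals
subsetByOverlaps(gr, GRanges("chr1", IRanges(90, 120)))  # intervals overlapping chr1:90-120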

Throughout the class we will be using data examples from both next generation sequencing and microarray experiments.

We will assume basic knowledge of Statistics and R.

For more information visit the course website.

19
Mar

A surprisingly tricky issue when using genomic signatures for personalized medicine

My student Prasad Patil has a really nice paper that just came out in Bioinformatics (preprint in case paywalled). The paper is about a surprisingly tricky normalization issue with genomic signatures. Genomic signatures are basically statistical/machine learning functions applied to the measurements for a set of genes to predict how long patients will survive, or how they will respond to therapy. The issue is that usually when building and applying these signatures, people normalize across samples in the training and testing set.

An example of this normalization is to mean-center the measurements for each gene in the testing/application stage, then apply the prediction rule. The problem is that if you use a different set of samples when calculating the mean you can get a totally different prediction function. The basic problem is illustrated in this graphic.
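
Here is a toy sketch of the problem (my own illustration, not the paper's actual analysis): the same patient's value is centered against two different test sets and the prediction flips.

## One gene, one fixed patient value, and a rule "predict positive if the
## centered value is > 0". Changing who else is in the test set changes the
## mean used for centering, and therefore the prediction.
set.seed(1)
patient <- 5                                   # the patient's (fixed) measurement
test_A  <- c(patient, rnorm(50, mean = 2))     # test set dominated by low values
test_B  <- c(patient, rnorm(50, mean = 8))     # test set dominated by high values
(patient - mean(test_A)) > 0                   # TRUE: predicted positive
(patient - mean(test_B)) > 0                   # FALSE: predicted negative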

 

[Figure: schematic illustration of the normalization problem]

 

This seems like a pretty esoteric statistical issue, but it turns out that this one simple normalization problem can dramatically change the results of the predictions. In particular, we show that the predictions for the same patient, with the exact same data, can change dramatically if you just change the subpopulations of patients within the testing set. In this plot, Prasad made predictions for the exact same set of patients two times when the patient population varied in ER status composition. As many as 30% of the predictions were different for the same patient with the same data if you just varied who they were being predicted with.

[Figure: predictions for the same patients under two test sets with different ER status compositions]

 

This paper highlights how tricky statistical issues can slow down the process of translating ostensibly really useful genomic signatures into clinical practice and lends even more weight to the idea that precision medicine is a statistical field.

18
Mar

A simple (and fair) way all statistics journals could drive up their impact factor.

Hypothesis:

If every method in every stats journal was implemented in a corresponding R package (easy), was required to have a companion document that was a tutorial on how to use the software (easy), included a reference for how to cite the paper if you used the software (easy), and the paper/tutorial was posted to the relevant message boards for the communities of interest (easy), that journal would see a dramatic bump in its impact factor.
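
For the citation piece, R already has the machinery: a package can ship an inst/CITATION file that users see by calling citation(). A hypothetical sketch (the package name, title, author, and journal are all made up):

## Hypothetical inst/CITATION file for a package implementing a published method.
citHeader("To cite the newmethod package in publications, use:")
bibentry(
  bibtype = "Article",
  title   = "A new method for something important",   # placeholder title
  author  = person("Jane", "Doe"),                    # placeholder author
  journal = "A Statistics Journal",                   # placeholder journal
  year    = "2015",
  textVersion = "Doe J (2015). A new method for something important."
)

Users would then run citation("newmethod") to get the reference in citable form.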

17
Mar

Data science done well looks easy - and that is a big problem for data scientists

Data science has a ton of different definitions. For the purposes of this post I'm going to use the definition we used when creating our Data Science program online:

Data science is the process of formulating a quantitative question that can be answered with data, collecting and cleaning the data, analyzing the data, and communicating the answer to the question to a relevant audience.

In general the data science process is iterative and the different components blend together a little bit. But for simplicity let's discretize the tasks into the following 7 steps:

  1. Define the question of interest
  2. Get the data
  3. Clean the data
  4. Explore the data
  5. Fit statistical models
  6. Communicate the results
  7. Make your analysis reproducible

A good data science project answers a real scientific or business analytics question. In almost all of these projects the vast majority of the analyst's time is spent on getting and cleaning the data (steps 2-3) and communication and reproducibility (steps 6-7). In most cases, if the data scientist has done her job right the statistical models don't need to be incredibly complicated to identify the important relationships the project is trying to find. In fact, if a complicated statistical model seems necessary, it often means that you don't have the right data to answer the question you really want to answer. One option is to spend a huge amount of time trying to tune a statistical model to answer that question, but serious data scientists usually instead try to go back and get the right data.

The result of this process is that most well executed and successful data science projects don't (a) use super complicated tools or (b) fit super complicated statistical models. The characteristics of the most successful data science projects I've evaluated or been a part of are: (a) a laser focus on solving the scientific problem, (b) careful and thoughtful consideration of whether the data is the right data and whether there are any lurking confounders or biases and (c) relatively simple statistical models applied and interpreted skeptically.

It turns out doing those three things is actually surprisingly hard and very, very time consuming. It is my experience that data science projects take a solid 2-3 times as long to complete as a project in theoretical statistics. The reason is that inevitably the data are a mess and you have to clean them up, then you find out the data aren't quite what you wanted to answer the question, so you go find a new data set and clean it up, etc. After a ton of work like that, you have a nice set of data to which you fit simple statistical models and then it looks super easy to someone who either doesn't know about the data collection and cleaning process or doesn't care.

This poses a major public relations problem for serious data scientists. When you show someone a good data science project they almost invariably think "oh that is easy" or "that is just a trivial statistical/machine learning model" and don't see all of the work that goes into solving the real problems in data science. A concrete example of this is in academic statistics. It is customary for people to show theorems in their talks and maybe even some of the proof. This gives people working on theoretical projects an opportunity to "show their stuff" and demonstrate how good they are. The equivalent for a data scientist would be showing how they found and cleaned multiple data sets, merged them together, checked for biases, and arrived at a simplified data set. Showing the "proof" would be equivalent to showing how they matched IDs. These things often don't look nearly as impressive in talks, particularly if the audience doesn't have experience with how incredibly delicate real data analysis is. I imagine versions of this problem play out in industry as well (candidate X did a good analysis but it wasn't anything special, candidate Y used Hadoop to do BIG DATA!).

The really tricky twist is that bad data science looks easy too. You can scrape a data set off the web and slap a machine learning algorithm on it no problem. So how do you judge whether a data science project is really "hard" and whether the data scientist is an expert? Just like with anything, there is no easy shortcut to evaluating data science projects. You have to ask questions about the details of how the data were collected, what kind of biases might exist, why they picked one data set over another, etc.  In the meantime, don't be fooled by what looks like simple data science - it can often be pretty effective.

 

Editor's note: If you like this post, you might like my pay-what-you-want book Elements of Data Analytic Style: https://leanpub.com/datastyle

 

14
Mar

π day special: How to use Bioconductor to find empirical evidence in support of π being a normal number

Editor's note: Today, 3/14/15, at some point between 9:26:53 and 9:26:54, it was the most π day of them all. Below is a repost from last year.

Happy π day everybody!

I wanted to write some simple code (included below) to test the parallelization capabilities of my new cluster. So, in honor of π day, I decided to check for evidence that π is a normal number. A normal number is a real number whose infinite sequence of digits has the property that any given m-digit pattern appears with limiting frequency 10^-m. For example, using the Poisson approximation, we can predict that the pattern "123456789" should show up between 0 and 3 times in the first billion digits of π (it actually shows up twice, starting at the 523,551,502-th and 773,349,079-th decimal places).

To test our hypothesis, let Y1, ..., Y100 be the counts of "00", "01", ..., "99" in the first billion digits of π. If π is in fact normal then the Ys should be approximately IID binomial with N = 1 billion and p = 0.01. In the qq-plot below I show the Z-scores (Y - 10,000,000) / √9,900,000, which appear to follow a normal distribution as predicted by our hypothesis. Further evidence for π being normal is provided by repeating this experiment for 3, 4, 5, 6, and 7 digit patterns (for 5, 6 and 7 I sampled 10,000 patterns). Note that we can perform a chi-square test for the uniform distribution as well. For patterns of size 1, 2, 3 the p-values were 0.84, 0.89, 0.92, and 0.99.

[Figure: qq-plots of observed z-scores for the digit-pattern counts]

Another test we can perform is to divide the 1 billion digits into 100,000 non-overlapping segments of length 10,000. The vector of counts for any given pattern should also be binomial. Below I also include these qq-plots.

[Figure: qq-plots of pattern counts within non-overlapping segments]

These observed counts should also be independent, and to explore this we can look at autocorrelation plots:

[Figure: autocorrelation plots of the segment counts]

To do this in about an hour and with just a few lines of code (included below), I used the Bioconductor Biostrings package to match strings and the foreach function to parallelize.


library(Biostrings)
library(doParallel)
registerDoParallel(cores = 48)

## Read in the first billion digits of pi
x <- scan("pi-billion.txt", what = "c")
x <- substr(x, 3, nchar(x)) ## remove the leading "3."
x <- BString(x)
n <- length(x)

## qq-plots for 2- to 7-digit patterns (10,000 sampled patterns for d >= 5)
par(mfrow = c(2, 3))
for (d in 2:7) {
  if (d < 5) {
    patterns <- sprintf(paste0("%0", d, "d"), seq(0, 10^d - 1))
  } else {
    patterns <- sprintf(paste0("%0", d, "d"), sample(10^d, 10^4) - 1)
  }
  res <- foreach(pat = patterns, .combine = c) %dopar% countPattern(pat, x)
  p <- 1/(10^d) ## expected frequency of a d-digit pattern
  z <- (res - n*p) / sqrt(n*p*(1-p))
  qqnorm(z, xlab = "Theoretical quantiles", ylab = "Observed z-scores",
         main = paste(d, "digits"))
  abline(0, 1)
  ## chi-square test for uniformity (correction: original post had length(res))
  if (d < 5) print(1 - pchisq(sum((res - n*p)^2/(n*p)), length(res) - 1))
}

### Now count single-digit patterns in non-overlapping segments
d <- 1
m <- 10^5
patterns <- sprintf(paste0("%0", d, "d"), seq(0, 10^d - 1))
res <- foreach(pat = patterns, .combine = cbind) %dopar% {
  tmp <- start(matchPattern(pat, x))
  tmp2 <- floor((tmp - 1)/m)
  tabulate(tmp2 + 1, nbins = n/m)
}

## qq-plots of the segment counts
par(mfrow = c(2, 5))
p <- 1/(10^d)
for (i in 1:ncol(res)) {
  z <- (res[, i] - m*p) / sqrt(m*p*(1-p))
  qqnorm(z, xlab = "Theoretical quantiles", ylab = "Observed z-scores",
         main = paste(i - 1))
  abline(0, 1)
}

## ACF plots
par(mfrow = c(2, 5))
for (i in 1:ncol(res)) acf(res[, i])

NB: A normal number has the above stated property in any base. The examples above are for base 10.

13
Mar

De-weaponizing reproducibility

A couple of weeks ago Roger and I went to a conference on statistical reproducibility held at the National Academy of Sciences. The discussion was pretty wide ranging and I love that the thinking about reproducibility is coming back to statistics. There was pretty widespread support for the idea that prevention is the right way to approach reproducibility.

It turns out I was the last speaker of the whole conference. This is an unenviable position to be in with so many bright folks speaking first, since they covered a huge amount of what I wanted to say. My talk focused on three key points:
  1. The tools for reproducibility already exist, the barrier isn't tools
  2. We need to de-weaponize reproducibility
  3. Prevention is the right approach to reproducibility

 

In terms of the first point, tools like IPython, knitr, and Galaxy can be used to make all but the absolute largest analyses reproducible right now. Our group does this all the time with our papers and so do many others. The problem isn't a lack of tools.

Speaking to point two, I think many people would agree that part of the issue is culture change. One issue that is increasingly concerning to me is the "weaponization" of reproducibility. I have been noticing that some of us (like me, my students, other folks at JHU, and lots of particularly junior computational people elsewhere) are trying really hard to be reproducible. Most of the time this results in really positive reactions from the community. But when a co-author of mine and I wrote that paper about the science-wise false discovery rate, one of the discussants used our code (great), improved on it (great), identified a bug (great), and then did his level best to humiliate us both in front of the editor and the general public because of that bug (not so great).

I have seen this happen several times. Most of the time if a paper is reproducible the authors get a pat on the back and their code is either ignored, or used in a positive way. But for high-profile and important problems, people largely use reproducibility to:
  1.  Impose regulatory hurdles in the short term while people transition to reproducibility. One clear example of this is the Secret Science Reform Act which is a bill that imposes strict reproducibility conditions on all science before it can be used as evidence for regulation.
  2. Humiliate people who aren't good coders or who make mistakes in their code. This is what happened in my paper when I produced reproducible code for my analysis, but has also happened to other people.
  3. Take advantage of people's code to plagiarize/straight up steal work. I have stories about this I'd rather not put on the internet.

 

Of the three, I feel like (1) and (2) are the most common. Plagiarism and scooping by theft I think are actually relatively rare based on my own anecdotal experience. But I think that the "weaponization" of reproducibility to block regulation or to humiliate folks who are new to computational sciences is more common than I'd like it to be. Until reproducibility is the standard for everyone - which I think is possible now and will happen as the culture changes -  the people who are the early adopters are at risk of being bludgeoned with their own reproducibility. As a community, if we want widespread reproducibility adoption we have to be ferocious about not allowing this to happen.