For genomic statisticians, writing good software can have more impact than publishing in high-impact journals

Every once in a while we see computational papers published in science journals with high impact factors. Genomics-related methods appear quite often in these journals. Several of my junior colleagues express frustration that all their papers get rejected from these journals. I tell them that the same is true for most of my papers and remind them of these examples:

Method                 | Journal        | Year | # Citations
PLINK                  | AJHG           | 2007 | 6481
Bioconductor           | Genome Biology | 2004 | 5973
RMA                    | Biostatistics  | 2003 | 5674
limma                  | SAGMB          | 2004 | 5637
quantile normalization | Bioinformatics | 2003 | 4646
Bowtie                 | Genome Biology | 2009 | 3849
BWA                    | Bioinformatics | 2009 | 3327
loess normalization    | NAR            | 2002 | 3313
q-values               | JRSS-B         | 2002 | 2758
TopHat                 | Bioinformatics | 2008 | 1868
vsn                    | Bioinformatics | 2002 | 1398
GCRMA                  | JASA           | 2004 | 1397
MACS                   | Genome Biology | 2008 | 1277
DESeq                  | Genome Biology | 2010 | 1264
CBS                    | Biostatistics  | 2004 | 1051
R/qtl                  | Bioinformatics | 2003 | 1027

Let me know of other examples in the comments.
update: I added one more to the list.

Posted in Uncategorized | 4 Comments

This is how an important scientific debate is being used to stop EPA regulation

Environmental regulation in the United States has protected human health for over 40 years. Since the Clean Air Act was enacted in 1970, levels of outdoor air pollution have dropped dramatically, changing the landscape of once heavily-polluted cities like Los Angeles and Pittsburgh. A 2011 cost-benefit analysis conducted by the U.S. Environmental Protection Agency estimated that the 1990 amendments to the CAA prevented 160,000 deaths and 13 million lost work days in the year 2010 alone. They estimated that the monetary benefits of the CAA were 30 times greater than the costs of implementing the regulations.

The benefits of environmental regulations like the CAA significantly outweigh their costs. But there are still costs, and those costs must be borne by someone. The burden is usually put on the polluters, such as the automobile and power generation industries, which have long fought any notion of air pollution regulation as a threat to their existence. Initially, as air pollution and health studies were still emerging, opponents of regulation often challenged the science itself, claiming flaws in the methodology, the measurements, or the interpretation. But when study after study demonstrated a connection between outdoor air pollution and a variety of health problems, it became increasingly difficult for critics to mount a credible challenge. Lawsuits are another tactic used by industry, with one case brought by the American Trucking Association going all the way to the U.S. Supreme Court.

The latest attack comes from the House of Representatives in the form of the Secret Science Reform Act, or H.R. 4102. In summary, the proposed bill requires that every scientific paper the EPA cites to justify a new rule or regulation be reproducible. What exactly does this mean? To answer that question we need to take a brief diversion into some recent important developments in statistical science.

The idea behind reproducibility is simple. All the data used in a scientific paper and all the computer code used to analyze that data should be made available to other researchers and the public. It may be surprising that much of this data actually isn’t already available. The primary reason most data isn’t available is that, until recently, most people didn’t ask scientists for their data. The data was often small and collected for a specific purpose, so other scientists and the general public just weren’t that interested. If a scientist were interested in checking the truth of a claim, she could simply repeat the experiment in her lab to see if the claim could be replicated.

The nature of science has changed quickly over the last three decades. There has been an explosion of data, fueled by the decreasing cost of data collection technologies and computing power. At the same time, increased access to computing power has let scientists conduct ever more sophisticated analyses of their data. The massive growth in data and the increasing sophistication of the analyses have made communicating what was done in a scientific study more complicated.

The traditional medium of journal publications has proven inadequate for describing the important details of a data analysis. As a result, it has been said that scientific articles are merely the “advertising” for the research that was conducted. The real research is buried in the data and the computer code actually used to compute the results. Journals have traditionally not required that data or computer code be published along with papers, so many important details may be lost, preventing key studies from being fully reproducible.

The explosion of data has also made completely replicating a large study by an independent scientist much more difficult and costly. A large study is expensive to conduct in the first place; there is usually little appetite or funding to repeat it.  The result is that much of published scientific research cannot be reproduced by other scientists because the necessary data and analytic details are not available to others.

The scientific community is currently engaged in a debate over how to improve reproducibility across all of science. You might be tempted to ask, why not just share the data? Even if we could get everyone to agree with that in principle, it’s not clear how to do it.

Imagine if everyone in the U.S. decided we were all going to share our movie collections, and suppose for the sake of this example that the movie industry did not object. How would it work? Numerous questions immediately arise. Where would all these movies be stored? How would they be transferred from one person to another? How would I know what movies everyone else had? If my movies are all on the old DVD format, do I need to convert them to some other format before I can share? My Internet connection is very slow; how can I download a 3-hour HD movie? My mother doesn’t use computers much, but she has a great movie collection that I think others should have access to. What should she do? And who is going to pay for all of this? While each question may have a reasonable answer, it’s not clear what the optimal combination is or how you might scale it to the entire country.

Some of you may recall that there was once a brilliant sharing service that essentially allowed everyone to share their music collections. It was called Napster. Napster solved many of the problems raised above except for one: it failed to survive. So even when a decent solution is found, there’s no guarantee that it will always be there.

As outlandish as this example may seem, minor variations on these exact questions come up when we discuss how to share scientific data. The volume of data being produced today is enormous and making all of it available to everyone is not an easy task. That’s not to say it is impossible. If smart people get together and work constructively, it is entirely possible that a reasonable approach could be found. But at this point, a credible long-term solution has yet to emerge.

This brings us back to the Secret Science Reform Act. The latest tactic by opponents of air quality regulation is to force the EPA to ensure that all of the studies that it cites to support new regulations are reproducible. A cursory reading of the bill gives the impression that the sponsors are genuinely concerned about making science more transparent to the public. But when one reads the language of the bill in the context of ongoing discussions about reproducibility, it becomes clear that the sponsors of the bill have no such goal in mind. The purpose of H.R. 4102 is to prevent the Environmental Protection Agency from proposing new regulations.

The EPA develops rules and regulations on the basis of scientific evidence. For example, the Clean Air Act requires EPA to periodically review the scientific literature for the latest evidence on the health effects of air pollution. The science the EPA considers needs to be published in peer-reviewed journals. This makes the EPA a key consumer of scientific knowledge and it uses this knowledge to make informed decisions about protecting public health. What the EPA is not is a large funder of scientific studies. The entire budget for the Office of Research and Development at EPA is roughly $550 million (fiscal 2014), or less than 2 percent of the budget for the National Institutes of Health (about $30 billion for fiscal 2014). This means EPA has essentially no influence over the scientists behind many of the studies it cites because it funds very few of those studies. The best the EPA can do is politely ask scientists to make their data available. If a scientist refuses, there’s not much the EPA can use as leverage.

The latest controversy to come up involves the Harvard Six Cities study published in 1993. This landmark study found a large difference in mortality rates comparing cities with high and low air pollution, even after adjusting for smoking and other factors. The House committee has been trying to make the data for this study publicly available so that it can ensure that regulations are “backed by good science”. However, the Committee has either forgotten or never knew that this particular study has been fully reproduced by independent investigators. In 2005, independent investigators found that they were “...able to reproduce virtually all of the original numerical results, including the 26 percent increase in all-cause mortality in the most polluted city (Steubenville, OH) as compared to the least polluted city (Portage, WI). The audit and validation of the Harvard Six Cities Study conducted by the reanalysis team generally confirmed the quality of the data and the numerical results reported by the original investigators.”

It would be hard to find an air pollution study that has been subject to more scrutiny than the Six Cities study. Even if you believed the Six Cities study was totally wrong, its original findings have been replicated numerous times since its publication, with different investigators, in different populations, using different analysis techniques, and in different countries. If you’re looking for an example where the science was either not reproducible or not replicable, sorry, but this is not your case study.

Ultimately, it is clear that the sponsors of this bill are cynically taking advantage of a genuine (but difficult) scientific debate over reproducibility to push a political agenda. Scientists are in agreement that reproducibility is important, but there is no consensus yet on how to make it happen for everyone. By forcing the EPA to ensure reproducibility of the science on which it bases regulation, lawmakers are asking the EPA to solve a problem that the entire scientific community has yet to figure out. The end result of passing a bill like H.R. 4102 is that the EPA will be forced to stop proposing any new regulation, handing a major victory to opponents of air quality standards and dealing a major blow to public health in the U.S.

Posted in Uncategorized | 4 Comments

Data Analysis for Genomics edX Course

Mike Love (@mikelove) and I have been working hard the past couple of months preparing a free online edX course on data analysis for genomics. Our target audience is the postdocs, graduate students, and research scientists who are tasked with analyzing genomics data but don't have any formal training. The eight-week course will start with the very basics, but will ramp up rather quickly and end with real-life workflows for genome variation, RNA-seq, DNA methylation, and ChIP-seq.

Throughout the course students will learn skills and concepts that provide a foundation for analyzing genomics data. Specifically, we will cover exploratory data analysis, basic statistical inference, linear regression, modeling with parametric distributions, empirical Bayes, multiple comparison corrections and smoothing techniques.

In the class we will make heavy use of computer labs. Almost every lecture is accompanied by an R markdown document that students can use to recreate the plots shown in the lectures. The HTML documents generated from these R markdown files will serve as a textbook for the class.
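As a rough illustration of that workflow (the file name here is hypothetical, not an actual course file), rendering one of these R markdown labs to HTML looks something like this:

library(rmarkdown)  ## or knitr::knit2html() on older setups
render("week1_exploratory_data_analysis.Rmd",  ## hypothetical lab file name
       output_format = "html_document")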

Questions will be discussed on online forums led by Stephanie Hicks (@stephaniehicks) and Jim MacDonald.

If you want to sign up, here is the link.

Posted in Uncategorized | 2 Comments

A non-comprehensive comparison of prominent data science programs on cost and frequency.

[Table: comparison of prominent data science programs on cost and course frequency]

We did a really brief comparison of a few notable data science programs for a grant submission we were working on. I thought it was pretty fascinating, so I'm posting it here. A couple of notes about the table.

1. Our program can be taken for free, which includes assessments. If you want the official certificate and to take the capstone you pay the above costs.

2. Udacity's program can also be taken for free, but if you want the official certificate, assessments, or tutoring you pay the above costs.

3. The asterisks denote programs where you get an official master's degree.

4. The MOOC programs (Udacity's and ours) offer the most flexibility in terms of student schedules. Ours is the most flexible, with courses running every month. The in-person programs have the least flexibility but obviously the most direct instructor time.

5. The programs are all quite different in terms of focus, design, student requirements, admissions, instruction, cost, and value.

6. As far as we know, ours is the only one where every bit of lecture content has been open-sourced (https://github.com/DataScienceSpecialization).

Posted in Uncategorized | 9 Comments

The fact that data analysts base their conclusions on data does not mean they ignore experts

Paul Krugman recently joined the bandwagon of hating on the new FiveThirtyEight. I am not crazy about the new website either (although I'll wait more than one week before judging), but in a recent post Krugman creates a false dichotomy that is important to correct. Krugman states that "[w]hat [Nate Silver] seems to have concluded is that there are no experts anywhere, that a smart data analyst can and should ignore all that." I don't think that is what Nate Silver, or any other smart data scientist or applied statistician, has concluded. Note that to build his election prediction model, Nate had to understand how the electoral college works, how polls work, how different polls differ, and the relationship between primaries and the presidential election, among many other details specific to polls and US presidential elections. He learned all of this by reading and talking to experts. The same is true for PECOTA, where data analysts who know quite a bit about baseball collect data to create meaningful and predictive summary statistics. As Jeff said before, the key word in "Data Science" is not Data, it is Science.

The one example Krugman points to as ignoring experts appears to be written by someone who, according to the article that Krugman links to, was biased by his own opinions, not by data analysis that ignored experts. In Nate's analyses of polls and baseball data, by contrast, it is hard to argue that he let bias affect his analysis. Furthermore, it is important to point out that he did not simply stick data into a black box prediction algorithm. Instead he did what most of us applied statisticians do: build empirically inspired models guided by expert knowledge.

ps - Krugman links to a Timothy Egan piece whose title presents another false dichotomy: "Creativity vs. Quants". He should try extracting information from data before assuming there is no creativity involved.

Posted in Uncategorized | 4 Comments

The 80/20 rule of statistical methods development

Developing statistical methods is hard and often frustrating work. One of the underappreciated rules in statistical methods development is what I call the 80/20 rule (maybe it could even be the 90/10 rule). The basic idea is that the first reasonable thing you can do with a set of data often gets you 80% of the way to the optimal solution. Everything after that is working on getting the last 20%. (Edit: Rafa points out that once again I've reverse-scooped a bunch of people and this is already a thing that has been pointed out many times. See for example the Pareto principle and this post, also called the 80:20 rule.)

Sometimes that extra 20% is really important and sometimes it isn't. In a clinical trial, where each additional patient may cost a large amount of money to recruit and enroll, it is definitely worth the effort. For more exploratory techniques, like those often used when analyzing high-dimensional data, it may not be. This is particularly true because the extra 20% usually comes at the cost of additional assumptions about the way the world works. If your assumptions are right, you get the extra 20%; if they are wrong, you may lose ground, and it isn't always clear how much.

Here is a very simple example of the 80/20 rule from frequentist statistics; in my experience similar ideas hold in machine learning and Bayesian inference as well. Suppose that I collect some observations $X_1, \ldots, X_n$ and want to test whether the mean of the observations is greater than 0. Suppose I know that the data are normal and that the variance is equal to 1. Then the absolute best statistical test (called the uniformly most powerful test) you could do rejects the hypothesis that the mean is zero if $\bar{X} > z_{\alpha}\left(\frac{1}{\sqrt{n}}\right)$.

There are a bunch of other tests you could do, though. If you assume the distribution is symmetric, you could also use the sign test to test the same hypothesis by creating the random variables $Y_i = 1(X_i > 0)$ and testing the hypothesis $H_0: \Pr(Y_i = 1) = 0.5$ versus the alternative that the probability is greater than 0.5. Or you could use the one-sided t-test. Or you could use the Wilcoxon test. These are all suboptimal if you know the data are Normal with variance one.

I tried each of these tests with a sample of size $n=20$ at the $\alpha=0.05$ level. In the plot below I show the ratio of power between each non-optimal test and the optimal z-test (you could do this theoretically, but I'm lazy so I did it with simulation; code here, colors by RSkittleBrewer).

[Figure: power of each test relative to the optimal z-test, as a function of the true mean]

The tests reach 80% of the power of the z-test at different values of the true mean (0.6 for the Wilcoxon test, 0.5 for the t-test, and 0.85 for the sign test). Overall, these methods catch up to the optimal method very quickly.
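If you want to reproduce something like the plot above, here is a minimal sketch of the simulation (this is not the code linked above, and it skips the RSkittleBrewer colors), assuming one-sided tests at the 0.05 level with n = 20:

set.seed(1)
n <- 20; alpha <- 0.05; nsim <- 10000  ## reduce nsim for a quicker run
mus <- seq(0, 1, by = 0.1)
power <- sapply(mus, function(mu) {
  rejections <- replicate(nsim, {
    x <- rnorm(n, mean = mu, sd = 1)
    z <- mean(x) > qnorm(1 - alpha) / sqrt(n)  ## optimal z-test (known variance 1)
    tt <- t.test(x, alternative = "greater")$p.value < alpha
    s <- binom.test(sum(x > 0), n, p = 0.5, alternative = "greater")$p.value < alpha  ## sign test
    w <- wilcox.test(x, alternative = "greater")$p.value < alpha
    c(z = z, t = tt, sign = s, wilcoxon = w)
  })
  rowMeans(rejections)  ## estimated power of each test at this true mean
})
relative <- sweep(power[-1, ], 2, power["z", ], "/")  ## power relative to the z-test
matplot(mus, t(relative), type = "l", lty = 1,
        xlab = "True mean", ylab = "Power relative to z-test")
legend("bottomright", legend = rownames(relative), lty = 1, col = 1:3)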

In this case, the non-optimal methods aren't much easier to implement than the optimal solution. But in many cases, the optimal method requires significantly more computation, memory, assumptions, theory, or some combination of the four. The hard decision, when deciding whether to create a new method, is whether that last 20% is worth it. This is obviously application specific.

An important corollary of the 80/20 rule is that you can have a huge impact on new technologies if you are the first to suggest an already known 80% solution. For example, the first person to suggest hierarchical clustering or the singular value decomposition for a new high-dimensional data type will often get a large number of citations. But that is a hard way to make a living - you aren't the only person who knows about these methods and the person who says it first soaks up a huge fraction of the credit. So the only way to take advantage of this corollary is to spend your time constantly trying to figure out what the next big technology will be. And you know what they say about prediction being hard, especially about the future.

Posted in Uncategorized | 1 Comment

The time traveler's challenge.

Editor's note: This has nothing to do with statistics. 

I do a lot of statistics for a living and would claim to know a relatively large amount about it. I also know a little bit about a bunch of other scientific disciplines, a tiny bit of engineering, a lot about pointless sports trivia, some current events, the geography of the world (vaguely) and the geography of places I've lived (pretty well).

I have often wondered, if I was transported back in time to a point before the discovery of say, how to make a fire, how much of human knowledge I could recreate. In other words, what would be the marginal effect on the world of a single person (me) being transported back in time. I could propose Newton's Laws, write down a bunch of the basis of calculus, and discover the central limit theorem. I probably couldn't build an internal combustion engine - I know the concept but don't know enough of the details. So the challenge is this.

 If you were transported back 4,000 or 5,000 years, how much could you accelerate human knowledge?

When I told Leah J. about this idea she came up with an even more fascinating variant.

Suppose that I told you that in 5 days you were going to be transported back 4,000 or 5,000 years but you couldn't take anything with you. What would you read about on Wikipedia? 

Posted in Uncategorized | 27 Comments

ENAR is in Baltimore - Here's What To Do

This year's meeting of the Eastern North American Region of the International Biometric Society (ENAR) is in lovely Baltimore, Maryland. As local residents Jeff and I thought we'd put down a few suggestions for what to do during your stay here in case you're not familiar with the area.

Venue

The conference is being held at the Marriott in the Harbor East area of the city, which is relatively new and a great location. There are a number of good restaurants right in the vicinity, including Wit & Wisdom in the Four Seasons hotel across the street and Pabu, an excellent Japanese restaurant that I personally believe is the best restaurant in Baltimore (a very close second is Woodberry Kitchen, which is a bit farther away near Hampden). If you go to Pabu, just don't get sushi; try something new for a change. Around Harbor East you'll also find Cinghiale (excellent northern Italian restaurant), Charleston (expensive southern food), Lebanese Taverna, and Ouzo Bay. If you're sick of restaurants, there's also a Whole Foods. If you want a great breakfast, you can walk just a few blocks down Aliceanna Street to the Blue Moon Cafe. Get the eggs Benedict. If you get the Cap'n Crunch French toast, you will need a nap afterwards.

Just east of Harbor East is an area called Fell's Point. This is commonly known as the "bar district" and it lives up to its reputation. Max's in Fell's Point (on the square) has an obscene number of beers on tap. The Heavy Seas Alehouse on Central Avenue has some excellent beers from the local Heavy Seas brewery and also has great food from chef Matt Seeber. Finally, the Daily Grind coffee shop is a local institution.

Around the Inner Harbor

Outside of the immediate Harbor East area, there are a number of things to do. For kids, there's Port Discovery, which my 3-year-old son seems to really enjoy. There's also the National Aquarium where the Tuesday networking event will be held. This is also a great place for kids if you're bringing family. There's a neat little park on Pier 6 that is small, but has a number of kid-related things to do. It's a nice place to hang out when the weather is nice. Around the other side of the harbor is the Maryland Science Center, another kid-fun place, and just west of the Harbor down Pratt Street is the B&O Railroad Museum, which I think is good for both kids and adults (I like trains).

Unfortunately, at this time there's no football or baseball to watch.

Around Baltimore

There are a lot of really interesting things to check out around Baltimore if you have the time. If you need to get around downtown and the surrounding areas there's the Charm City Circulator which is a free bus that runs every 15 minutes or so. The Mt. Vernon district has a number of cultural things to do. For classical music fans there's the wonderful Baltimore Symphony Orchestra directed by Marin Alsop. The Peabody Institute often has some interesting concerts going on given by the students there. There's the Walters Art Museum, which is free, and has a very interesting collection. There are also a number of good restaurants and coffee shops in Mt. Vernon, like Dooby's (excellent dinner) and Red Emma's  (lots of Noam Chomsky).

That's all I can think of right now. If you have other questions about Baltimore while you're here for ENAR tweet us up at @simplystats.

Posted in Uncategorized | Leave a comment

How to use Bioconductor to find empirical evidence in support of π being a normal number

Happy π day everybody!

I wanted to write some simple code (included below) to test the parallelization capabilities of my new cluster. So, in honor of π day, I decided to check for evidence that π is a normal number. A normal number is a real number whose infinite sequence of digits has the property that any given m-digit pattern appears with limiting frequency 10^−m. For example, using the Poisson approximation, we can predict that the pattern "123456789" should show up between 0 and 3 times in the first billion digits of π (it actually shows up twice, starting at the 523,551,502-th and 773,349,079-th decimal places).
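As a quick back-of-the-envelope check of that Poisson approximation (a sketch, not part of the analysis below): a fixed 9-digit pattern has roughly 10^9 possible starting positions, each matching with probability 10^−9, so its count is approximately Poisson with mean 1.

lambda <- (1e9 - 8) * 1e-9  ## expected number of occurrences of a fixed 9-digit pattern
ppois(3, lambda)            ## P(count <= 3), about 0.98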

To test our hypothesis, let Y1, ..., Y100 be the number of occurrences of "00", "01", ..., "99" in the first billion digits of π. If π is in fact normal, then the Ys should be approximately IID binomial with N = 1 billion and p = 0.01. In the qq-plot below I show the Z-scores (Y − 10,000,000) / √9,900,000, which appear to follow a normal distribution as predicted by our hypothesis. Further evidence for π being normal is provided by repeating this experiment for 3-, 4-, 5-, 6-, and 7-digit patterns (for 5, 6, and 7 digits I sampled 10,000 patterns). Note that we can also perform a chi-square test against the uniform distribution; the p-values for the different pattern sizes were 0.84, 0.89, 0.92, and 0.99.

[Figure: qq-plots of pattern-count z-scores against the standard normal, by pattern length]

Another test we can perform is to divide the 1 billion digits into 100,000 non-overlapping segments of length 10,000. The vector of counts for any given pattern should also be binomial. Below I also include these qq-plots.

[Figure: qq-plots of per-segment counts for the single-digit patterns]

These observed counts should also be independent, and to explore this we can look at autocorrelation plots:

[Figure: autocorrelation plots of the per-segment counts for each single-digit pattern]

To do this in about an hour and with just a few lines of code (included below), I used the Bioconductor Biostrings package to match strings and the foreach function to parallelize.

library(Biostrings)
library(doParallel)
registerDoParallel(cores = 48)
x=scan("pi-billion.txt",what="c") ##read in the first billion digits of pi
x=substr(x,3,nchar(x)) ##remove the leading "3."
x=BString(x)
n<-length(x)
par(mfrow=c(2,3))
for(d in 2:7){ ##2- to 7-digit patterns; sample 10,000 patterns when d>4
    if(d<5){
      patterns<-sprintf(paste0("%0",d,"d"),seq(0,10^d-1))
    } else{
      patterns<-sprintf(paste0("%0",d,"d"),sample(10^d,10^4)-1)
    }
    ##count occurrences of each pattern in parallel
    res <- foreach(pat=patterns,.combine=c) %dopar% countPattern(pat,x)
    p <- 1/(10^d) ##probability of a given pattern at a given position
    z <- (res - n*p ) / sqrt( n*p*(1-p) )
    qqnorm(z,xlab="Theoretical quantiles",ylab="Observed z-scores",main=paste(d,"digits"))
    abline(0,1)
    ##chi-square test against the uniform distribution
    ##correction: original post had length(res)
    if(d<5) print(1-pchisq(sum ((res - n*p)^2/(n*p)),length(res)-1))
}
###Now count in segments
d <- 1
m <- 10^5 ##segment length

patterns <-sprintf(paste0("%0",d,"d"),seq(0,10^d-1))
res <- foreach(pat=patterns,.combine=cbind) %dopar% {
    tmp<-start(matchPattern(pat,x)) ##positions where the pattern occurs
    tmp2<-floor( (tmp-1)/m) ##index of the segment each occurrence falls in
    return(tabulate(tmp2+1,nbins=n/m)) ##occurrences per segment
}
##qq-plots
par(mfrow=c(2,5))
p <- 1/(10^d)
for(i in 1:ncol(res)){
    z <- (res[,i] - m*p) / sqrt( m*p*(1-p)  )
     qqnorm(z,xlab="Theoretical quantiles",ylab="Observed z-scores",main=paste(i-1))
    abline(0,1)
}
##ACF plots
par(mfrow=c(2,5))
for(i in 1:ncol(res)) acf(res[,i])

NB: A normal number has the above-stated property in any base. The examples above are for base 10.

Posted in Uncategorized | Tagged | Leave a comment

Oh no, the Leekasso....

An astute reader (Niels Hansen, who is visiting our department today) caught a bug in my code on Github for the Leekasso. I had:

lm1 = lm(y ~ leekX)
predict.lm(lm1,as.data.frame(leekX2))

Unfortunately, this meant that I was getting predictions for the training set rather than the test set. Since I set up the test and training sets the same way, this meant that I was actually getting training set error rates for the Leekasso. Niels Hansen noticed the bug and reran the fixed code with this term instead:

lm1 = lm(y ~ ., data = as.data.frame(leekX))
predict.lm(lm1,as.data.frame(leekX2))
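To see why the original version silently returned training-set predictions, here is a small self-contained illustration with toy data (this is not the Leekasso code itself): because leekX is not a column of the new data frame, predict() ends up using the original training matrix found in the formula's environment and simply returns the training fits.

set.seed(1)
y <- rnorm(50)
leekX <- matrix(rnorm(50 * 10), 50, 10)  ## "training" predictors
leekX2 <- matrix(rnorm(50 * 10), 50, 10) ## "test" predictors

fit_bad <- lm(y ~ leekX)                               ## formula refers to the matrix object itself
pred_bad <- predict.lm(fit_bad, as.data.frame(leekX2)) ## newdata is effectively ignored
all.equal(unname(pred_bad), unname(fitted(fit_bad)))   ## TRUE: these are just the training fits

fit_ok <- lm(y ~ ., data = as.data.frame(leekX))       ## columns become named variables
pred_ok <- predict.lm(fit_ok, as.data.frame(leekX2))   ## genuine test-set predictions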

He created a heatmap of the difference in average accuracy between the Leekasso and the Lasso and showed that they are essentially equivalent.

[Figure: heatmap of the difference in average accuracy between the Leekasso and the Lasso]

This is a bummer; the Leekasso isn't a world-crushing algorithm. On the other hand, I'm happy that just choosing the top 10 features is still competitive with the optimized lasso on average. More importantly, although I hate being wrong, I appreciate people taking the time to look through my code.

Just out of curiosity I'm taking a survey. Do you think I should publish this top10 predictor thing as a paper? Or do you think it is too trivial?

Posted in Uncategorized | 9 Comments