## Data Analysis for Genomics edX Course

Mike Love (@mikelove) and I have been working hard the past couple of months preparing a free online edX course on data analysis for genomics. Our target audience is the postdocs, graduate students and research scientists who are tasked with analyzing genomics data but don't have any formal training. The eight-week course will start with the very basics, but will ramp up rather quickly and end with real-life workflows for genome variation, RNA-seq, DNA methylation, and ChIP-seq.

Throughout the course students will learn skills and concepts that provide a foundation for analyzing genomics data. Specifically, we will cover exploratory data analysis, basic statistical inference, linear regression, modeling with parametric distributions, empirical Bayes, multiple comparison corrections and smoothing techniques.

In the class we will make heavy use of computer labs. Almost every lecture is accompanied by an R markdown document that students can use to recreate the plots shown in the lectures. The HTML documents generated from these R markdown files will serve as a textbook for the class.

Questions will be discussed on online forums led by Stephanie Hicks (@stephaniehicks) and Jim MacDonald.

Posted in Uncategorized | 2 Comments

## A non-comprehensive comparison of prominent data science programs on cost and frequency.

We did a really brief comparison of a few notable data science programs for a grant submission we were working on. I thought it was pretty fascinating, so I'm posting it here. A couple of notes about the table.

1. Our program can be taken for free, which includes assessments. If you want the official certificate and to take the capstone you pay the above costs.

2. Udacity's program can also be taken for free, but if you want the official certificate, assessments, or tutoring you pay the above costs.

3. The asterisks denote programs where you get an official master's degree.

4. The MOOC programs (Udacity's and ours) offer the most flexibility in terms of student schedules. Ours is the most flexible, with courses running every month. The in-person programs have the least flexibility but obviously the most direct instructor time.

5. The programs are all quite different in terms of focus, design, student requirements, admissions, instruction, cost and value.

6. As far as we know, ours is the only one where every bit of lecture content has been open sourced (https://github.com/DataScienceSpecialization).


## The fact that data analysts base their conclusions on data does not mean they ignore experts

Paul Krugman recently joined the new FiveThirtyEight hating bandwagon. I am not crazy about the new website either (although I'll wait more than one week before judging) but in a recent post Krugman creates a false dichotomy that is important to correct. Krugman states that "[w]hat [Nate Silver] seems to have concluded is that there are no experts anywhere, that a smart data analyst can and should ignore all that." I don't think that is what Nate Silver, or any other smart data scientist or applied statistician, has concluded. Note that to build his election prediction model, Nate had to understand how the electoral college works, how polls work, how different polls are different, and the relationship between primaries and presidential elections, among many other details specific to polls and US presidential elections. He learned all of this by reading and talking to experts. The same is true for PECOTA, where data analysts who know quite a bit about baseball collect data to create meaningful and predictive summary statistics. As Jeff said before, the key word in "Data Science" is not Data, it is Science.

The one example Krugman points to as ignoring experts appears to be written by someone who, according to the article that Krugman links to, was biased by his own opinions, not by data analysis that ignored experts. However, in Nate's analysis of polls and baseball data it is hard to argue that he let his bias affect his analysis. Furthermore, it is important to point out that he did not simply stick data into a black box prediction algorithm. Instead he did what most of us applied statisticians do: we build empirically inspired models guided by expert knowledge.

ps - Krugman links to a Timothy Egan piece which has another false dichotomy as the title: "Creativity vs. Quants". He should try doing it before assuming there is no creativity involved in extracting information from data.


## The 80/20 rule of statistical methods development

Developing statistical methods is hard and often frustrating work. One of the underappreciated rules in statistical methods development is what I call the 80/20 rule (maybe it could even be the 90/10 rule). The basic idea is that the first reasonable thing you can do to a set of data often gets you 80% of the way to the optimal solution. Everything after that is working on getting the last 20%. (Edit: Rafa points out that once again I've reverse-scooped a bunch of people and this is already a thing that has been pointed out many times. See for example the Pareto principle and this post also called the 80:20 rule)

Sometimes that extra 20% is really important and sometimes it isn't. In a clinical trial, where each additional patient may cost a large amount of money to recruit and enroll, it is definitely worth the effort. For more exploratory techniques like those often used when analyzing high-dimensional data it may not be. This is particularly true because the extra 20% usually comes at a cost of additional assumptions about the way the world works. If your assumptions are right, you get the 20%; if they are wrong, you may lose, and it isn't always clear how much.

Here is a very simple example of the 80/20 rule from frequentist statistics - in my experience similar ideas hold in machine learning and Bayesian inference as well. Suppose that I collect some observations $X_1,\ldots, X_n$ and want to test whether the mean of the observations is greater than 0. Suppose I know that the data are normal and that the variance is equal to 1. Then the absolute best statistical test (called the uniformly most powerful test) you could do rejects the hypothesis that the mean is zero if $\bar{X} > z_{\alpha}\left(\frac{1}{\sqrt{n}}\right)$.

There are a bunch of other tests you could do though. If you assume the distribution is symmetric you could also use the sign test to test the same hypothesis by creating the random variables $Y_i = 1(X_i > 0)$ and testing the hypothesis $H_0: Pr(Y_i = 1) = 0.5$ versus the alternative that the probability is greater than 0.5. Or you could use the one-sided t-test. Or you could use the Wilcoxon test. These are suboptimal if you know the data are Normal with variance one.

I tried each of these tests with a sample of size $n=20$ at the $\alpha=0.05$ level. In the plot below I show the ratio of power between each non-optimal test and the optimal z-test (you could do this theoretically but I'm lazy so did it with simulation, code here, colors by RSkittleBrewer).

The tests get to 80% of the power of the z-test for different sizes of the true mean (0.6 for Wilcoxon, 0.5 for the t-test, and 0.85 for the sign test). Overall, these methods very quickly catch up to the optimal method.
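
A minimal sketch of how a simulation like this could be set up in R. This is not the code linked above: the seed, the number of simulated data sets (`nsim`) and the grid of true means (`mus`) are assumptions chosen only for illustration, so the numbers it produces will not exactly match the plot.

```r
set.seed(1)                     # arbitrary seed (assumption)
n <- 20                         # sample size used in the post
alpha <- 0.05                   # level used in the post
nsim <- 1000                    # simulated data sets per mean (assumption)
mus <- seq(0.1, 1, by = 0.1)    # grid of true means (assumption)

# Estimate the power of each test at a given true mean
power_at <- function(mu) {
  rej <- replicate(nsim, {
    x <- rnorm(n, mean = mu, sd = 1)
    c(z = mean(x) > qnorm(1 - alpha) / sqrt(n),
      t = t.test(x, alternative = "greater")$p.value < alpha,
      wilcoxon = wilcox.test(x, alternative = "greater")$p.value < alpha,
      sign = binom.test(sum(x > 0), n, p = 0.5,
                        alternative = "greater")$p.value < alpha)
  })
  rowMeans(rej)                 # fraction of rejections for each test
}

power <- sapply(mus, power_at)  # rows: tests, columns: true means
# Ratio of each non-optimal test's power to the z-test's power
ratio <- sweep(power[-1, , drop = FALSE], 2, power["z", ], "/")
colnames(ratio) <- mus
round(ratio, 2)
```

Each row of `ratio` corresponds to one curve in a plot like the one described above: the estimated power of the t-test, Wilcoxon test or sign test divided by the estimated power of the z-test at each true mean.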

In this case, the non-optimal methods aren't much easier to implement than the optimal solution. But in many cases, the optimal method requires significantly more computation, memory, assumptions, theory, or some combination of the four. The hard part of deciding whether to create a new method is judging whether the extra 20% is worth it. This is obviously application specific.

An important corollary of the 80/20 rule is that you can have a huge impact on new technologies if you are the first to suggest an already known 80% solution. For example, the first person to suggest hierarchical clustering or the singular value decomposition for a new high-dimensional data type will often get a large number of citations. But that is a hard way to make a living - you aren't the only person who knows about these methods and the person who says it first soaks up a huge fraction of the credit. So the only way to take advantage of this corollary is to spend your time constantly trying to figure out what the next big technology will be. And you know what they say about prediction being hard, especially about the future.


## The time traveler's challenge.

Editor's note: This has nothing to do with statistics.

I do a lot of statistics for a living and would claim to know a relatively large amount about it. I also know a little bit about a bunch of other scientific disciplines, a tiny bit of engineering, a lot about pointless sports trivia, some current events, the geography of the world (vaguely) and the geography of places I've lived (pretty well).

I have often wondered, if I was transported back in time to a point before the discovery of, say, how to make a fire, how much of human knowledge I could recreate. In other words, what would be the marginal effect on the world of a single person (me) being transported back in time? I could propose Newton's Laws, write down a bunch of the basics of calculus, and discover the central limit theorem. I probably couldn't build an internal combustion engine - I know the concept but don't know enough of the details. So the challenge is this.

If you were transported back 4,000 or 5,000 years, how much could you accelerate human knowledge?

When I told Leah J. about this idea she came up with an even more fascinating variant.

Suppose that I told you that in 5 days you were going to be transported back 4,000 or 5,000 years but you couldn't take anything with you. What would you read about on Wikipedia?


## ENAR is in Baltimore - Here's What To Do

This year's meeting of the Eastern North American Region of the International Biometric Society (ENAR) is in lovely Baltimore, Maryland. As local residents Jeff and I thought we'd put down a few suggestions for what to do during your stay here in case you're not familiar with the area.

Venue

The conference is being held at the Marriott in the Harbor East area of the city, which is relatively new and a great location. There are a number of good restaurants right in the vicinity, including Wit & Wisdom in the Four Seasons hotel across the street and Pabu, an excellent Japanese restaurant that I personally believe is the best restaurant in Baltimore (a very close second is Woodberry Kitchen, which is a bit farther away near Hampden). If you go to Pabu, just don't get sushi; try something new for a change. Around Harbor East you'll also find Cinghiale (excellent northern Italian restaurant), Charleston (expensive southern food), Lebanese Taverna, and Ouzo Bay. If you're sick of restaurants, there's also a Whole Foods. If you want a great breakfast, you can walk just a few blocks down Aliceanna Street to the Blue Moon Cafe. Get the eggs Benedict. If you get the Cap'n Crunch French toast, you will need a nap afterwards.

Just east of Harbor East is an area called Fell's Point. This is commonly known as the "bar district" and it lives up to its reputation. Max's in Fell's Point (on the square) has an obscene number of beers on tap. The Heavy Seas Alehouse on Central Avenue has some excellent beers from the local Heavy Seas brewery and also has great food from chef Matt Seeber. Finally, the Daily Grind coffee shop is a local institution.

Around the Inner Harbor

Outside of the immediate Harbor East area, there are a number of things to do. For kids, there's Port Discovery, which my 3-year-old son seems to really enjoy. There's also the National Aquarium where the Tuesday networking event will be held. This is also a great place for kids if you're bringing family. There's a neat little park on Pier 6 that is small, but has a number of kid-related things to do. It's a nice place to hang out when the weather is nice. Around the other side of the harbor is the Maryland Science Center, another kid-fun place, and just west of the Harbor down Pratt Street is the B&O Railroad Museum, which I think is good for both kids and adults (I like trains).

Unfortunately, at this time there's no football or baseball to watch.

Around Baltimore

There are a lot of really interesting things to check out around Baltimore if you have the time. If you need to get around downtown and the surrounding areas there's the Charm City Circulator which is a free bus that runs every 15 minutes or so. The Mt. Vernon district has a number of cultural things to do. For classical music fans there's the wonderful Baltimore Symphony Orchestra directed by Marin Alsop. The Peabody Institute often has some interesting concerts going on given by the students there. There's the Walters Art Museum, which is free, and has a very interesting collection. There are also a number of good restaurants and coffee shops in Mt. Vernon, like Dooby's (excellent dinner) and Red Emma's  (lots of Noam Chomsky).

That's all I can think of right now. If you have other questions about Baltimore while you're here for ENAR tweet us up at @simplystats.

## How to use Bioconductor to find empirical evidence in support of π being a normal number

Happy π day everybody!

I wanted to write some simple code (included below) to test the parallelization capabilities of my new cluster. So, in honor of π day, I decided to check for evidence that π is a normal number. A normal number is a real number whose infinite sequence of digits has the property that the probability of finding any given m-digit pattern at a randomly picked position is $10^{-m}$. For example, using the Poisson approximation, we can predict that the pattern "123456789" should show up between 0 and 3 times in the first billion digits of π (it actually shows up twice, starting at the 523,551,502-th and 773,349,079-th decimal places).
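
The "between 0 and 3 times" prediction can be worked out explicitly. As a sketch, assume the count $Y$ of the pattern is treated as Poisson with mean $\lambda$ equal to the expected number of occurrences (the symbols $Y$ and $\lambda$ are introduced here only for illustration). A 9-digit pattern can start at $10^9 - 8$ positions, each matching with probability $10^{-9}$, so $\lambda \approx 1$ and

$$
\Pr(0 \le Y \le 3) = e^{-1}\left(1 + 1 + \frac{1}{2} + \frac{1}{6}\right) = \frac{8}{3}e^{-1} \approx 0.981 .
$$

Under this approximation, a count between 0 and 3 is expected about 98% of the time, and the observed count of 2 falls comfortably inside that range.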

To test our hypothesis, let Y1, ..., Y100 be the counts of "00", "01", ..., "99" in the first billion digits of π. If π is in fact normal then the Ys should be approximately IID binomials with N = 1 billion and p = 0.01. In the qq-plot below I show Z-scores (Y - 10,000,000) / √9,900,000, which appear to follow a normal distribution as predicted by our hypothesis. Further evidence for π being normal is provided by repeating this experiment for 3, 4, 5, 6, and 7 digit patterns (for 5, 6 and 7 I sampled 10,000 patterns). Note that we can perform a chi-square test for the uniform distribution as well. For patterns of size 1, 2, 3, and 4 the p-values were 0.84, 0.89, 0.92, and 0.99.

Another test we can perform is to divide the 1 billion digits into 100,000 non-overlapping segments of length 10,000. The vector of counts for any given pattern should also be binomial. Below I also include these qq-plots.

These observed counts should also be independent, and to explore this we can look at autocorrelation plots:

To do this in about an hour and with just a few lines of code (included below), I used the Bioconductor Biostrings package to match strings and the foreach function to parallelize.

library(Biostrings)
library(doParallel) ##also attaches foreach
registerDoParallel(cores = 48)
x=scan("pi-billion.txt",what="c")
x=substr(x,3,nchar(x)) ##remove 3.
x=BString(x)
n<-length(x)
par(mfrow=c(2,3))
for(d in 2:4){
  p <- 1/(10^d) ##probability of a given d-digit pattern; set here because it depends on d
  if(d<5){
    patterns<-sprintf(paste0("%0",d,"d"),seq(0,10^d-1))
  } else{
    patterns<-sprintf(paste0("%0",d,"d"),sample(10^d,10^4)-1)
  }
  res <- foreach(pat=patterns,.combine=c) %dopar% countPattern(pat,x)
  z <- (res - n*p ) / sqrt( n*p*(1-p) )
  qqnorm(z,xlab="Theoretical quantiles",ylab="Observed z-scores",main=paste(d,"digits"))
  abline(0,1)
  if(d<5) print(1-pchisq(sum ((res - n*p)^2/(n*p)),length(res)-1))
}
###Now count in segments
d <- 1
m <-10^5

patterns <-sprintf(paste0("%0",d,"d"),seq(0,10^d-1))
res <- foreach(pat=patterns,.combine=cbind) %dopar% {
tmp<-start(matchPattern(pat,x))
tmp2<-floor( (tmp-1)/m)
return(tabulate(tmp2+1,nbins=n/m))
}
##qq-plots
par(mfrow=c(2,5))
p <- 1/(10^d)
for(i in 1:ncol(res)){
z <- (res[,i] - m*p) / sqrt( m*p*(1-p)  )
qqnorm(z,xlab="Theoretical quantiles",ylab="Observed z-scores",main=paste(i-1))
abline(0,1)
}
##ACF plots
par(mfrow=c(2,5))
for(i in 1:ncol(res)) acf(res[,i])

NB: A normal number has the above stated property in any base. The examples above are for base 10.

## Oh no, the Leekasso....

An astute reader (Niels Hansen, who is visiting our department today) caught a bug in my code on Github for the Leekasso. I had:

lm1 = lm(y ~ leekX)
predict.lm(lm1,as.data.frame(leekX2))

Unfortunately, this meant that I was getting predictions for the training set when I asked for predictions on the test set. Since I set up the test/training sets the same, this meant that I was actually getting training set error rates for the Leekasso. Niels Hansen noticed the bug and reran the fixed code with this term instead:

lm1 = lm(y ~ ., data = as.data.frame(leekX))
predict.lm(lm1,as.data.frame(leekX2))

He created a heatmap subtracting the average accuracy of the Leekasso/Lasso and showed they are essentially equivalent.
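
The bug is easy to reproduce on made-up data. A minimal sketch, assuming small simulated matrices: only the names `leekX`, `leekX2` and `y` come from the snippets above, while the sizes and values are hypothetical. With a matrix on the right-hand side of the formula, the model term is called `leekX`, which is not a column of the new data frame, so `predict` falls back to the training matrix it finds in the environment.

```r
set.seed(1)                               # arbitrary seed (assumption)
leekX  <- matrix(rnorm(50 * 3), 50, 3)    # hypothetical training predictors
leekX2 <- matrix(rnorm(50 * 3), 50, 3)    # hypothetical test predictors
y <- rnorm(50)                            # hypothetical outcome

# Buggy version: newdata has columns V1..V3, but the model term is "leekX"
bad <- lm(y ~ leekX)
bad_pred <- predict(bad, as.data.frame(leekX2))
# The "test" predictions are just the training fitted values
isTRUE(all.equal(unname(bad_pred), unname(fitted(bad))))

# Fixed version: columns are matched by name (V1..V3) in both data frames
good <- lm(y ~ ., data = as.data.frame(leekX))
good_pred <- predict(good, as.data.frame(leekX2))
# These predictions really do depend on leekX2
isTRUE(all.equal(unname(good_pred), unname(fitted(good))))
```

Because the training and test matrices have the same number of rows here, the buggy call runs without any warning, which is what makes this kind of mistake easy to miss.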

This is a bummer: the Leekasso isn't a world-crushing algorithm. On the other hand, I'm happy that just choosing the top 10 is still competitive with the optimized lasso on average. More importantly, although I hate being wrong, I appreciate people taking the time to look through my code.

Just out of curiosity I'm taking a survey. Do you think I should publish this top10 predictor thing as a paper? Or do you think it is too trivial?


## Per capita GDP versus years since women received right to vote

Below is a plot of per capita GDP (in log scale) against years since women received the right to vote for 42 countries. Is this cause, effect, both or neither? We all know correlation does not imply causation, but I see many (non-statistical) arguments to support both cause and effect here. Happy International Women's Day!

The data is from here and here. I removed countries where women have had the right to vote for less than 20 years.

ps - What's with Switzerland?

update - R^2 and p-value added to graph


## PLoS One, I have an idea for what to do with all your profits: buy hard drives

I've been closely following the fallout from PLoS One's new policy for data sharing. The policy says, basically, that if you publish a paper, all data and code that go with that paper should be made publicly available at the time of publishing, and that authors should include an explicit data sharing policy in the paper they submit.

I think the reproducibility debate is over. Data should be made available when papers are published. The Potti scandal and the Reinhart/Rogoff scandal have demonstrated the extreme consequences of lack of reproducibility and the reproducibility advocates have taken this one home. The question with reproducibility isn't "if" anymore it is "how".

The transition toward reproducibility is likely to be rough for two reasons. One is that many people who generate data lack training in handling and analyzing data, even in a data saturated field like genomics. The story is even more grim in areas that haven't been traditionally considered "data rich" fields.

The second problem is a cultural and economic problem. It involves the fundamental disconnect between (1) the incentives of our system for advancement, grant funding, and promotion and (2) the policies that will benefit science and improve reproducibility. Most of the debate on social media seems to conflate these two issues. I think it is worth breaking the debate down into three main constituencies: journals, data creators, and data analysts.

Journals with requirements for data sharing

Data sharing, especially for large data sets, isn't easy and it isn't cheap. Not knowing how to share data is not an excuse - to be a modern scientist this is one of the skills you have to have. But if you are a journal that makes huge profits and you want open sharing, you should put up or shut up. The best way to do that would be to pay for storage on something like AWS for all data sets submitted to comply with your new policy. In the era of cheap hosting and standardized templates, charging \$1,000 or more for an open access paper is way too much. It costs essentially nothing to host that paper online and you are getting peer review for free. So you should spend some of your profits paying for the data sharing that will benefit your journal and the scientific community.

Data creators

It is really hard to create a serious, research quality data set in almost any scientific discipline. If you are studying humans, it requires careful adherence to rules and procedures for handling human data. If you are in ecology, it may involve extensive field work. If you are in behavioral research, it may involve careful review of thousands of hours of video tape.

The value of one careful, rigorous, and interesting data set is hard to overstate. In my field, the data Leonid Kruglyak's group generated measuring gene expression and genetics in a careful yeast experiment spawned an entirely new discipline within both genomics and statistics.

The problem is that to generate one really good data set can take months or even years. It is definitely possible to publish more than one paper on a really good data set. But after the data are generated, most of these papers will have to do with data analysis, not data generation. If there are ten papers that could be published on your data set and your group publishes the data with the first one, you may get to the second or third, but someone else might publish 4-10.

This may be good for science, but it isn't good for the careers of data generators. Ask anyone in academics whether they'd rather have 6 citations from awesome papers or 6 awesome papers and 100% of them will take the papers.

I'm completely sympathetic to data generators who spend a huge amount of time creating a data set and are worried they may be scooped on later papers. This is a place where the culture of credit hasn't caught up with the culture of science. If you write a grant and generate an amazing data set that 50 different people use - you should absolutely get major credit for that in your next grant. However, you probably shouldn't get authorship unless you intellectually contributed to the next phase of the analysis.

The problem is we don't have an intermediate form of credit for data generators that is weighted more heavily than a citation. In the short term, this lack of a proper system of credit will likely lead data generators to make the following (completely sensible) decision: hold their data close and then publish multiple papers at once - like ENCODE did. This will drive everyone crazy and slow down science - but it is the appropriate career choice for data generators until our system of credit has caught up.

Data analysts

I think that data analysts who are pushing for reproducibility are genuine in their desire for reproducibility. I also think that the debate is over. I think we can contribute to the success of the reproducibility transition by figuring out ways to give stronger and more appropriate credit to data generators. I don't think authorship is the right approach. But I do think that it is the right approach to loudly and vocally give credit to people who generated the data you used in your purely data analytic paper. That includes making sure the people that are responsible for their promotion and grants know just how incredibly critical it is that they keep generating data so you can keep doing your analysis.

Finally, I think that we should be more sympathetic to the career concerns of folks who generate data. I have written methods and made the code available. I have then seen people write very similar papers using my methods and code - then getting credit/citations for producing a very similar method to my own. Being reverse scooped like this is incredibly frustrating. If you've ever had that experience imagine what it would feel like to spend a whole year creating a data set and then only getting one publication.

I also think that the primary use of reproducibility so far has been as a weapon. It has been used (correctly) to point out critical flaws in research. It has also been used as a way to embarrass authors who don't (and even some who do) have training in data analysis. The transition to fully reproducible science can either be a painful fight or a smoother transition. One thing that would go a long way would be to think of code review/reproducibility not like peer review, but more like pull requests and issues on Github. The goal isn't to show how the other person did it wrong, the goal is to help them do it right.
