ENAR is in Baltimore - Here's What To Do

This year's meeting of the Eastern North American Region of the International Biometric Society (ENAR) is in lovely Baltimore, Maryland. As local residents, Jeff and I thought we'd put down a few suggestions for what to do during your stay here, in case you're not familiar with the area.


The conference is being held at the Marriott in the Harbor East area of the city, which is relatively new and a great location. There are a number of good restaurants right in the vicinity, including Wit & Wisdom in the Four Seasons hotel across the street and Pabu, an excellent Japanese restaurant that I personally believe is the best restaurant in Baltimore (a very close second is Woodberry Kitchen, which is a bit farther away near Hampden). If you go to Pabu, just don't get sushi; try something new for a change. Around Harbor East you'll also find Cinghiale (excellent northern Italian restaurant), Charleston (expensive southern food), Lebanese Taverna, and Ouzo Bay. If you're sick of restaurants, there's also a Whole Foods. If you want a great breakfast, you can walk just a few blocks down Aliceanna Street to the Blue Moon Cafe. Get the eggs Benedict. If you get the Cap'n Crunch French toast, you will need a nap afterwards.

Just east of Harbor East is an area called Fell's Point. This is commonly known as the "bar district" and it lives up to its reputation. Max's in Fell's Point (on the square) has an obscene number of beers on tap. The Heavy Seas Alehouse on Central Avenue has some excellent beers from the local Heavy Seas brewery and also has great food from chef Matt Seeber. Finally, the Daily Grind coffee shop is a local institution.

Around the Inner Harbor

Outside of the immediate Harbor East area, there are a number of things to do. For kids, there's Port Discovery, which my 3-year-old son seems to really enjoy. There's also the National Aquarium where the Tuesday networking event will be held. This is also a great place for kids if you're bringing family. There's a neat little park on Pier 6 that is small, but has a number of kid-related things to do. It's a nice place to hang out when the weather is nice. Around the other side of the harbor is the Maryland Science Center, another kid-fun place, and just west of the Harbor down Pratt Street is the B&O Railroad Museum, which I think is good for both kids and adults (I like trains).

Unfortunately, at this time there's no football or baseball to watch.

Around Baltimore

There are a lot of really interesting things to check out around Baltimore if you have the time. If you need to get around downtown and the surrounding areas there's the Charm City Circulator, which is a free bus that runs every 15 minutes or so. The Mt. Vernon district has a number of cultural things to do. For classical music fans there's the wonderful Baltimore Symphony Orchestra directed by Marin Alsop. The Peabody Institute often has some interesting concerts going on given by the students there. There's the Walters Art Museum, which is free, and has a very interesting collection. There are also a number of good restaurants and coffee shops in Mt. Vernon, like Dooby's (excellent dinner) and Red Emma's (lots of Noam Chomsky).

That's all I can think of right now. If you have other questions about Baltimore while you're here for ENAR, tweet us at @simplystats.


How to use Bioconductor to find empirical evidence in support of π being a normal number

Happy π day everybody!

I wanted to write some simple code (included below) to test the parallelization capabilities of my new cluster. So, in honor of π day, I decided to check for evidence that π is a normal number. A normal number is a real number whose infinite sequence of digits has the property that any given m-digit pattern appears with limiting frequency 10^-m. For example, using the Poisson approximation, we can predict that the pattern "123456789" should show up between 0 and 3 times in the first billion digits of π (it actually shows up twice, starting at the 523,551,502-th and 773,349,079-th decimal places).
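To see where the 0-to-3 prediction comes from, here is a quick sketch (in Python rather than the post's R, since it is just arithmetic): a fixed 9-digit pattern can start at roughly 10^9 positions, each matching with probability 10^-9, so the number of matches is approximately Poisson with mean 1, and the range 0 to 3 carries about 98% of the probability.

```python
from math import exp, factorial

# Roughly 10^9 starting positions for a 9-digit pattern in a billion digits,
# each matching a fixed pattern with probability 10^-9
n_positions = 10**9
p_match = 10**-9
lam = n_positions * p_match  # Poisson rate, approximately 1

def poisson_pmf(k, lam):
    """P(exactly k matches) under the Poisson approximation."""
    return exp(-lam) * lam**k / factorial(k)

prob_0_to_3 = sum(poisson_pmf(k, lam) for k in range(4))
print(round(prob_0_to_3, 3))  # about 0.981
```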

To test our hypothesis, let Y1, ..., Y100 be the number of occurrences of "00", "01", ..., "99" in the first billion digits of π. If π is in fact normal, then the Ys should be approximately IID binomial with N = 1 billion and p = 0.01. In the qq-plot below I show Z-scores, (Y - 10,000,000) / √9,900,000, which appear to follow a normal distribution as predicted by our hypothesis. Further evidence for π being normal is provided by repeating this experiment for 3, 4, 5, 6, and 7 digit patterns (for 5, 6, and 7 I sampled 10,000 patterns). Note that we can perform a chi-square test for the uniform distribution as well. For patterns of size 1, 2, 3, and 4 the p-values were 0.84, 0.89, 0.92, and 0.99.
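Spelling out the Z-score arithmetic (a Python sketch, not the post's R; the example counts below are made up for illustration): with N = 10^9 and p = 0.01, each count has mean 10,000,000 and standard deviation √9,900,000 ≈ 3,146.

```python
from math import sqrt

def pattern_zscore(count, n=10**9, p=0.01):
    """Z-score of an observed pattern count under Binomial(n, p)."""
    mean = n * p                # 10,000,000 for 2-digit patterns
    sd = sqrt(n * p * (1 - p))  # sqrt(9,900,000), about 3,146
    return (count - mean) / sd

# A count at the expected value gives z = 0; a count about one
# standard deviation above it gives z close to 1
print(pattern_zscore(10_000_000))            # 0.0
print(round(pattern_zscore(10_003_146), 2))  # about 1.0
```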


Another test we can perform is to divide the 1 billion digits into 100,000 non-overlapping segments of length 10,000. The vector of counts for any given pattern should also be binomial. Below I also include these qq-plots.
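The logic of the segment check can be illustrated with pseudorandom digits instead of π (a hedged Python sketch; the post's analysis uses R and the actual digits): for a 1-digit pattern, each length-10,000 segment should contain a Binomial(10000, 0.1) number of occurrences, i.e. about 1,000.

```python
import random

random.seed(42)
seg_len = 10_000  # length of each non-overlapping segment
n_segs = 100      # the post uses 100,000 segments; fewer here for speed
p = 0.1           # chance any given position holds the digit "7"

# Count occurrences of "7" in each segment of random digits
counts = []
for _ in range(n_segs):
    seg = "".join(random.choice("0123456789") for _ in range(seg_len))
    counts.append(seg.count("7"))

mean_count = sum(counts) / n_segs
print(round(mean_count))  # close to seg_len * p = 1000
```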


These observed counts should also be independent, and to explore this we can look at autocorrelation plots:
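For readers who haven't met autocorrelation plots: the lag-k autocorrelation is the correlation between the series and itself shifted by k positions, and for independent counts every lag past 0 should sit near zero. A self-contained sketch (Python stdlib standing in for R's acf; the Gaussian counts below are simulated, not the π counts):

```python
import random

def acf(x, lag):
    """Sample autocorrelation of the sequence x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var

random.seed(1)
# Stand-in for independent segment counts with mean 1000, sd 30
counts = [random.gauss(1000, 30) for _ in range(5000)]

print(acf(counts, 0))             # 1.0 by construction
print(abs(acf(counts, 1)) < 0.1)  # near zero for independent counts
```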


To do this in about an hour and with just a few lines of code (included below), I used the Bioconductor Biostrings package to match strings and the foreach function to parallelize.

library(Biostrings)
library(doParallel)
registerDoParallel(cores = 48)

## x holds the first billion digits of pi as one long string,
## read in from a local file (the download step is not shown here)
x <- substr(x, 3, nchar(x)) ## remove the leading "3."
n <- nchar(x)

## Count patterns of 2 to 7 digits across the full sequence
for(d in 2:7){
    if(d < 5){
        patterns <- sprintf(paste0("%0", d, "d"), seq(0, 10^d - 1))
    } else{
        ## for 5-7 digits, sample 10,000 patterns rather than enumerating all
        patterns <- sprintf(paste0("%0", d, "d"), sample(0:(10^d - 1), 10^4))
    }
    p <- 1/(10^d)
    res <- foreach(pat = patterns, .combine = c) %dopar% countPattern(pat, x)
    z <- (res - n*p) / sqrt(n*p*(1-p))
    qqnorm(z, xlab = "Theoretical quantiles", ylab = "Observed z-scores", main = paste(d, "digits"))
    ## chi-square test for uniformity (only when all patterns were counted)
    ## correction: original post had length(res)
    if(d < 5) print(1 - pchisq(sum((res - n*p)^2/(n*p)), length(res) - 1))
}

### Now count in segments
d <- 1
m <- 10^4 ## segment length: 100,000 non-overlapping segments of length 10,000

patterns <- sprintf(paste0("%0", d, "d"), seq(0, 10^d - 1))
res <- foreach(pat = patterns, .combine = cbind) %dopar% {
    tmp <- start(matchPattern(pat, x)) ## positions where the pattern occurs
    tmp2 <- floor((tmp - 1)/m)         ## segment index of each occurrence
    tabulate(tmp2 + 1, nbins = 10^5)   ## per-segment counts
}
p <- 1/(10^d)
for(i in 1:ncol(res)){
    z <- (res[,i] - m*p) / sqrt(m*p*(1-p))
    qqnorm(z, xlab = "Theoretical quantiles", ylab = "Observed z-scores", main = paste(i-1))
}
##ACF plots
for(i in 1:ncol(res)) acf(res[,i])

NB: A normal number has the above stated property in any base. The examples above are for base 10.


Oh no, the Leekasso....

An astute reader (Niels Hansen, who is visiting our department today) caught a bug in my code on Github for the Leekasso. I had:

lm1 = lm(y ~ leekX)

Unfortunately, this meant that I was getting predictions for the training set on the test set. Since I set up the test/training sets the same way, this meant that I was actually getting training set error rates for the Leekasso. Niels Hansen noticed the bug and reran the fixed code with this term instead:

lm1 = lm(y ~ ., data = as.data.frame(leekX))
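The reason the bug inflated the results: error measured on the same data used to fit a model is systematically optimistic. A toy illustration of that general point (Python with simulated data; it has nothing to do with the actual Leekasso code): fit a line by least squares on a training sample, then compare its squared error on that sample against fresh data from the same model.

```python
import random

random.seed(7)

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def mse(xs, ys, a, b):
    """Mean squared error of the fitted line on (xs, ys)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def simulate(n=30):
    """Draw a sample from y = 2 + 0.5*x + Gaussian noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2 + 0.5 * x + random.gauss(0, 1) for x in xs]
    return xs, ys

reps = 200
train_err = test_err = 0.0
for _ in range(reps):
    train, test = simulate(), simulate()
    a, b = fit_line(*train)
    train_err += mse(*train, a, b) / reps
    test_err += mse(*test, a, b) / reps

# Averaged over replications, training error understates test error
print(train_err < test_err)
```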

He created a heatmap of the difference in average accuracy between the Leekasso and the Lasso and showed that they are essentially equivalent.


This is a bummer; the Leekasso isn't a world-crushing algorithm. On the other hand, I'm happy that just choosing the top 10 is still competitive with the optimized lasso on average. More importantly, although I hate being wrong, I appreciate people taking the time to look through my code.

Just out of curiosity I'm taking a survey. Do you think I should publish this top-10 predictor thing as a paper? Or do you think it is too trivial?


Per capita GDP versus years since women received right to vote

Below is a plot of per capita GDP (on a log scale) against years since women received the right to vote for 42 countries. Is this cause, effect, both, or neither? We all know correlation does not imply causation, but I see many (non-statistical) arguments to support both cause and effect here. Happy International Women's Day!

The data is from here and here. I removed countries where women have had the right to vote for less than 20 years.

ps - What's with Switzerland?

Update - R^2 and p-value added to the graph.


PLoS One, I have an idea for what to do with all your profits: buy hard drives

I've been closely following the fallout from PLoS One's new policy for data sharing. The policy says, basically, that if you publish a paper, all data and code to go with that paper should be made publicly available at the time of publication, and authors must include an explicit data sharing statement in the papers they submit.

I think the reproducibility debate is over. Data should be made available when papers are published. The Potti scandal and the Reinhart/Rogoff scandal have demonstrated the extreme consequences of lack of reproducibility and the reproducibility advocates have taken this one home. The question with reproducibility isn't "if" anymore it is "how".

The transition toward reproducibility is likely to be rough for two reasons. One is that many people who generate data lack training in handling and analyzing data, even in a data saturated field like genomics. The story is even more grim in areas that haven't been traditionally considered "data rich" fields.

The second problem is a cultural and economic problem. It involves the fundamental disconnect between (1) the incentives of our system for advancement, grant funding, and promotion and (2) the policies that will benefit science and improve reproducibility. Most of the debate on social media seems to conflate these two issues. I think it is worth breaking the debate down into three main constituencies: journals, data creators, and data analysts.

Journals with requirements for data sharing

Data sharing, especially for large data sets, isn't easy and it isn't cheap. Not knowing how to share data is not an excuse - to be a modern scientist this is one of the skills you have to have. But if you are a journal that makes huge profits and you want open sharing, you should put up or shut up. The best way to do that would be to pay for storage on something like AWS for all data sets submitted to comply with your new policy. In the era of cheap hosting and standardized templates, charging $1,000 or more for an open access paper is way too much. It costs essentially nothing to host that paper online and you are getting peer review for free. So you should spend some of your profits paying for the data sharing that will benefit your journal and the scientific community.

Data creators

It is really hard to create a serious, research quality data set in almost any scientific discipline. If you are studying humans, it requires careful adherence to rules and procedures for handling human data. If you are in ecology, it may involve extensive field work. If you are in behavioral research, it may involve careful review of thousands of hours of video tape.

The value of one careful, rigorous, and interesting data set is hard to overstate. In my field, the data Leonid Kruglyak's group generated measuring gene expression and genetics in a careful yeast experiment spawned an entirely new discipline within both genomics and statistics.

The problem is that to generate one really good data set can take months or even years. It is definitely possible to publish more than one paper on a really good data set. But after the data are generated, most of these papers will have to do with data analysis, not data generation. If there are ten papers that could be published on your data set and your group publishes the data with the first one, you may get to the second or third, but someone else might publish 4-10.

This may be good for science, but it isn't good for the careers of data generators. Ask anyone in academics whether they'd rather have 6 citations from awesome papers or 6 awesome papers, and 100% of them will take the papers.

I'm completely sympathetic to data generators who spend a huge amount of time creating a data set and are worried they may be scooped on later papers. This is a place where the culture of credit hasn't caught up with the culture of science. If you write a grant and generate an amazing data set that 50 different people use - you should absolutely get major credit for that in your next grant. However, you probably shouldn't get authorship unless you intellectually contributed to the next phase of the analysis.

The problem is we don't have an intermediate form of credit for data generators that is weighted more heavily than a citation. In the short term, this lack of a proper system of credit will likely lead data generators to make the following (completely sensible) decision to hold their data close and then publish multiple papers at once - like ENCODE did. This will drive everyone crazy and slow down science - but it is the appropriate career choice for data generators until our system of credit has caught up.

Data analysts

I think that data analysts who are pushing for reproducibility are genuine in their desire for reproducibility. I also think that the debate is over. I think we can contribute to the success of the reproducibility transition by figuring out ways to give stronger and more appropriate credit to data generators. I don't think authorship is the right approach. But I do think that it is the right approach to loudly and vocally give credit to people who generated the data you used in your purely data analytic paper. That includes making sure the people that are responsible for their promotion and grants know just how incredibly critical it is that they keep generating data so you can keep doing your analysis.

Finally, I think that we should be more sympathetic to the career concerns of folks who generate data. I have written methods and made the code available. I have then seen people write very similar papers using my methods and code - then getting credit/citations for producing a very similar method to my own. Being reverse scooped like this is incredibly frustrating. If you've ever had that experience imagine what it would feel like to spend a whole year creating a data set and then only getting one publication.

I also think that the primary use of reproducibility so far has been as a weapon. It has been used (correctly) to point out critical flaws in research. It has also been used as a way to embarrass authors who don't (and even some who do) have training in data analysis. The transition to fully reproducible science can either be a painful fight or a smoother transition. One thing that would go a long way would be to think of code review/reproducibility not like peer review, but more like pull requests and issues on Github. The goal isn't to show how the other person did it wrong, the goal is to help them do it right.



Data Science is Hard, But So is Talking

Jeff, Brian, and I had to record nine separate introductory videos for our Data Science Specialization and, well, some of us were better at it than others. It takes a bit of practice to read effectively from a teleprompter, something that is exceedingly obvious from this video.


Here's why the scientific publishing system can never be "fixed"

There's been much discussion recently about how the scientific publishing system is "broken". Just the latest one that I saw was a tweet from Princeton biophysicist Josh Shaevitz:

On this blog, we've talked quite a bit about the publishing system, including in this interview with Michael Eisen. Jeff recently posted about changing the reviewing system (again). We have a few other posts on this topic. Yes, we like to complain like the best of them.

But there's a simple fact: The scientific publishing system, as broken as you may find it to be, can never truly be fixed.

Here's the tl;dr

  • The collection of scientific publications out there makes up a marketplace of ideas, hypotheses, theorems, conjectures, and comments about nature.
  • Each member of society has an algorithm for placing a value on each of those publications. Valuation methodologies vary, but they often include factors like the reputation of the author(s), the journal in which the paper was published, the source of funding, as well as one's own personal beliefs about the quality of the work described in the publication.
  • Given a valuation methodology, each scientist can rank order the publications from "most valuable" to "least valuable".
  • Fixing the scientific publication system would require forcing everyone to agree on the same valuation methodology for all publications.

The Marketplace of Publications

The first point is that the collection of scientific publications makes up a kind of market of ideas. Although we don't really "trade" publications in this market, we do estimate the value of each publication and label some as "important" and some as not important. I think this framing is useful because it allows us to draw analogies with other types of markets. In particular, consider the following question: Can you think of a market in any item where each item was priced perfectly, so that every (rational) person agreed on its value? I can't.

Consider the stock market, which might be the most analyzed market in the world. Professional investors make their entire living analyzing the companies that are listed on stock exchanges and buying and selling their shares based on what they believe is the value of those companies. And yet, there can be huge disagreements over the valuation of these companies. Consider the current Herbalife drama, where investors William Ackman and Carl Icahn (and Daniel Loeb) are taking completely opposite sides of the trade (Ackman is short and Icahn is long). They can't both be right about the valuation; they must have different valuation strategies. Every day, the market's collective valuation of different companies changes, reacting to new information and perhaps to irrational behavior. In the long run, good companies survive while others do not. In the meantime, everyone will argue about the appropriate price.

Journals are in some ways like the stock exchanges of yore. There are very prestigious ones (e.g. NYSE, the "big board") and there are less prestigious ones (e.g. NASDAQ) and everyone tries to get their publication into the prestigious journals. Journals have listing requirements--you can't just put any publication in the journal. It has to meet certain standards set by the journal. The importance of being listed on a prestigious stock exchange has diminished somewhat over the years. The most valuable company in the world trades on the NASDAQ.  Similarly, although Science, Nature, and the New England Journal of Medicine are still quite sought after by scientists, competition is increasing from journals (such as those from the Public Library of Science) who are willing to publish papers that are technically correct and let readers determine their importance.

What's the "Fix"?

Now let's consider a world where we obliterate journals like Nature and Science and there's only the "one true journal". Suppose this journal accepts any publication that satisfies some basic technical requirements (i.e. not content-based) and then has a sophisticated rating system that allows readers to comment on, rate, and otherwise evaluate each publication. There is no pre-publication peer review. Everything is immediately published. Problem solved? Not really, in my opinion. Here's what I think would end up happening:

  • People would have to (slightly) alter their methodology for ranking individual scientists. They would not be able to say "so-and-so has 10 Nature papers, so he must be good". But most likely, another proxy for actually reading the papers would arise. For example, "My buddy from University of Whatever put this paper in his top-ten list, so it must be good". As Michael Eisen said in our interview, the ranking system induced by journals like Science and Nature is just an abstract hierarchy; we can still reproduce the hierarchy even if Science/Nature don't exist.
  • In the current system, certain publications often "get stuck" with overly inflated valuations and it is often difficult to effectively criticize such publications because there does not exist an equivalent venue for informed criticism on par with Science and Nature. These publications achieve such high valuations partly because they are published in high-end journals like Nature and Science, but partly it is because some people actually believe they are valuable. In other words, it is possible to create a "bubble" where people irrationally believe a publication is valuable, just because everyone believes it's valuable. If you destroy the current publication system, there will still be publications that are "over-valued", just like in every other market. And furthermore, it will continue to be difficult to criticize such publications. Think of all the analysts that were yelling about how the housing market was dangerously inflated back in 2007. Did anyone listen? Not until it was too late.

What Can be Done?

I don't mean for this post to be depressing, but I think there's a basic reality about publication that perhaps is not fully appreciated. That said, I believe there are things that can be done to improve science itself, as well as the publication system.

  • Raise the ROC curves of science. Efforts in this direction make everyone better and improve our ability to make more important discoveries.
  • Increase the reproducibility of science. This is kind of the "Sarbanes-Oxley" of science. For the most part, I think the debate about whether science should be made more reproducible is coming to a close (or it is for me). The real question is how do we do it, for all scientists? I don't think there are enough people thinking about this question. It will likely be a mix of different strategies, policies, incentives, and tools.
  • Develop more sophisticated evaluation technologies for publications. Again, to paraphrase Michael Eisen, we are better able to judge the value of a pencil on Amazon than we are able to judge a scientific publication. The technology exists for improving the system, but someone has to implement it. I think a useful system along these lines would go a long way towards de-emphasizing the importance of "vanity journals" like Nature and Science.
  • Make open access more accessible. Open access journals have been an important addition to the publication universe, but they are still very expensive (the cost has just been shifted). We need to think more about lowering the overall cost of publication so that it is truly open access.

Ultimately, in a universe where there are finite resources, a system has to be developed to determine how those resources should be distributed. Any system that we can come up with will be flawed as there will by necessity have to be winners and losers. I think there are serious efforts that need to be made to make the system more fair and more transparent, but the problem will never truly be "fixed" to everyone's satisfaction.


Why do we love R so much?

When Jeff, Brian, and I started the Johns Hopkins Data Science Specialization we decided early on to organize the program around using R. Why? Because we love R, we use it everyday, and it has an incredible community of developers and users. The R community has created an ecosystem of packages and tools that lets R continue to be relevant and useful for real problems.

We created a short video to talk about one of the reasons we love R so much.


k-means clustering in a GIF

k-means is a simple and intuitive clustering approach. Here is a movie showing how it works:



Repost: Ronald Fisher is one of the few scientists with a legit claim to most influential scientist ever

Editor's Note: This is a repost of "R.A. Fisher is the most influential scientist ever" with a picture of my pilgrimage to his gravesite in Adelaide, Australia.

You can now see profiles of famous scientists on Google Scholar Citations. Here are links to a few of them (via Ben L.): von Neumann, Einstein, Newton, Feynman.

But their impact on science pales in comparison (with the possible exception of Newton) to the impact of one statistician: R.A. Fisher. Many of the concepts he developed are so common and are considered so standard, that he is never cited/credited. Here are some examples of things he invented along with a conservative number of citations they would have received calculated via Google Scholar*.

  1. P-values - 3 million citations
  2. Analysis of variance (ANOVA) - 1.57 million citations
  3. Maximum likelihood estimation - 1.54 million citations
  4. Fisher's linear discriminant - 62,400 citations
  5. Randomization/permutation tests - 37,940 citations
  6. Genetic linkage analysis - 298,000 citations
  7. Fisher information - 57,000 citations
  8. Fisher's exact test - 237,000 citations

A couple of notes:

  1. These are seriously conservative estimates, since I only searched for a few variants on some key words.
  2. These numbers are BIG; there isn't another scientist in the ballpark. The guy who wrote the “most highly cited paper” got 228,441 citations on GS. His next most cited paper? 3,000 citations. Fisher has at least 5 papers with more citations than his best one.
  3. This page says Bert Vogelstein has the most citations of any person over the last 30 years. If you add up the number of citations to his top 8 papers on GS, you get 57,418. About as many as to the Fisher information matrix.

I think this really speaks to a couple of things. One is that Fisher invented some of the most critical concepts in statistics. The other is the breadth of impact of statistical ideas across a range of disciplines. In any case, I would be hard pressed to think of another scientist who has influenced a greater range or depth of scientists with their work.

Update: I recently went to Adelaide to give a couple of talks on bioinformatics, statistics, and MOOCs. My host Gary informed me that Fisher was buried in Adelaide. I went to the cathedral to see the memorial and took this picture. I couldn't get my face in the picture because the plaque was on the ground. You'll have to trust me that these are my shoes.


* Calculations of citations:

  1. As described in a previous post
  2. # of GS results for “Analysis of Variance” + # for “ANOVA” - “Analysis of Variance”
  3. # of GS results for “maximum likelihood”
  4. # of GS results for “linear discriminant”
  5. # of GS results for “permutation test” + # for “permutation tests” - “permutation test”
  6. # of GS results for “linkage analysis”
  7. # of GS results for “fisher information” + # for “information matrix” - “fisher information”
  8. # of GS results for “fisher’s exact test” + # for “fisher exact test” - “fisher’s exact test”