Simply Statistics


paste0 is statistical computing's most influential contribution of the 21st century

The day I discovered paste0 I literally cried. No more paste(bla, bla, sep=""). While looking through code written by a student who did not know about paste0, I started pondering how many person-hours it has saved humanity. Typing sep="" takes about 1 second. We R users use paste about 100 times a day, and there are about 1,000,000 R users in the world. That's over 3 person-years a day! Next up: read.table0 (who doesn't want header to be TRUE?).
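For anyone who hasn't seen it, the savings look like this:

```r
# The old way: you have to remember sep = "" every single time
old <- paste("chr", 1:3, sep = "")

# The paste0 way: no separator by default
new <- paste0("chr", 1:3)

identical(old, new)  # TRUE; both give "chr1" "chr2" "chr3"
```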


Data supports claim that if Kobe stops ball hogging the Lakers will win more

The Lakers recently snapped a four-game losing streak. In that game Kobe, the league leader in field goal attempts and missed shots, had a season-low 14 points but a season-high 14 assists. This makes sense to me since Kobe shooting less means more efficient players are shooting more. Kobe has a lower career true shooting % than Gasol, Howard, and Nash (ranked 17, 3, and 2 respectively). Despite this he takes more than 1/4 of the shots. Commentators usually praise top scorers no matter what, but recently they have started looking at data and noticed that the Lakers are 6-22 when Kobe has more than 19 field goal attempts and 12-3 in the rest of the games.


This graph shows score differential versus % of shots taken by Kobe*. Linear regression suggests that an increase of 1% in the % of shots taken by Kobe results in a drop of 1.16 points (+/- 0.22) in score differential. It also suggests that when Kobe takes 15% of the shots, the Lakers win by an average of about 10 points; when he takes 30% (not a rare occurrence) they lose by an average of about 5. Of course we should not take this regression analysis too seriously, but it's hard to ignore the fact that when Kobe takes less than 23.25% of the shots the Lakers are 13-1.

I suspect that this relationship is not unique to Kobe and the Lakers. In general, teams with  a more balanced attack probably do better. Testing this could be a good project for Jeff's class.

* I approximated shots taken as field goal attempts + floor(0.5 x Free Throw Attempts).
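The regression itself is a one-liner in R. A minimal sketch, with made-up numbers standing in for the real game data:

```r
# Hypothetical game-level data (the real analysis uses the linked data file)
games <- data.frame(
  kobe_pct   = c(18, 21, 24, 27, 30, 33),  # % of shots taken by Kobe
  score_diff = c(9, 6, 2, -1, -4, -8)      # Lakers score minus opponent score
)

fit <- lm(score_diff ~ kobe_pct, data = games)
coef(fit)["kobe_pct"]  # estimated change in score differential per 1% of shots
```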

Data is here.

Update: Commentator Sidney fixed some entries in the data file. Data and plot updated.


Sunday data/statistics link roundup (1/27/2013)

  1. Wisconsin is decoupling the education and degree granting components of education. This means if you take a MOOC like mine, Brian's or Roger's and there is an equivalent class to pass at Wisconsin, you can take the exam and get credit. This is big. (via Rafa)
  2. This is a really cool MLB visualization done with d3.js and Crossfilter. It was also prototyped in R, which makes it even cooler. (via Rafa via Chris V.)
  3. Harvard is encouraging their professors to only publish in open access journals and to resign from closed access journals. This is another major change and bodes well for the future of open science (again via Rafa - noticing a theme this week?).
  4. This deserves a post all to itself, but Greece is prosecuting a statistician for analyzing data in a way that changed their deficit figure. I wonder what the folks at the International Year of Statistics think about that? (via Alex N.)
  5. Be on the twitters at 10:30AM Tuesday and follow the hashtag #jhsph753 if you want to hear all the crazy stuff I tell my students when I'm running on no sleep.
  6. Thomas at StatsChat is fed up with Nobel correlations. Although I'm still partial to the length of country name association.

My advanced methods class is now being live-tweeted

A student in my class is going to be live-tweeting my (often silly/controversial) comments in the advanced/Ph.D. data analysis and methods class I'm teaching here at Hopkins. The hashtag is #jhsph753 and the class runs from 10:30am to 12:00PM EST. Check it out here.


Why I disagree with Andrew Gelman's critique of my paper about the rate of false discoveries in the medical literature

With a colleague, I wrote a paper titled "Empirical estimates suggest most published medical research is true," which we quietly posted to the arXiv a few days ago. I posted to the arXiv in the interest of open science and because we didn't want to delay the dissemination of our approach during the long review process. I didn't email anyone about the paper or talk to anyone about it, except my friends here locally.

I underestimated the internet. Yesterday, the paper was covered in this piece on the MIT Tech review. That exposure was enough for the paper to appear in a few different outlets. I'm totally comfortable with the paper, but was not anticipating all of the attention so quickly.

In particular, I was a little surprised to see it appear on Andrew Gelman's blog with the disheartening title, "I don’t believe the paper, “Empirical estimates suggest most published medical research is true.” That is, most published medical research may well be true, but I’m not at all convinced by the analysis being used to support this claim." I responded briefly this morning to his post, but then had to run off to teach class. After thinking about it a little more, I realized I have some objections to his critique.

His main criticisms of our paper are: (1) that we framed the problem in terms of type I/type II errors instead of type S versus type M errors (paragraph 2), (2) that we didn't look at replication, we performed inference (paragraph 4), (3) that there is p-value hacking going on (paragraph 4), and (4) he thinks that our model does not apply because p-value hacking may change the assumptions underlying this model, which comes from genomics.

I will handle each of these individually:

(1) This is primarily semantics. Andrew is concerned with interesting/uninteresting via his Type S and Type M errors. We are concerned with true/false positives as defined by type I and type II errors (and a null hypothesis). You might believe that the null is never true - but then by the standards of the original paper all published research is true. Or you might say that a non-null result might have an effect size too small to be interesting - but the framework being used here is hypothesis testing, and we have stated explicitly how we defined a true positive in that framework. We define the error rate by the rate of classifying things as null when they should be classified as alternative, and vice versa. We then estimate the false discovery rate under the framework used to calculate those p-values. So this is not a criticism of our work with evidence; rather, it is a stated difference of opinion about the philosophy of statistics, not supported by conclusive data.

(2) Gelman says he originally thought we would follow up specific p-values to see if the results replicated and makes that a critique of our paper. That would definitely be another approach to the problem. Instead, we chose to perform statistical inference using justified and widely used statistical techniques. Others have taken the replication route, but of course that approach too would be fraught with difficulty - are the exact conditions replicable (e.g. for a clinical trial), can we sample from the same population (if it has changed or is hard to sample), and what do we mean by replicates (would two p-values less than 0.05 be convincing?). This again is not a criticism of our approach, but a statement of another, different analysis Gelman was wishing to see.

(3)-(4) Gelman states, "You don’t have to be Uri Simonsohn to know that there’s a lot of p-hacking going on." Indeed, Uri Simonsohn wrote a paper where he talks about the potential for p-value hacking. He does not collect data from real experiments/analyses, but uses simulations, theoretical arguments, and prospective experiments designed to show specific problems. While these arguments are useful and informative, they give no indication of the extent of p-value hacking in the medical literature. So this argument is made on the basis of a supposition by Gelman that this happens broadly, rather than on data.

My objection to his criticism is that his critiques are based primarily on philosophy (1), a wish that we had done the study a different way (2), and assumptions about the way science works with only anecdotal evidence (3-4).

One thing you could very reasonably argue is how sensitive our approach is to violations of our assumptions (which Gelman implied with criticisms 3-4). To address this, my co-author and I have now performed a simulation analysis. In the first simulation, we considered a case where every p-value less than 0.05 was reported and the p-values were uniformly distributed, just as our assumptions would state. We then plot our estimates of the swfdr versus the truth. Here our estimator works pretty well.



We also simulate a pretty serious p-value hacking scenario where people report only the minimum p-value they observe out of 20 p-values. Here our assumption of uniformity is strongly violated. But we still get pretty accurate estimates of the swfdr for the range of values (14%) we report in our paper.
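For a flavor of what that violation looks like, here is a simplified sketch of the min-of-20 scenario (not our exact simulation code):

```r
set.seed(1)
n <- 10000

# Honest null p-values are uniform on (0, 1)
honest <- runif(n)

# A p-hacker runs 20 tests of a true null and reports only the smallest p-value
hacked <- apply(matrix(runif(n * 20), ncol = 20), 1, min)

# Even with no real signal, "significant" results become the norm
mean(honest < 0.05)  # about 0.05
mean(hacked < 0.05)  # about 0.64, since 1 - 0.95^20 is roughly 0.64
```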


Since I recognize this is only a couple of simulations, I have also put the code up on Github with the rest of our code for the paper so other people can test it out.

Whether you are convinced by Gelman, or convinced by my response, I agree with him that it is pretty unlikely that "most published research is false" so I'm glad our paper is at least bringing that important point up. I also hope that by introducing a new estimator of the science-wise fdr we inspire more methodological development and that philosophical criticisms won't prevent people from looking at the data in new ways.





Statisticians and computer scientists - if there is no code, there is no paper

I think it has been beaten to death that the incentives in academia lean heavily toward producing papers and less toward producing/maintaining software. There are people who are way, way more knowledgeable than I am about building and maintaining software. For example, Titus Brown hit a lot of the key issues in his interview. The open source community is also filled with advocates and researchers who know way more about this than I do.

This post is more about my views on changing the perspective of code/software in the data analysis community. I have been frustrated often with statisticians and computer scientists who write papers where they develop new methods and seem to demonstrate that those methods blow away all their competitors. But then no software is available to actually test and see if that is true. Even worse, sometimes I just want to use their method to solve a problem in our pipeline, but I have to code it from scratch!

I have also had several cases where I emailed the authors for their software and they said it "wasn't fit for distribution" or they "don't have code" or the "code can only be run on our machines". I totally understand the first and last, my code isn't always pretty (I have zero formal training in computer science so messy code is actually the most likely scenario) but I always say, "I'll take whatever you got and I'm willing to hack it out to make it work". I often still am turned down.

So I have a new policy when evaluating CVs of candidates for jobs, or when I'm reading a paper as a referee. If the paper is about a new statistical method or machine learning algorithm and there is no software available for that method - I simply mentally cross it off the CV. If I'm reading a data analysis and there isn't code that reproduces the analysis - I mentally cross it off. In my mind, new methods/analyses without software are just vaporware. Now, you'd definitely have to cross a few papers off my CV based on this principle. I do that. But I'm trying really hard going forward to make sure nothing gets crossed off.

In a future post I'll talk about the new issue I'm struggling with - maintaining all that software I'm creating.



Sunday data/statistics link roundup (1/20/2013)

  1. This might be short. I have a couple of classes starting on Monday. The first is our Johns Hopkins Advanced Methods class. This is one of my favorite classes to teach, our Ph.D. students are pretty awesome and they always amaze me with what they can do. The other is my Coursera debut in Data Analysis. We are at about 88,000 enrolled. Tell your friends, maybe we can make it an even 100k! In related news, some California schools are experimenting with offering credit for online courses. (via Sherri R.)
  2. Some interesting numbers on why there aren't as many "gunners" in the NBA - players who score a huge number of points.  I love the talk about hustling, rotating team defense. I have always enjoyed watching good defense more than good offense. It might not be the most popular thing to watch, but seeing the Spurs rotate perfectly to cover the open man is a thing of athletic beauty. My Aggies aren't too bad at it either...(via Rafa).
  3. A really interesting article suggesting that nonsense math can make arguments seem more convincing to non-technical audiences. This is tangentially related to a previous study which showed that more equations led to fewer citations in biology articles. Overall, my take-home message is that we don't necessarily need fewer equations; we need to elevate statistical/quantitative literacy to the importance of reading literacy. (via David S.)
  4. This has been posted elsewhere, but a reminder to send in your statistical stories for the 365 stories of statistics.
  5. Automatically generate a postmodernism essay. Hit refresh a few times. It's pretty hilarious. It reminds me a lot of this article about statisticians. Here is the technical paper describing how they simulate the essays. (via Rafa)

Comparing online and in-class outcomes

My colleague John McGready has just published a study he conducted comparing the outcomes of students in the online and in-class versions of his Statistical Reasoning in Public Health class that he teaches here in the fall. In this class the online and in-class portions are taught concurrently, so it's basically one big class where some people are not in the building. Everything is the same for both groups--quizzes, tests, homework, instructor, lecture notes. From the article:

The on-campus version employs twice-weekly 90 minute live lectures. Online students view pre-recorded narrated versions of the same materials. Narrated lecture slides are made available to on-campus students.

The on-campus section has 5 weekly office hour sessions. Online students communicate with the course instructor asynchronously via email and a course bulletin board. The instructor communicates with online students in real time via weekly one-hour online sessions. Exams and quizzes are multiple choice. In 2005, on-campus students took timed quizzes and exams on paper in monitored classrooms. Online students took quizzes via a web-based interface with the same time limits. Final exams for the online students were taken on paper with a proctor.

So how did the two groups fare in their final grades? Pretty much the same. First off, the two groups of students were not the same. Online students were 8 years older on average, more likely to have an MD degree, and more likely to be male. Final exam scores between the online and in-class groups differed by -1.2 (out of 100; the online group was lower), and after adjusting for student characteristics they differed by -1.5. In both cases, the difference was not statistically significant.
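The unadjusted comparison boils down to a two-group difference in mean exam scores. A sketch with simulated numbers (not the study's data):

```r
set.seed(2)

# Hypothetical final exam scores for the two groups
online    <- rnorm(100, mean = 85, sd = 8)
on_campus <- rnorm(200, mean = 86.2, sd = 8)

mean(online) - mean(on_campus)     # the analogue of the reported -1.2 difference
t.test(online, on_campus)$p.value  # is that difference statistically significant?
```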

This was not a controlled trial and so there are possibly some problems with unmeasured confounding given that the populations appeared fairly different. It would be interesting to think about a study design that might allow a measure of control or perhaps get a better measure of the difference between online and on-campus learning. But the logistics and demographics of the students would seem to make this kind of experiment challenging.

Here's the best I can think of right now: take a large class (where all students are on-campus) and get a classroom that can fit roughly half the number of students in the class. Then randomize half the students to be in-class and the other half to be online up until the midterm. After the midterm, cross everyone over so that the online group comes into the classroom and the in-class group goes online for the final. It's not perfect--one issue is that course material tends to get harder as the term goes on, and it may be that the "easier" material is better learned online and the harder material better learned on-campus (or vice versa). Any thoughts?


Review of R Graphics Cookbook by Winston Chang

I just got a copy of Winston Chang's book R Graphics Cookbook, published by O'Reilly Media. This book is the latest in a series of O'Reilly books on R, including an R Cookbook. Winston Chang is a graduate student at Northwestern University but is probably better known to R users as an active member of the ggplot2 mailing list and an active contributor to the ggplot2 source code.

The book has a typical cookbook format. After some preliminaries about how to install R packages and how to read data into R (Chapter 1), he quickly launches into exploratory data analysis and graphing. The basic outline of each section is:

  1. Statement of problem ("You want to make a histogram")
  2. Solution: If you can reasonably do it with base R graphics, here's how you do it. Oh, and here's how you do it in ggplot2. Notice how it's better? (He doesn't actually say that. He doesn't have to.)
  3. Discussion: This usually revolves around different options that might be set or alternative approaches.
  4. See also: Other recipes in the book.
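To give a sense of the base-versus-ggplot2 contrast in the recipes, here's the histogram example sketched both ways (my own illustration, not the book's code; assumes ggplot2 is installed):

```r
library(ggplot2)

# Base R graphics: a single function call
hist(mtcars$mpg, breaks = 10, main = "Base R histogram")

# ggplot2: the same plot built from data + aesthetic mapping + geom
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10) +
  ggtitle("ggplot2 histogram")
```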

Interestingly, nowhere in the book is the lattice package mentioned (except in passing). But I suppose that's because ggplot2 pretty much supersedes anything you might want to do with the lattice package. Recently, I've been wondering what the future of the lattice package is, given that it doesn't seem to me to be under very active development. But I digress....

Overall, the book is great. I learned quite a few things just in my initial read of the book and as I dug in a bit more there were some functions that I was not familiar with. Much of the material is straight up ggplot2 stuff so if you're an expert there you probably won't get a whole lot more. But my guess is that most are not experts and so will be able to get something out of the book.

The meat of the book covers a lot of different plotting techniques, enough to make your toolbox quite full. If you pick up this book and think something is missing, my guess is that you're making some pretty esoteric plots. I enjoyed the few sections on specifying colors as well as the recipes on making maps (one of ggplot2's strong points). I wish there were more map recipes, but hey, that's just me.

Towards the end there's a nice discussion of graphics file formats (PDF, PNG, WMF, etc.) and the advantages and disadvantages of each (Chapter 14: Output for Presentation). I particularly enjoyed the discussion of fonts in R graphics since I find this to be a fairly confusing aspect of R, even for seasoned users.

The book ends with a series of recipes related to data manipulation. It's funny how many recipes there are about modifying factor variables, but I guess this is just a function of how annoying it is to modify factor variables. There's also some highlighting of the plyr and reshape2 packages.
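If you've been lucky enough never to fight with factors, here's the flavor of problem those recipes solve (a generic illustration, not a recipe from the book):

```r
sizes <- factor(c("small", "large", "medium", "small"))

# Levels default to alphabetical order, which is rarely what you want in a plot
levels(sizes)  # "large" "medium" "small"

# Reorder the levels explicitly
sizes <- factor(sizes, levels = c("small", "medium", "large"))

# Rename a level in place; assigning raw strings instead is a classic
# source of NAs and warnings
levels(sizes)[levels(sizes) == "medium"] <- "mid"
levels(sizes)  # "small" "mid" "large"
```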

Ultimately, I think this is a nice complement to Hadley Wickham's ggplot2 as most of the recipes focus on implementing plots in ggplot2. I don't think you necessarily need to have a deep understanding of ggplot2 in order to use this book (there are some details in an appendix), but some people might want to grab Hadley's book for more background. In fact, this may be a better book to use to get started with ggplot2 simply because it focuses on specific applications. I kept thinking that if the book had been written using base graphics only, it'd probably have to be 2 or 3 times longer just to fit all the code in, which is a testament to the power and compactness of the ggplot2 approach.

One last note: I got the e-book version of the book, but I would recommend the paper version. With books like these, I like to flip around constantly (since there's no need to read it in a linear fashion) and I find e-readers like iBooks and Kindle Reader to be not so good at this.


R package meme

I just got this from a former student who is working on a project with me: