Simply Statistics


Why I don't use ggplot2

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

Some of my colleagues think of me as super data-sciencey compared to other academic statisticians. But one place I lose tons of street cred in the data science community is when I talk about ggplot2. For the 3 data type people on the planet who still don't know what that is, ggplot2 is an R package/phenomenon for data visualization. It was created by Hadley Wickham, who is (in my opinion) perhaps the most important statistician/data scientist on the planet. It is one of the best maintained, most important, and really well done R packages. Hadley also supports R software like few other people on the planet.

But I don't use ggplot2 and I get nervous when other people do.

I get no end of grief for this from Hilary and Roger and especially from drob, among many others. So I thought I would explain why and defend myself from the internet hordes. To understand why I don't use it, you have to understand the three cases where I use data visualization.

  1. When creating exploratory graphics - graphs that are fast, not to be shown to anyone else and help me to explore a data set
  2. When creating expository graphs - graphs that i want to put into a publication that have to be very carefully made.
  3. When grading student data analyses.

Let's consider each case.

Exploratory graphs

Exploratory graphs don't have to be pretty. I'm going to be the only one who looks at 99% of them. But I have to be able to make them quickly and I have to be able to make a broad range of plots with minimal code. There are a large number of types of graphs, including things like heatmaps, that don't neatly fit into ggplot2 code and therefore make it challenging to make those graphs. The flexibility of base R comes at a price, but it means you can make all sorts of things you need to without struggling against the system. Which is a huge advantage for data analysts. There are some graphs (like this one) that are pretty straightforward in base, but require quite a bit of work in ggplot2. In many cases qplot can be used sort of interchangably with plot, but then you really don't get any of the advantages of the ggplot2 framework.

Expository graphs

When making graphs that are production ready or fit for publication, you can do this with any system. You can do it with ggplot2, with lattice, with base R graphics. But regardless of which system you use it will require about an equal amount of code to make a graph ready for publication. One perfect example of this is the comparison of different plotting systems for creating Tufte-like graphs. To create this minimal barchart:


The code they use in base graphics is this (super blurry sorry, you can also go to the website for a better view).

Screen Shot 2016-02-11 at 12.56.53 PM

in ggplot2 the code is:

Screen Shot 2016-02-11 at 12.56.39 PM


Both require a significant amount of coding. The ggplot2 plot also takes advantage of the ggthemes package here. Which means, without that package for some specific plot, it would require more coding.

The bottom line is for production graphics, any system requires work. So why do I still use base R like an old person? Because I learned all the stupid little tricks for that system, it was a huge pain, and it would be a huge pain to learn it again for ggplot2, to make very similar types of plots. This is one where neither system is particularly better, but the time-optimal solution is to stick with whichever system you learned first.

Grading student work

People I seriously respect suggest teaching ggplot2 before base graphics as a way to get people up and going quickly making pretty visualizations. This is a good solution to the little data scientist's predicament. The tricky thing is that the defaults in ggplot2 are just pretty enough that they might trick you into thinking the graph is production ready using defaults. Say for example you make a plot of the latitude and longitude of quakes data in R, colored by the number of stations reporting. This is one case where ggplot2 crushes base R for simplicity because of the automated generation of a color scale. You can make this plot with just the line:

ggplot() + geom_point(data=quakes,aes(x=lat,y=long,colour=stations))

And get this out:


That is a pretty amazing plot in one line of code! What often happens with students in a first serious data analysis class is they think that plot is done. But it isn't even close. Here are a few things you would need to do to make this plot production ready: (1) make the axes bigger, (2) make the labels bigger, (3) make the labels be full names (latitude and longitude, ideally with units when variables need them), (4) make the legend title be number of stations reporting. Those are the bare minimum. But a very common move by a person who knows a little R/data analysis would be to leave that graph as it is and submit it directly. I know this from lots of experience.

The one nice thing about teaching base R here is that the base version for this plot is either (a) a ton of work or (b) ugly. In either case, it makes the student think very hard about what they need to do to make the plot better, rather than just assuming it is ok.

Where ggplot2 is better for sure

ggplot2 being compatible with piping, having a simple system for theming, having a good animation package, and in general being an excellent platform for developers who create awesome add-ons are all huge advantages. It is also great for getting absolute newbies up and making medium-quality graphics in a huge hurry. This is a great way to get more people engaged in data science and I'm psyched about the reach and power ggplot2 has had. Still, I probably won't use it for my own work, even thought it disappoints my data scientist friends.


Data handcuffs

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

A few years ago, if you asked me what the top skills I got asked about for students going into industry, I'd definitely have said things like data cleaning, data transformation, database pulls, and other non-traditional statistical tasks. But as companies have progressed from the point of storing data to actually wanting to do something with it, I would say one of the hottest skills is understanding and dealing with data from randomized trials.

In particular I see data scientists talking more about A/B testing, sequential stopping rules, hazard regression and other ideas  that are really common in Biostatistics, which has traditionally focused on the analysis of data from designed experiments in biology.

I think it is great that companies are choosing to do experiments, as this still remains the gold standard for how to generate knowledge about causal effects. One interesting new development though is the extreme lengths it appears some organizations are going to to be "data-driven".  They make all decisions based on data they have collected or experiments they have performed.

But data mostly tell you about small scale effects and things that happened in the past. To be able to make big discoveries/improvements requires (a) having creative ideas that are not data supported and (b) trying them in experiments to see if they work. If you get too caught up in experimenting on the same set of conditions you will inevitably asymptote to a maximum and quickly reach diminishing returns. This is where the data handcuffs come in. Data can only tell you about the conditions that existed in the past, they often can't predict conditions in the future or ideas that may work out or might not.

In an interesting parallel to academic research a good strategy appears to be: (a) trying a bunch of things, including some things that have only a pretty modest chance of success, (b) doing experiments early and often when trying those things, and (c) getting very good at recognizing failure quickly and moving on to ideas that will be fruitful. The challenges are that in part (a) it is often difficult to generate really knew ideas, especially if you are already doing something that has had any level of success. There will be extreme pressure not to change what you are doing. In part (c) the challenge is that if you discard ideas too quickly you might miss a big opportunity, but if you don't discard them quickly enough you will sink a lot of time/cost into utlimately not very fruitful projects.

Regardless, almost all of the most interesting projects I've worked on in my life were not driven by data that suggested they would be successful. They were often risks where the data either wasn't in, or the data supported not doing at all. But as a statistician I decided to straight up ignore the data and try anyway. Then again, these ideas have also been the sources of my biggest flameouts.


Leek group guide to reading scientific papers

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

The other day on Twitter Amelia requested a guide for reading papers


So I came up with a guide which you can find here: Leek group guide to reading papers. I actually found this to be one that I had the hardest time with. I described how I tend to read a paper but I'm not sure that is really the optimal (or even a very good) way. I'd really appreciate pull requests if you have ideas on how to improve the guide.


A menagerie of messed up data analyses and how to avoid them

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

Update: I realize this may seem like I'm picking on people. I really don't mean to, I have for sure made all of these mistakes and many more. I can give many examples, but the one I always remember is the time Rafa saved me from "I got a big one here" when I made a huge mistake as a first year assistant professor.

In any introductory statistics or data analysis class they might teach you the basics, how to load a data set, how to munge it, how to do t-tests, maybe how to write a report. But there are a whole bunch of ways that a data analysis can be screwed up that often get skipped over. Here is my first crack at creating a "menagerie" of messed up data analyses and how you can avoid them. Depending on interest I could probably list a ton more, but as always I'm doing the non-comprehensive list :).



Outcodirection411me switching

What it is: Outcome switching is where you collect data looking at say, the relationship between exercise and blood pressure. Once you have the data, you realize that blood pressure isn't really related to exercise. So you change the outcome and ask if HDL levels are related to exercise and you find a relationship. It turns out that when you do this kind of switch you have now biased your analysis because you would have just stopped if you found the original relationship.

An example: In this article they discuss how Paxil, an anti-depressant, was originally studied for several main outcomes, none of which showed an effect - but some of the secondary outcomes did. So they switched the outcome of the trial and used this result to market the drug.

What you can do: Pre-specify your analysis plan, including which outcomes you want to look at. Then very clearly state when you are analyzing a primary outcome or a secondary analysis. That way people know to take the secondary analyses with a grain of salt. You can even get paid $$ to pre-specify with the OSF's pre-registration challenge.


Garden of forking paths

What it is: In this case you may or may not have specified your outcome and stuck with it. Let's assume you have, so you are still looking at blood pressure and exercise. But it turns out a bunch of people had apparently erroneous measures of blood pressure. So you dropped those measurements and did the analysis with the remaining values. This is a totally sensible thing to do, but if you didn't specify in advance how you would handle bad measurements, you can make a bunch of different choices here (the forking paths). You could drop them, impute them, multiply impute them, weight them, etc. Each of these gives a different result and you can accidentally pick the one that works best even if you are being "sensible"

An exampleThis article gives several examples of the forking paths. One is where authors report that at peak fertility women are more likely to wear red or pink shirts. They made several inclusion/exclusion choices (which women to include in which comparison group) for who to include that could easily have gone a different direction or were against stated rules.

What you can do: Pre-specify every part of your analysis plan, down to which observations you are going to drop, transform, etc. To be honest this is super hard to do because almost every data set is messy in a unique way. So the best thing here is to point out steps in your analysis where you made a choice that wasn't pre-specified and you could have made differently. Or, even better, try some of the different choices and make sure your results aren't dramatically different.



What it is: The nefarious cousin of the garden of forking paths. Basically here the person outcome switches, uses the garden of forking paths, intentionally doesn't correct for multiple testing, or uses any of these other means to cheat and get a result that they like.

An example: This one gets talked about a lot and there is some evidence that it happens. But it is usually pretty hard to ascribe purely evil intentions to people and I'd rather not point the finger here. I think that often the garden of forking paths results in just as bad an outcome without people having to try.

What to do: Know how to do an analysis well and don't cheat.

Update:  Some people define p-hacking differently as when "when honest researchers face ambiguity about what analyses to run, and convince themselves those leading to better results are the correct ones (see e.g., Gelman & Loken, 2014; John, Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011; Vazire, 2015)." This coincides with the definition of "garden of forking paths". I have been asked to point this out on Twitter. It was never my intention to accuse anyone of accusing people of fraud. That being said, I still think that the connotation that many people think of when they think "p-hacking" corresponds to my definition above, although I agree with folks that isn't helpful - which is why I prefer we call the non-nefarious version the garden of forking paths.


paypal15Uncorrected multiple testing 

What it is: This one is related to the garden of forking paths and outcome switching. Most statistical methods for measuring the potential for error assume you are only evaluating one hypothesis at a time. But in reality you might be measuring a ton either on purpose (in a big genomics or neuroimaging study) or accidentally (because you consider a bunch of outcomes). In either case, the expected error rate changes a lot if you consider many hypotheses.

An example:  The most famous example is when someone did an fMRI on a dead fish and showed that there were a bunch of significant regions at the P < 0.05 level. The reason is that there is natural variation in the background of these measurements and if you consider each pixel independently ignoring that you are looking at a bunch of them, a few will have P < 0.05 just by chance.

What you can do: Correct for multiple testing. When you calculate a large number of p-values make sure you know what their distribution is expected to be and you use a method like Bonferroni, Benjamini-Hochberg, or q-value to correct for multiple testing.


animal162I got a big one here

What it is: One of the most painful experiences for all new data analysts. You collect data and discover a huge effect. You are super excited so you write it up and submit it to one of the best journals or convince your boss to be the farm. The problem is that huge effects are incredibly rare and are usually due to some combination of experimental artifacts and biases or mistakes in the analysis. Almost no effects you detect with statistics are huge. Even the relationship between smoking and cancer is relatively weak in observational studies and requires very careful calibration and analysis.

An example: In a paper authors claimed that 78% of genes were differentially expressed between Asians and Europeans. But it turns out that most of the Asian samples were measured in one sample and the Europeans in another. This might explain a large fraction of these differences.

What you can do: Be deeply suspicious of big effects in data analysis. If you find something huge and counterintuitive, especially in a well established research area, spend a lot of time trying to figure out why it could be a mistake. If you don't, others definitely will, and you might be embarrassed.

man298Double complication

What it is: When faced with a large and complicated data set, beginning analysts often feel compelled to use a big complicated method. Imagine you have collected data on thousands of genes or hundreds of thousands of voxels and you want to use this data to predict some health outcome. There is a severe temptation to use deep learning or blend random forests, boosting, and five other methods to perform the prediction. The problem is that complicated methods fail for complicated reasons, which will be extra hard to diagnose if you have a really big, complicated data set.

An example: There are a large number of examples where people use very small training sets and complicated methods. One example (there were many other problems with this analysis, too) is when people tried to use complicated prediction algorithms to predict which chemotherapy would work best using genomics. Ultimately this paper was retracted for may problems, but the complication of the methods plus the complication of the data made it hard to detect.

What you can do: When faced with a big, messy data set, try simple things first. Use linear regression, make simple scatterplots, check to see if there are obvious flaws with the data. If you must use a really complicated method, ask yourself if there is a reason it is outperforming the simple methods because often with large data sets even simple things work.






Image credits:


Exactly how risky is breathing?

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

This article by by George Johnson in the NYT describes a study by Kamen P. Simonov​​ and Daniel S. Himmelstein​ that examines the hypothesis that people living at higher altitudes experience lower rates of lung cancer than people living at lower altitudes.

All of the usual caveats apply. Studies like this, which compare whole populations, can be used only to suggest possibilities to be explored in future research. But the hypothesis is not as crazy as it may sound. Oxygen is what energizes the cells of our bodies. Like any fuel, it inevitably spews out waste — a corrosive exhaust of substances called “free radicals,” or “reactive oxygen species,” that can mutate DNA and nudge a cell closer to malignancy.

I'm not so much focused on the science itself, which is perhaps intriguing, but rather on the way the article was written. First, George Johnson links to the paper itself, already a major victory. Also, I thought he did a very nice job of laying out the complexity of doing a population-level study like this one--all the potential confounders, selection bias, negative controls, etc.

I remember particulate matter air pollution epidemiology used to have this feel. You'd try to do all these different things to make the effect go away, but for some reason, under every plausible scenario, in almost every setting, there was always some association between air pollution and health outcomes. Eventually you start to believe it....


On research parasites and internet mobs - let's try to solve the real problem.

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

A couple of days ago one of the editors of the New England Journal of Medicine posted an editorial showing some moderate level of support for data sharing but also introducing the term "research parasite":

A second concern held by some is that a new class of research person will emerge — people who had nothing to do with the design and execution of the study but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited. There is concern among some front-line researchers that the system will be taken over by what some researchers have characterized as “research parasites.”

While this is obviously the most inflammatory statement in the article, I think that there are several more important and overlooked misconceptions. The biggest problems are:

  1. "The first concern is that someone not involved in the generation and collection of the data may not understand the choices made in defining the parameters.This almost certainly would be the fault of the investigators who published the data. If the authors adhere to good data sharing policies and respond to queries from people using their data promptly then this should not be a problem at all.
  2. "... but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited." The idea that no one should be able to try to disprove ideas with the authors data has been covered in other blogs/on Twitter. One thing I do think is worth considering here is the concern about credit. I think that the traditional way credit has accrued to authors has been citations. But if you get a major study funded, say for 50 million dollars, run that study carefully, sit on a million conference calls, and end up with a single major paper, that could be frustrating. Which is why I think that a better policy would be to have the people who run massive studies get credit in a way that is not papers. They should get some kind of formal administrative credit. But then the data should be immediately and publicly available to anyone to publish on. That allows people who run massive studies to get credit and science to proceed normally.
  3. "The new investigators arrived on the scene with their own ideas and worked symbiotically, rather than parasitically, with the investigators holding the data, moving the field forward in a way that neither group could have done on its own."  The story that follows about a group of researchers who collaborated with the NSABP to validate their gene expression signature is very encouraging. But it isn't the only way science should work. Researchers shouldn't be constrained to one model or another. Sometimes collaboration is necessary, sometimes it isn't, but in neither case should we label the researchers "symbiotic" or "parasitic", terms that have extreme connotations.
  4. "How would data sharing work best? We think it should happen symbiotically, not parasitically." I think that it should happen automatically. If you generate a data set with public funds, you should be required to immediately make it available to researchers in the community. But you should get credit for generating the data set and the hypothesis that led to the data set. The problem is that people who generate data will almost never be as fast at analyzing it as people who know how to analyze data. But both deserve credit, whether they are working together or not.
  5. "Start with a novel idea, one that is not an obvious extension of the reported work. Second, identify potential collaborators whose collected data may be useful in assessing the hypothesis and propose a collaboration. Third, work together to test the new hypothesis. Fourth, report the new findings with relevant coauthorship to acknowledge both the group that proposed the new idea and the investigative group that accrued the data that allowed it to be tested." The trouble with this framework is that it preferentially accrues credit to data generators and doesn't accurately describe the role of either party. To flip this argument around,  you could just as easily say that anyone who uses Steven Salzberg's software for aligning or assembling short reads should make him a co-author. I think Dr. Drazen would agree that not everyone who aligned reads should add Steven as co-author, despite his contribution being critical for the completion of their work.

After the piece was posted there was predictable internet rage from data parasites, a dedicated hashtag, and half a dozen angry blog posts written about the piece. These inspired a follow up piece from Drazen. I recognize why these folks were upset - the "research parasites" thing was unnecessarily inflammatory. But I also sympathize with data creators who are also subject to a tough environment - particularly when they are junior scientists.

I think the response to the internet outrage also misses the mark and comes off as a defense of people with angry perspectives on data sharing. I would have much rather seen a more pro-active approach from a leading journal of medicine. I'd like to see something that acknowledges different contributions appropriately and doesn't slow down science. Something like:

  1. We will require all data, including data from clinical trials, to be made public immediately on publication as long as it poses minimal risk to the patients involved or the patients have been consented to broad sharing.
  2. When data are not made publicly available they are still required to be deposited with a third party such as the NIH or Figshare to be held available for request from qualified/approved researchers.
  3. We will require that all people who use data give appropriate credit to the original data generators in terms of data citations.
  4. We will require that all people who use software/statistical analysis tools give credit to the original tool developers in terms of software citations.
  5. We will include a new designation for leaders of major data collection or software generation projects that can be included to demonstrate credit for major projects undertaken and completed.
  6. When reviewing papers written by experimentalists with no statistical/computational co-authors we will require no fewer than 2 statistical/computational referees to ensure there has not been a mistake made by inexperienced researchers.
  7. When reviewing papers written by statistical/computational authors with no experimental co-authors we will require no fewer than 2 experimental referees to ensure there has not been a mistake made by inexperienced researchers.



Not So Standard Deviations Episode 8 - Snow Day

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

Hilary and I were snowed in over the weekend, so we recorded Episode 8 of Not So Standard Deviations. In this episode, Hilary and I talk about how to get your foot in the door with data science, the New England Journal's view on data sharing, Google's "Cohort Analysis", and trying to predict a movie's box office returns based on the movie's script.

Subscribe to the podcast on iTunes.

Follow @NSSDeviations on Twitter!

Show notes:

Apologies for my audio on this episode. I had a bit of a problem calibrating my microphone. I promise to figure it out for the next episode!

Download the audio for this episode.



Parallel BLAS in R

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

I'm working on a new chapter for my R Programming book and the topic is parallel computation. So, I was happy to see this tweet from David Robinson (@drob) yesterday:

What does this have to do with parallel computation? Briefly, the code generates 5,000 standard normal random variates, repeats this 5,000 times and stores them in a 5,000 x 5,000 matrix (`x'). Then it computes x x'. The second part is key, because it involves a matrix multiplication.

Matrix multiplication in R is handled, at a very low level, by the library that implements the Basic Linear Algebra Subroutines, or BLAS. The stock R that you download from CRAN comes with what's known as a reference implementation of BLAS. It works, it produces what everyone agrees are the right answers, but it is in no way optimized. Here's what I get when I run this code on my Mac using Studio and the CRAN version of R for Mac OS X:

system.time({ x <- replicate(5e3, rnorm(5e3)); tcrossprod(x) })
   user  system elapsed 
 59.622   0.314  59.927 

Note that the "user" time and the "elapsed" time are roughly the same. Note also that I use the tcrossprod() function instead of the otherwise equivalent expression x %*% t(x). Both crossprod() and tcrossprod() are generally faster than using the %*% operator.

Now, when I run the same code on my built-from-source version of R (version 3.2.3), here's what I get:

system.time({ x <- replicate(5e3, rnorm(5e3)); tcrossprod(x) })
   user  system elapsed 
 14.378   0.276   3.344 

Overall, it's faster when I don't run the code through RStudio (14s vs. 59s). Also on this version the elapsed time is about 1/4 the user time. Why is that?

The build-from-source version of R is linked to Apple's Accelerate framework, which is a large library that includes an optimized BLAS library for Intel chips. This optimized BLAS, in addition to being optimized with respect to the code itself, is designed to be multi-threaded so that it can split work off into chunks and run them in parallel on multi-core machines. Here, the tcrossprod() function was run in parallel on my machine, and so the elapsed time was about a quarter of the time that was "charged" to the CPU(s).

David's tweet indicated that when using Microsoft R Open, which is a custom built binary of R, that the (I assume?) elapsed time is 2.5 seconds. Looking at the attached link, it appears that Microsoft's R Open is linked against Intel's Math Kernel Library (MKL) which contains, among other things, an optimized BLAS for Intel chips. I don't know what kind of computer David was running on, but assuming it was similarly high-powered as mine, it would suggest Intel's MKL sees slightly better performance. But either way, both Accelerate and MKL achieve that speed up through custom-coding of the BLAS routines and multi-threading on multi-core systems.

If you're going to be doing any linear algebra in R (and you will), it's important to link to an optimized BLAS. Otherwise, you're just wasting time unnecessarily. Besides Accelerate (Mac) and Intel MKL, theres AMD's ACML library for AMD chips and the ATLAS library which is a general purpose tunable library. Also Goto's BLAS is optimized but is not under active development.


Profile of Hilary Parker

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

If you've ever wanted to know more about my Not So Standard Deviations co-host (and Johns Hopkins graduate) Hilary Parker, you can go check out the great profile of her on the American Statistical Association's This Is Statistics web site.

What advice would you give to high school students thinking about majoring in statistics?

It’s such a great field! Not only is the industry booming, but more importantly, the disciplines of statistics teaches you to think analytically, which I find helpful for just about every problem I run into. It’s also a great field to be interested in as a generalist– rather than dedicating yourself to studying one subject, you are deeply learning a set of tools that you can apply to any subject that you find interesting. Just one glance at the topics covered on The Upshot or 538 can give you a sense of that. There’s politics, sports, health, history… the list goes on! It’s a field with endless possibility for growth and exploration, and as I mentioned above, the more I explore the more excited I get about it.


Not So Standard Deviations Episode 7 - Statistical Royalty

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

The latest episode of Not So Standard Deviations is out, and boy does Hilary have a story to tell.

We also talk about Theranos and the pitfalls of diagnostic testing, Spotify's Discover Weekly playlist generation algorithm (and the need for human product managers), and of course, a little Star Wars. Also, Hilary and I start a new segment where we each give some "free advertising" to something interesting that they think other people should know about.

Show Notes:

Download the audio for this episode.