Some of my colleagues think of me as super data-sciencey compared to other academic statisticians. But one place I lose tons of street cred in the data science community is when I talk about ggplot2. For the 3 data type people on the planet who still don't know what that is, ggplot2 is an R package/phenomenon for data visualization. It was created by Hadley Wickham, who is (in my opinion) perhaps the most important statistician/data scientist on the planet. It is one of the best maintained, most important, and really well done R packages. Hadley also supports R software like few other people on the planet.
But I don't use ggplot2 and I get nervous when other people do.
I get no end of grief for this from Hilary and Roger and especially from drob, among many others. So I thought I would explain why and defend myself from the internet hordes. To understand why I don't use it, you have to understand the three cases where I use data visualization.
- When creating exploratory graphics - graphs that are fast, not to be shown to anyone else and help me to explore a data set
- When creating expository graphs - graphs that i want to put into a publication that have to be very carefully made.
- When grading student data analyses.
Let's consider each case.
Exploratory graphs don't have to be pretty. I'm going to be the only one who looks at 99% of them. But I have to be able to make them quickly and I have to be able to make a broad range of plots with minimal code. There are a large number of types of graphs, including things like heatmaps, that don't neatly fit into ggplot2 code and therefore make it challenging to make those graphs. The flexibility of base R comes at a price, but it means you can make all sorts of things you need to without struggling against the system. Which is a huge advantage for data analysts. There are some graphs (like this one) that are pretty straightforward in base, but require quite a bit of work in ggplot2. In many cases qplot can be used sort of interchangably with plot, but then you really don't get any of the advantages of the ggplot2 framework.
When making graphs that are production ready or fit for publication, you can do this with any system. You can do it with ggplot2, with lattice, with base R graphics. But regardless of which system you use it will require about an equal amount of code to make a graph ready for publication. One perfect example of this is the comparison of different plotting systems for creating Tufte-like graphs. To create this minimal barchart:
The code they use in base graphics is this (super blurry sorry, you can also go to the website for a better view).
in ggplot2 the code is:
Both require a significant amount of coding. The ggplot2 plot also takes advantage of the ggthemes package here. Which means, without that package for some specific plot, it would require more coding.
The bottom line is for production graphics, any system requires work. So why do I still use base R like an old person? Because I learned all the stupid little tricks for that system, it was a huge pain, and it would be a huge pain to learn it again for ggplot2, to make very similar types of plots. This is one where neither system is particularly better, but the time-optimal solution is to stick with whichever system you learned first.
Grading student work
People I seriously respect suggest teaching ggplot2 before base graphics as a way to get people up and going quickly making pretty visualizations. This is a good solution to the little data scientist's predicament. The tricky thing is that the defaults in ggplot2 are just pretty enough that they might trick you into thinking the graph is production ready using defaults. Say for example you make a plot of the latitude and longitude of quakes data in R, colored by the number of stations reporting. This is one case where ggplot2 crushes base R for simplicity because of the automated generation of a color scale. You can make this plot with just the line:
ggplot() + geom_point(data=quakes,aes(x=lat,y=long,colour=stations))
And get this out:
That is a pretty amazing plot in one line of code! What often happens with students in a first serious data analysis class is they think that plot is done. But it isn't even close. Here are a few things you would need to do to make this plot production ready: (1) make the axes bigger, (2) make the labels bigger, (3) make the labels be full names (latitude and longitude, ideally with units when variables need them), (4) make the legend title be number of stations reporting. Those are the bare minimum. But a very common move by a person who knows a little R/data analysis would be to leave that graph as it is and submit it directly. I know this from lots of experience.
The one nice thing about teaching base R here is that the base version for this plot is either (a) a ton of work or (b) ugly. In either case, it makes the student think very hard about what they need to do to make the plot better, rather than just assuming it is ok.
Where ggplot2 is better for sure
ggplot2 being compatible with piping, having a simple system for theming, having a good animation package, and in general being an excellent platform for developers who create awesome add-ons are all huge advantages. It is also great for getting absolute newbies up and making medium-quality graphics in a huge hurry. This is a great way to get more people engaged in data science and I'm psyched about the reach and power ggplot2 has had. Still, I probably won't use it for my own work, even thought it disappoints my data scientist friends.