Tag: visualization


Introducing the healthvis R package - one line D3 graphics with R

We have been a little slow on the posting for the last couple of months here at Simply Stats. That’s bad news for the blog, but good news for our research programs!

Today I’m announcing the new healthvis R package that is being developed by my student Prasad Patil (who needs a website like yesterday), Hector Corrada Bravo, and myself*. The basic idea is that I have loved D3 interactive graphics for a while. But they are hard to create from scratch, since they require knowledge of both Javascript and the D3 library.

Even with those skills, it can take a while to develop a new graphic. On the other hand, I know a lot about R and am often analyzing biomedical data where interactive graphics could be hugely useful. There are a couple of really useful tools for creating interactive graphics in R, most notably Shiny, which is awesome. But these tools still require a bit of development to get right and are designed for “stand alone” tools.

So we created an R package that builds specific graphs that come up commonly in the analysis of health data like survival curves, heatmaps, and icon arrays. For example, here is how you make an interactive survival plot comparing treated to untreated individuals with healthvis:

# Load libraries

library(survival)
library(healthvis)

# Run a cox proportional hazards regression

cobj <- coxph(Surv(time, status)~trt+age+celltype+prior, data=veteran)

# Plot using healthvis - one line!

survivalVis(cobj, data=veteran, plot.title="Veteran Survival Data", group="trt", group.names=c("Treatment", "No Treatment"), line.col=c("#E495A5","#39BEB1"))

The “survivalVis” command above produces an interactive graphic like this. Here it is embedded (you may have to scroll to see the dropdowns on the right - we are working on resizing).

The advantage of this approach is that you can make common graphics interactive without a lot of development time. Here are some other unique features:

  • The graphics are hosted on Google App Engine. With one click you can get a permanent link and share it with collaborators.

  • With another click you can get the code to embed the graphics in your website.

  • If you have already created D3 graphics it only takes a few minutes to develop a healthvis version to let R users create their own - email us and we will make it part of the healthvis package!

  • healthvis is totally general - you can develop graphics that don’t have anything to do with health with our framework. Just email us at healthvis@gmail.com if you want to become a developer.

We have started a blog over at healthvis.org where we will talk about the tricks we learn while developing D3 graphics, updates to the healthvis package, and visualization for new technologies like those developed by the CCNE and individualized health. If you are interested in getting involved as a developer, user, or tester, drop us a line and let us know. In the meantime, happy visualizing!

* This project is supported by the JHU CCNE (U54CA151838) and the Johns Hopkins inHealth initiative.


Sunday data/statistics link roundup (1/27/2013)

  1. Wisconsin is decoupling the education and degree granting components of education. This means if you take a MOOC like mine, Brian's or Roger's and there is an equivalent class to pass at Wisconsin, you can take the exam and get credit. This is big. (via Rafa)
  2. This is a really cool MLB visualization done with d3.js and Crossfilter. It was also prototyped in R, which makes it even cooler. (via Rafa via Chris V.)
  3. Harvard is encouraging their professors to only publish in open access journals and to resign from closed access journals. This is another major change and bodes well for the future of open science (again via Rafa - noticing a theme this week?).
  4. This deserves a post all to itself, but Greece is prosecuting a statistician for analyzing data in a way that changed their deficit figure. I wonder what the folks at the International Year of Statistics think about that? (via Alex N.)
  5. Be on the twitters at 10:30AM Tuesday and follow the hashtag #jhsph753 if you want to hear all the crazy stuff I tell my students when I'm running on no sleep.
  6. Thomas at StatsChat is fed up with Nobel correlations. Although I'm still partial to the length of country name association.

Sunday Data/Statistics Link Roundup (9/9/12)

  1. Not necessarily statistics related, but pretty appropriate now that the school year is starting. Here is a little introduction to “how to google” (via Andrew J.). Being able to “just google it” and find answers for oneself without having to resort to asking folks is maybe the #1 most useful skill as a statistician. 
  2. A really nice presentation on interactive graphics with the googleVis package. I think one of the most interesting things about the presentation is that it was built with markdown/knitr/slidy (see slide 53). I am seeing more and more of these web-based presentations. I like them for a lot of reasons (ability to incorporate interactive graphics, easy sharing, etc.), although it is still harder than building a PowerPoint. I also wonder, what happens when you are trying to present somewhere that doesn’t have a good internet connection?
  3. We talked a lot about the ENCODE project this week. We had an interview with Steven Salzberg, then Rafa followed it up with a discussion of top-down vs. bottom-up science. Tons of data from the ENCODE project is now available, there is even a virtual machine with all the software used in the main analysis of the data that was just published. But my favorite quote/tweet/comment this week came from Leonid K. about a flawed/over the top piece trying to make a little too much of the ENCODE discoveries: “that’s a clown post, bro”.
  4. Another breathless post from the Chronicle about how there are “dozens of plagiarism cases being reported on Coursera”. Given that tens of thousands of people are taking the course, it would be shocking if there wasn’t plagiarism, but my guess is it is about the same rate you see in in-person classes. I will be using peer grading in my course, hopefully plagiarism software will be in place by then. 
  5. A New York Times article about a new book on visualizing data for scientists/engineers. I love all the attention data visualization is getting. I’ll take a look at the book for sure. I bet it says a lot of the same things Tufte said and a lot of the things Nathan Yau says in his book. This one may just be targeted at scientists/engineers. (link via Dan S.)
  6. Edo and co. are putting together a workshop on the analysis of social network data for NIPS in December. If you do this kind of stuff, it should be a pretty awesome crowd, so get your paper in by the Oct. 15th deadline!

Sunday Data/Statistics Link Roundup (9/2/2012)

  1. Just got back from IBC 2012 in Kobe, Japan. I was in an awesome session (organized by the inimitable Lieven Clement) with great talks by Matt McCall, Djork-Arne Clevert, Adetayo Kasim, and Willem Talloen. Willem’s talk nicely tied in our work and how it plays into the pharmaceutical development process and the bigger theme of big data. On the way home through SFO I saw this hanging in the airport. A fitting welcome back to the states. Although, as we talked about in our first podcast, I wonder how long the Big Data hype will last…
  2. Simina B. sent this link along for a masters program in analytics at NC State. Interesting because it looks a lot like a masters in statistics program, but with a heavier emphasis on data collection/data management. I wonder what role the stat department down there is playing in this program and if we will see more like it pop up? Or if programs like this with more data management will be run by stats departments other places. Maybe our friends down in Raleigh have some thoughts for us. 
  3. If one set of weekly links isn’t enough to fill your procrastination quota, go check out NextGenSeek’s weekly stories. A bit genomics focused, but lots of cool data/statistics links in there too. Love the “extreme Venn diagrams”. 
  4. This seems almost like the fast statistics journal I proposed earlier. Can’t seem to access the first issue/editorial board either. Doesn’t look like it is open access, so it’s still not perfect. But I love the sentiment of fast/single round review. We can do better though. I think Yihui X. has some really interesting ideas on how. 
  5. My wife taught for a year at Grinnell in Iowa and loved it there. They just released this cool data set with a bunch of information about the college. If all colleges did this, we could really dig in and learn a lot about the American higher education system (link via Hilary M.). 
  6. From the way-back machine, a rant from Rafa about meetings. Stay tuned this week for some Simply Statistics data about our first year on the series of tubes.

How do I know if my figure is too complicated?

One of the key things every statistician needs to learn is how to create informative figures and graphs. Sometimes it is easy to use off-the-shelf plots like barplots, histograms, or, if one is truly desperate, a pie chart.

But sometimes the information you are trying to communicate requires the development of a new graphic. I am currently working on a project with a graduate student where the standard illustrations are Venn diagrams - including complicated Venn diagrams with 5 or 10 circles. 

As we were thinking about different ways of illustrating our data, I started thinking about what the key qualities of a graphic are and how I would know if one is too complicated. I realized that:

  1. Ideally, just by looking at the graphic, one can intuitively understand what is going on, but sometimes for more technical/involved displays this isn’t possible.
  2. Alternatively, I think a good plot should be explainable in 2 sentences or less. I think that is true for pretty much every plot I use regularly. 
  3. That isn’t including describing what different colors/sizes/shapes specifically represent in any particular version of the graphic. 

I feel like there is probably something to this in the Grammar of Graphics or in some of William Cleveland’s work. But this is one of the first times I’ve come up with a case where a new, generalizable, type of graph needs to be developed. 


Sunday data/statistics link roundup (4/22)

  1. Now we know who is to blame for the pie chart. I had no idea it had been around, straining our ability to compare relative areas, since 1801. However, the same guy (William Playfair) apparently also invented the bar chart. So he wouldn’t be totally shunned by statisticians. (via Leonid K.)
  2. A nice article in the Guardian about the current group of scientists that are boycotting Elsevier. I have to agree with the quote that leads the article, “All professions are conspiracies against the laity.” On the other hand, I agree with Rafa that academics are partially to blame for buying into the closed access hegemony. I think more than a boycott of a single publisher is needed; we need a change in culture. (first link also via Leonid K)
  3. A blog post on how to add a transparent image layer to a plot. For some reason, I have wanted to do this several times over the last couple of weeks, so the serendipity of seeing it on R Bloggers merited a mention. 
  4. I agree the Earth Institute needs a better graphics advisor. (via Andrew G.)
  5. A great article on why multiple choice tests are used - they are an easy way to collect data on education. But that doesn’t mean they are the right data. This reminds me of the Tukey quote: “The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data”. It seems to me if you wanted to have a major positive impact on education right now, the best way would be to develop a new experimental design that collects the kind of data that really demonstrates mastery of reading/math/critical thinking. 
  6. Finally, a bit of a bleg…what is the best way to do the SVD of a huge (think 1e6 x 1e6), sparse matrix in R? Preferably without loading the whole thing into memory…
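One answer the post doesn’t give (so take this as a suggestion, not the authors’ solution): the irlba package computes a truncated SVD of a sparse Matrix object via implicitly restarted Lanczos bidiagonalization, so the matrix never has to be densified or fully loaded as a dense object. A minimal sketch on a small stand-in matrix:

```r
# Truncated SVD of a sparse matrix without forming the dense version.
# The 1000 x 1000 matrix here is a small stand-in for the 1e6 x 1e6 case.
library(Matrix)
library(irlba)

set.seed(1)
A <- rsparsematrix(1000, 1000, density = 0.001)  # sparse dgCMatrix

# Compute only the top 5 singular triplets
s <- irlba(A, nv = 5)
s$d  # the five largest approximate singular values
```

For a truly huge matrix you would also want it stored out of memory (e.g., in a file-backed format), but the key point is that Lanczos-type methods only need matrix-vector products, which sparse formats provide cheaply.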

Sunday data/statistics link roundup (4/15)

  1. Incredibly cool, dynamic real-time maps of wind patterns in the United States. (Via Flowing Data)
  2. A d3.js coding tool that updates the output automatically as you edit the code - real-time coding. This is going to be really useful for beginners trying to learn about D3. (Via Flowing Data)
  3. An interesting blog post describing why the winning algorithm in the Netflix prize hasn’t actually been implemented! It looks like it was too much of an engineering hassle. I wonder if this will make others think twice before offering big sums for prizes like this. Unless the real value is advertising…(via Chris V.)
  4. An article about a group at USC that plans to collect all the information from apps that measure heart beats. Their project is called everyheartbeat. I think this is a little bit premature, given the technology, but certainly the quantified self field is heating up. I wonder how long until the target audience for these sorts of projects isn’t just wealthy young technophiles? 
  5. A really good deconstruction of a recent paper suggesting that the mood on Twitter could be used to game the stock market. The author illustrates several major statistical flaws, including not correcting for multiple testing, an implausible statistical model, and not using a big enough training set. The scary thing is apparently a hedge fund is teaming up with this group of academics to try to implement their approach. I wouldn’t put my money anywhere they can get their hands on it. This is just one more in the accelerating line of results that illustrate the critical need for statistical literacy both among scientists and in the general public.

R and the little data scientist's predicament

I just read this fascinating post on _why, apparently a bit of a cult hero among enthusiasts of the Ruby programming language. One of the most interesting bits was The Little Coder’s Predicament, which, boiled down, essentially says that computer programming languages have grown too complex - so children/newbies can’t get instant gratification when they start programming. He suggested a simplified “gateway language” that would get kids fired up about programming, because with a simple line of code or two they could make the computer do things like play some music or make a video. 

I feel like there is a similar ramp up with data scientists. To be able to do anything cool/inspiring with data you need to know (a) a little statistics, (b) a little bit about a programming language, and (c) quite a bit about syntax. 

Wouldn’t it be cool if there was an R package that solved the little data scientist’s predicament? The package would have to have at least some of these properties:

  1. It would have to be easy to load data sets: one line of uncomplicated code. You could write an interface for RCurl/read.table/download.file for a defined set of APIs/data sets so the command would be something like: load(“education-data”) and it would load a bunch of data on education. It would handle all the messiness of scraping the web, formatting data, etc. in the background. 
  2. It would have to have a lot of really easy visualization functions. Right now, if you want to make pretty plots with ggplot(), plot(), etc. in R, you need to know all the syntax for pch, cex, col, etc. The plotting function should handle all this behind the scenes and make super pretty pictures. 
  3. It would be awesome if the functions would include some sort of dynamic graphics (with svgAnnotation or a wrapper for D3.js). Again, the syntax would have to be really accessible/not too much to learn. 

That alone would be a huge start. In just 2 lines kids could load and visualize cool data in a pretty way they could show their parents/friends. 
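A hypothetical sketch of what such a package might look like. Everything here is invented for illustration - loadData, prettyPlot, and the placeholder URL are not a real API:

```r
# Hypothetical sketch of the "little data scientist" package.
# The function names and the data URL are made up for illustration.
loadData <- function(name) {
  # In the real package this would map a friendly name to a curated source,
  # then handle downloading, scraping, and cleaning behind the scenes.
  url <- switch(name,
                "education-data" = "https://example.org/education.csv",  # placeholder
                stop("unknown data set"))
  read.csv(url)
}

prettyPlot <- function(df) {
  # Pick sensible defaults so the user never touches pch/cex/col syntax.
  plot(df[[1]], df[[2]], pch = 19, col = "#39BEB1",
       xlab = names(df)[1], ylab = names(df)[2])
}

# The two-line experience the post imagines:
# df <- loadData("education-data")
# prettyPlot(df)
```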


Sunday data/statistics link roundup (3/4)

  1. A cool article on Github by the folks at Wired. I’m starting to think the fact that I’m not on Github is a serious dent in my nerd cred. 
  2. Datawrapper - a less intensive, but less flexible open source data visualization creator. I have seen a few of these types of services starting to pop up. I think that some statistics training should be mandatory before people use them. 
  3. An interesting blog post with the provocative title, “Why bother publishing in a journal?” The story he describes works best if you have a lot of people who are interested in reading what you put on the internet. 
  4. A post on stackexchange comparing the machine learning and statistics cultures. 
  5. Stackoverflow is a great place to look for R answers. It is the R mailing list, minus the flames…
  6. Roger’s posts on Beijing air pollution are worth another read if you missed them. Particularly this one, where he computes the cigarette equivalent of the air pollution levels. 

A wordcloud comparison of the 2011 and 2012 #SOTU

I wrote a quick (and very dirty) R script for creating a comparison cloud and a commonality cloud for President Obama’s 2011 and 2012 State of the Union speeches*. The cloud on the left shows words that have different frequencies between the two speeches and the cloud on the right shows the words in common between the two speeches. Here is a higher resolution version. 

The focus on jobs hasn’t changed much. But it is interesting how the 2012 speech seems to focus more on practical issues (tax, pay, manufacturing, oil) versus more emotional issues in 2011 (future, schools, laughter, success, dream). 

*The wordcloud R package does all the heavy lifting.
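The post doesn’t include the script itself, so here is a minimal sketch of the approach using the wordcloud package’s comparison.cloud and commonality.cloud functions (with tm to build the term matrix). The speech text below is a tiny stand-in, not the actual transcripts:

```r
# Sketch of a comparison cloud + commonality cloud for two speeches.
# The text vectors are stand-ins for the full 2011/2012 SOTU transcripts.
library(tm)
library(wordcloud)

speeches <- c("jobs future schools dream success laughter jobs",
              "jobs tax pay manufacturing oil jobs")

corpus <- Corpus(VectorSource(speeches))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Rows are words, columns are documents; both cloud functions expect this
tdm <- as.matrix(TermDocumentMatrix(corpus))
colnames(tdm) <- c("2011", "2012")

comparison.cloud(tdm, max.words = 100)   # words with differing frequencies
commonality.cloud(tdm, max.words = 100)  # words the speeches share
```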