Tag: data visualization

Dec 23

Sunday data/statistics link roundup 12/23/12

  1. A cool data visualization of blood glucose levels for diabetic individuals. This kind of interactive visualization can help people see where and when major health issues arise for chronic diseases. It was a class project by Jeff Heer's Stanford CS448B students Ben Rudolph and Reno Bowen (Twitter @RenoBowen). Speaking of interactive visualizations, I also got this link from Patrick M. It looks like a way to build interactive graphics, and my understanding is that it works with R data frames, so it's worth checking out (plus, Dex is a good name).
  2. Here is an interesting review of Nate Silver's book. Notably, the review doesn't criticize the statistical content; it criticizes the belief that people only use data analysis for good. This is a theme we've seen before. Gelman also reviews the review.
  3. It's a little late now, but this tool seems useful for folks who want to know whatdoineedonmyfinal?
  4. A list of the best open data releases of 2012. I particularly like the rat sightings in New York and the Baltimore fixed speed cameras (which I have a habit of running afoul of).
  5. A map of data scientists on Twitter. Unfortunately, since we don't have "data scientist" in our Twitter description, Simply Statistics does not appear. I'm sure we would have been central…
  6. Here is an interesting paper where some investigators developed a technology that directly reads out a bar chart of the relevant quantities. They mention this means there is no need for statistical analysis. I wonder if the technology also reads out error bars.
Dec 9

Sunday data/statistics link roundup (12/9/12)

  1. Some interesting data/data visualizations about working conditions in the apparel industry. Here is the full report. Whenever I see reports like this, I wish the raw data were more clearly linked. I want to be able to get in, play with the data, and see if I notice something that doesn't appear in the infographics. 
  2. This is an awesome plain-language discussion of how a bunch of methods (CS and Stats) with fancy names relate to each other. It shows that CS/Machine Learning/Stats are converging in many ways and there isn't much new under the sun. On the other hand, I think the really exciting thing here is to use these methods on new questions, once people drop the stick.
  3. If you are a reader of this blog and somehow do not read anything else on the internet, you will have missed Hadley Wickham's Rcpp tutorial. In my mind, this pretty much seals it: Julia isn't going to overtake R anytime soon. In other news, Hadley is coming to visit JHSPH Biostats this week! I'm psyched to meet him.
  4. For those of us that live in Baltimore, this interesting set of data visualizations lets you in on the crime hotspots. This is a much fancier/more thorough analysis than Rafa and I did way back when. 
  5. Check out the new easy stats tool from the Census (via Hilary M.) and read our interview with Tom Louis, who is heading over to the Census to do cool things.
  6. Watch out, some TEDx talks may be pseudoscience! More later this week on the politicization/glamourization of science, so stay tuned.
Nov 26

The statisticians at Fox News use classic and novel graphical techniques to lead with data

Depending on where you land on the political spectrum, you may either love or despise Fox News. But regardless of your political affiliation, you have to recognize that their statisticians are well trained in the art of using graphics to persuade folks of a particular viewpoint. I'm not the first to recognize that their graphics department uses some clever tricks to make certain points. But when flipping through the graphs, I thought it would be interesting to highlight some of the techniques they use to persuade. Some are clearly classics from the literature, but some are (as far as I can tell) newly developed graphical "persuasion" techniques.

Truncating the y-axis

[Two charts with truncated y-axes]

This is a pretty common technique for leading the question in statistical graphics, as discussed here and elsewhere.
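
To see the effect yourself, here is a minimal ggplot2 sketch with made-up numbers (not the data from the charts above): the same two bars drawn once with a truncated y-axis and once with the axis starting at zero.

```r
# Made-up numbers: the same comparison with and without a truncated y-axis
library(ggplot2)

d <- data.frame(group = c("Now", "Next year"), rate = c(35, 39.6))

# Truncated axis: a ~5 point difference looks enormous
ggplot(d, aes(group, rate)) +
  geom_bar(stat = "identity") +
  coord_cartesian(ylim = c(34, 42))

# Axis starting at zero: the same difference looks modest
ggplot(d, aes(group, rate)) +
  geom_bar(stat = "identity") +
  expand_limits(y = 0)
```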

Numbers that don't add up

I'm not sure whether this one is intentional, but it crops up in several places and is, I think, a novel approach to leading information; at least, I couldn't find a reference in the literature. Basically, the idea is to produce percentages that don't add up to 100%, allowing multiple choices to have closer percentages than they probably should:

[Chart whose percentages sum to well over 100%]

or to suggest that multiple options are all equally likely, but also supported by large percentages:

[Chart where several options all show similarly large percentages]
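
With hypothetical numbers in the spirit of the charts above, a quick R check makes the trick explicit: the displayed percentages sum to far more than 100, and rescaling them to a proper single-choice breakdown deflates each option.

```r
# Hypothetical displayed percentages (respondents could pick several options)
shown <- c(optionA = 70, optionB = 63, optionC = 60)
sum(shown)                           # 193 -- far more than 100
round(100 * shown / sum(shown), 1)   # 36.3 32.6 31.1 as a proper breakdown
```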

Changing the units of comparison

When two things are likely to be very similar, one approach to leading information is to present the variables in different units. Here is an example where total spending for 2010-2013 is compared to deficits in 2008 alone. This can also be viewed as an example of not labeling the axes.

[Chart comparing total spending for 2010-2013 to the 2008 deficit, with unlabeled axes]

Changing the magnitude of units at different x-values

Here is a plot where the change in magnitude between gridlines at high x-values is larger than the change between gridlines at lower x-values. Again, I think this is actually a novel graphical technique for leading readers in one direction.

[Chart with larger per-gridline increments at high x-values]

To really see the difference, compare it to the graph with a common change in magnitude at all x-values.

[The same chart with equal increments at all x-values]

Changing trends by sub-sampling x values (also misleading chart titles)

Here is a graph that shows unemployment rates over time, alongside the corresponding chart with the x-axis appropriately laid out.

[Unemployment-rate chart with unevenly sub-sampled x-axis, next to the corrected version]
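
For intuition, here is a sketch in ggplot2 with made-up monthly numbers: treating the sampled months as evenly spaced categories hides the gaps that a real date axis reveals.

```r
# Made-up monthly rates, sampled at uneven intervals
library(ggplot2)

d <- data.frame(
  date = as.Date(c("2011-03-01", "2011-04-01", "2011-08-01",
                   "2011-11-01", "2012-03-01")),
  rate = c(8.9, 9.0, 9.1, 8.6, 8.2)
)
d$label <- factor(format(d$date, "%b %Y"), levels = format(d$date, "%b %Y"))

# Misleading: months drawn as equally spaced categories
ggplot(d, aes(label, rate, group = 1)) + geom_line() + xlab("month")

# Honest: a date axis shows the uneven sampling
ggplot(d, aes(date, rate)) + geom_line()
```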

One could argue these are mistakes, but given how consistently the displays support one viewpoint, I think they are likely the work of someone with real statistical training who is using data in a very specific way to make a point. Obviously, Fox News isn't the only organization that does this sort of thing, but it is interesting to see how much effort they put into statistical graphics.

Jul 1

Sunday data/statistics link roundup (7/1)

  1. A really nice explanation of the elements of Obamacare. Rafa’s post on the new inHealth initiative Scott is leading got a lot of comments on Reddit. Some of them are funny (Rafa’s spelling got rocked) and, if you get past the usual level of internet-commentary politeness, some of them are really relevant - especially the comments about generalizability and the economics of health care.
  2. From Andrew J., a cool visualization of the human genome: they are showing every base of the human genome over the course of a year, which works out to about 100 bases per second (a quick check of the arithmetic appears after this list). I think this is a great way to show how much information is in just one human genome. It also puts the sequencing data deluge in perspective. We are now sequencing thousands of these genomes a year and it’s only going to get faster.
  3. Cosma Shalizi has a nice list of unsolved problems in statistics on his blog (via Edo A.). These problems primarily fall into what I call Category 1 problems in my post on motivating statistical projects. I think he has some really nice insight, though, and some of these problems sound like a big deal if one were able to solve them.
  4. A really provocative talk on why consumers are the job creators. The question of who the job creators are seems absolutely ripe for a thorough statistical analysis. There are a thousand confounders here, and my guess is that most of the work so far has been Category 2 - let’s use convenient data to take a stab at this. But a thorough and legitimate data analysis would be hugely impactful.
  5. Your eReader is collecting data about you.
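
A quick sanity check of the arithmetic in item 2, assuming roughly 3 billion bases in the human genome:

```r
# ~3 billion bases shown over one year
genome_size <- 3e9
seconds_per_year <- 365 * 24 * 60 * 60   # 31,536,000
genome_size / seconds_per_year           # ~95 bases per second
```
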
May 27

Sunday data/statistics link roundup (5/27)

  1. Amanda Cox on the process they went through to come up with this graphic about the Facebook IPO. So cool to see how R is used in the development process. A favorite quote of mine: “But rather than bringing clarity, it just sort of looked chaotic, even to the seasoned chart freaks of 620 8th Avenue.” One of the more interesting things about posts like this is that you get to see how statistics versus a deadline works. This is typically the role of the analyst, since they come in late and there is usually a deadline looming…
  2. An interview with Steve Blank about Silicon Valley and how venture capitalists (VCs) are focused on social technologies, since they can make a profit quickly. A depressing/fascinating quote from this one: “If I have a choice of investing in a blockbuster cancer drug that will pay me nothing for ten years, at best, whereas social media will go big in two years, what do you think I’m going to pick? If you’re a VC firm, you’re tossing out your life science division.” He goes on to say thank goodness for the NIH, NSF, and Google, who are funding interesting “real science” problems. This probably deserves its own post later in the week: the difference between analyzing data because it will make money and analyzing data to solve a hard science problem. The latter usually takes way more patience, and the data take much longer to collect.
  3. An interesting post on how Obama’s analytics department ran an A/B test that improved the number of people who signed up for his mailing list. I don’t necessarily agree with their claim that they helped raise $60 million; there may be some confounding factors, meaning the individuals who signed up with the best combination of image/button don’t necessarily donate as much. But it’s still an interesting look into why Obama needs statisticians (a sketch of this kind of test appears after this list).
  4. A cute statistics cartoon from @kristin_linn via Chris V. Yes, we are now shamelessly reposting cute cartoons for retweets :-).
  5. Rafa’s post inspired some interesting conversation both on our blog and on some statistics mailing lists. It seems to me that everyone is making an effort to understand the increasingly diverse field of statistics, but we still have a ways to go. I’m particularly interested in discussion of how we evaluate the contribution/effort behind making good and usable academic software. I think the strength of the Bioconductor community and the rise of GitHub among academics are a good start. For example, it is really useful that Bioconductor now tracks the number of package downloads.
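
As promised in item 3, here is what the simplest version of that A/B comparison looks like in R, with hypothetical counts (not the campaign’s actual numbers):

```r
# Hypothetical signup counts for two versions of the page
signups  <- c(A = 520, B = 610)
visitors <- c(A = 10000, B = 10000)

# Two-sample test for equality of signup proportions
prop.test(signups, visitors)
```
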
May 11

Interview with Hadley Wickham - Developer of ggplot2

Hadley Wickham is the Dobelman Family Junior Chair of Statistics at Rice University. Prior to moving to Rice, he completed his Ph.D. in statistics at Iowa State University. He is the developer of the wildly popular ggplot2 software for data visualization and a contributor to the GGobi project. He has developed a number of really useful R packages touching everything from data processing to data modeling to visualization.

Which term applies to you: data scientist, statistician, computer
scientist, or something else?

I’m an assistant professor of statistics, so I at least partly
associate with statistics :).  But the idea of data science really
resonates with me: I like the combination of tools from statistics and
computer science, data analysis and hacking, with the core goal of
developing a better understanding of data. Sometimes it seems like not
much statistics research is actually about gaining insight into data.

You have created/maintain several widely used R packages. Can you
describe the unique challenges to writing and maintaining packages
above and beyond developing the methods themselves?

I think there are two main challenges: turning ideas into code, and
documentation and community building.

Compared to other languages, the software development infrastructure
in R is weak, which sometimes makes it harder than necessary to turn
my ideas into code. Additionally, I get less and less time to do
software development, so I can’t afford to waste time recreating old
bugs or releasing packages that don’t work. Recently, I’ve been
investing time in helping build better dev infrastructure: better
tools for documentation [roxygen2], unit testing [testthat], package development [devtools], and creating package websites [staticdocs]. Generally, I’ve
found unit tests to be a worthwhile investment: they ensure you never
accidentally recreate an old bug, and give you more confidence when
radically changing the implementation of a function.
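
A minimal sketch of the kind of regression test he describes, using testthat with a hypothetical function (not from one of Hadley’s packages):

```r
library(testthat)

# A hypothetical function under test
slugify <- function(x) gsub("[^a-z0-9]+", "-", tolower(x))

# A regression test pinning down a previously fixed bug
test_that("slugify collapses runs of punctuation", {
  expect_equal(slugify("Tidy Data!!"), "tidy-data-")
  expect_equal(slugify("ggplot2"), "ggplot2")
})
```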

Documenting code is hard work, and it’s certainly something I haven’t
mastered. But documentation is absolutely crucial if you want people
to use your work. I find the main challenge is putting yourself in the
mind of the new user: what do they need to know to use the package
effectively? This is really hard to do as a package author because
you’ve internalised both the motivating problem and many of the common
solutions.

Connected to documentation is building up a community around your
work. This is important to get feedback on your package, and can be
helpful for reducing the support burden. One of the things I’m most
proud of about ggplot2 is something that I’m barely responsible for:
the ggplot2 mailing list. There are now ggplot2 experts who answer far
more questions on the list than I do. I’ve also found GitHub to be
great: there’s an increasing community of users proficient in both R
and git who produce pull requests that fix bugs and add new features.

The flip side of building a community is that as your work becomes
more popular you need to be more careful when releasing new versions.
The last major release of ggplot2 (0.9.0) broke over 40 (!!) CRAN
packages, and forced me to rethink my release process. Now I advertise
releases a month in advance, and run `R CMD check` on all downstream
dependencies (`devtools::revdep_check` in the development version), so
I can pick up potential problems and give other maintainers time to
fix any issues.
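
Roughly, that release workflow in code (the path is a placeholder, and `revdep_check` lived in the development version of devtools at the time, so the exact call may differ in your version):

```r
library(devtools)

# R CMD check on the package itself
check("path/to/ggplot2")

# R CMD check on every CRAN package that depends on it
revdep_check("path/to/ggplot2")
```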

Do you feel that the academic culture has caught up with and supports
non-traditional academic contributions (e.g. R packages instead of
papers)?

It’s hard to tell. I think it’s getting better, but it’s still hard to
get recognition that software development is an intellectual activity
in the same way that developing a new mathematical theorem is. I try
to hedge my bets by publishing papers to accompany my major packages:
I’ve also found the peer-review process very useful for improving the
quality of my software. Reviewers from both The R Journal and the
Journal of Statistical Software have provided excellent suggestions
for enhancements to my code.

You have given presentations at several start-up and tech companies.
Do the corporate users of your software have different interests than
the academic users?

By and large, no. Everyone, regardless of domain, is struggling to
understand ever larger datasets. Across both industry and academia,
practitioners are worried about reproducible research and thinking
about how to apply the principles of software engineering to data
analysis.

You gave one of my favorite presentations called Tidy Data/Tidy Tools
at the NYC Open Statistical Computing Meetup. What are the key
elements of tidy data that all applied statisticians should know?

Thanks! Basically, make sure you store your data in a consistent
format, and pick (or develop) tools that work with that data format.
The more time you spend munging data in the middle of an analysis, the
less time you have to discover interesting things in your data. I’ve
tried to develop a consistent philosophy of data that means when you
use my packages (particularly plyr and ggplot2), you can focus on the
data analysis, not on the details of the data format. The principles
of tidy data that I adhere to are that every column should be a
variable, every row an observation, and different types of data should
live in different data frames. (If you’re familiar with database
normalisation, this should sound pretty familiar!) I expound on these
principles in depth in my in-progress [paper on the topic].
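
A small made-up example of those principles, using melt from Hadley’s reshape2 package: the messy form hides a variable (year) in the column headers, while the tidy form gives each variable its own column.

```r
library(reshape2)

# Messy: the year variable is hidden in the column headers
messy <- data.frame(
  country = c("A", "B"),
  `2010`  = c(10, 20),
  `2011`  = c(12, 24),
  check.names = FALSE
)

# Tidy: one column per variable, one row per observation
tidy <- melt(messy, id.vars = "country",
             variable.name = "year", value.name = "cases")
tidy
#   country year cases
# 1       A 2010    10
# 2       B 2010    20
# 3       A 2011    12
# 4       B 2011    24
```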

How do you decide what project to work on next? Is your work inspired
by a particular application or more general problems you are trying to
tackle?

Very broadly, I’m interested in the whole process of data analysis:
the process that takes raw data and converts it into understanding,
knowledge and insight. I’ve identified three families of tools
(manipulation, modelling and visualisation) that are used in every
data analysis, and I’m interested both in developing better individual
tools and in smoothing the transitions between them. In every good
data analysis, you must iterate multiple times between manipulation,
modelling and visualisation, and anything you can do to make that
iteration faster yields qualitative improvements to the final analysis
(that was one of the driving reasons I’ve been working on tidy data).

Another factor that motivates a lot of my work is teaching. I hate
having to teach a topic that’s just a collection of special cases,
with no underlying theme or theory. That drive led to [stringr] (for
string manipulation) and [lubridate] (with Garrett Grolemund, for working
with dates). I recently released the [httr] package, which aims to do a similar thing for HTTP requests - I think this is particularly important as more and more data starts living on the web and must be accessed through an API.
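
A taste of the consistent APIs those packages aim for, on toy inputs:

```r
library(stringr)
library(lubridate)

str_detect("Simply Statistics", "Stat")   # TRUE
str_replace_all("a b  c", "\\s+", "_")    # "a_b_c"

d <- ymd("2012-05-11")                    # parse a date
wday(d, label = TRUE)                     # day of week (Fri)
```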

What do you see as the biggest open challenges in data visualization
right now? Do you see interactive graphics becoming more commonplace?

I think one of the biggest challenges for data visualisation is just
communicating what we know about good graphics. The first article
decrying 3d bar charts was published in 1951! Many plots still use
rainbow scales or red-green colour contrasts, even though we’ve known
for decades that those are bad. How can we ensure that people
producing graphics know enough to do a good job, without making them
read hundreds of papers? It’s a really hard problem.

Another big challenge is balancing the tension between exploration and
presentation. For exploratory graphics, you want to spend five seconds
(or less) creating a plot that helps you understand the data, while you might spend
five hours on a plot that’s persuasive to an audience who
isn’t as intimately familiar with the data as you are. To date, we have
great interactive graphics solutions at either end of the spectrum
(e.g. ggobi/iplots/manet vs d3) but not much that transitions from one
end of the spectrum to the other. This summer I’ll be spending some
time thinking about what ggplot2 + [d3] might
equal, and how we can design something like an interactive grammar of
graphics that lets you explore data in R, while making it easy to
publish interactive presentation graphics on the web.