Simply Statistics

14
May

Computational biologist blogger saves computer science department

People who read the news should be aware by now that we are in the midst of a big data era. The New York Times, for example, has been writing about this frequently. One of their most recent articles describes how UC Berkeley is getting $60 million for a new computer science center. Meanwhile, the administration at the University of Florida seems to be oblivious to all this, and about a month ago announced it was dropping its computer science department to save money. Blogger Steven Salzberg, a computational biologist known for his work in genomics, wrote a post titled “University of Florida eliminates Computer Science Department. At least they still have football” ridiculing UF for its decision. Here are my favorite quotes:

in the midst of a technology revolution, with a shortage of engineers and computer scientists, UF decides to cut computer science completely?

Computer scientist Carl de Boor, a member of the National Academy of Sciences and winner of the 2003 National Medal of Science, asked the UF president “What were you thinking?”

Well, his post went viral and days later UF reversed its decision! So my point is this: statistics departments, be nice to bloggers who work in genomics… one of them might save your butt some day.

Disclaimer: Steven Salzberg has a joint appointment in my department and we have joint lab meetings.

13
May

Sunday data/statistics link roundup (5/13)

  1. Patenting statistical sampling? I’m pretty sure the Supreme Court, which threw out the Mayo patent, wouldn’t have much trouble tossing this patent either. The properties of sampling are a “law of nature”, right? Via Leonid K.
  2. This video has me all fired up. It’s called 23 1/2 Hours and talks about how the best preventative health measure is getting 30 minutes of exercise - just walking - every day. He shows how in some cases this beats much more high-tech interventions. My favorite part of this video is how he uses a ton of statistical/epidemiological terms like “effect sizes”, “meta-analysis”, “longitudinal study”, “attributable fractions”, but makes them understandable to a broad audience. This is a great example of “statistics for good”.
  3. A very nice collection of 2-minute tutorials in R. This is a great way to teach the concepts, most of which don’t need more than 2 minutes, and it covers a lot of ground. One thing that drives me crazy is when I go into Rafa’s office with a hairy computational problem and he says, “Oh you didn’t know about function x?”. Of course this only happens after I’ve wasted an hour re-inventing the wheel. If more people put up 2-minute tutorials on all the cool tricks they know, we’d all be better off.
  4. A plot using ggplot2, developed by this week’s interviewee Hadley Wickham, appears in The Atlantic! Via David S.
  5. I’m refusing to buy into Apple’s hegemony, so I’m still running OS 10.5. I’m having trouble getting GitHub up and running. Anyone have this same problem/know a solution? I know, I know, I’m way behind the times on this…
11
May

Interview with Hadley Wickham - Developer of ggplot2

Hadley Wickham is the Dobelman Family Junior Chair of Statistics at Rice University. Prior to moving to Rice, he completed his Ph.D. in Statistics at Iowa State University. He is the developer of the wildly popular ggplot2 software for data visualization and a contributor to the GGobi project. He has developed a number of really useful R packages touching everything from data processing, to data modeling, to visualization.

Which term applies to you: data scientist, statistician, computer
scientist, or something else?

I’m an assistant professor of statistics, so I at least partly
associate with statistics :).  But the idea of data science really
resonates with me: I like the combination of tools from statistics and
computer science, data analysis and hacking, with the core goal of
developing a better understanding of data. Sometimes it seems like not
much statistics research is actually about gaining insight into data.

You have created/maintain several widely used R packages. Can you
describe the unique challenges to writing and maintaining packages
above and beyond developing the methods themselves?

I think there are two main challenges: turning ideas into code, and
documentation and community building.

Compared to other languages, the software development infrastructure
in R is weak, which sometimes makes it harder than necessary to turn
my ideas into code. Additionally, I get less and less time to do
software development, so I can’t afford to waste time recreating old
bugs, or releasing packages that don’t work. Recently, I’ve been
investing time in helping build better dev infrastructure: better
tools for documentation [roxygen2], unit testing [testthat], package development [devtools], and creating package websites [staticdocs]. Generally, I’ve
found unit tests to be a worthwhile investment: they ensure you never
accidentally recreate an old bug, and give you more confidence when
radically changing the implementation of a function.
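
To illustrate the point about unit tests guarding against old bugs, here is a minimal sketch. It is not from the interview and uses Python's standard `unittest` module rather than testthat. The function `rescale` and the "old bug" it once had (a division by zero on constant input) are hypothetical, chosen only to show the pattern of pinning a fixed bug with a test.

```python
import unittest


def rescale(values):
    """Linearly rescale numbers to the range [0, 1] (hypothetical example)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed old bug: constant input used to divide by zero.
        # The fix maps every value to 0.0 instead.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


class RescaleTests(unittest.TestCase):
    def test_typical_input(self):
        self.assertEqual(rescale([0, 5, 10]), [0.0, 0.5, 1.0])

    def test_constant_input_regression(self):
        # Pins the fixed bug so a later rewrite cannot silently bring it back.
        self.assertEqual(rescale([3, 3, 3]), [0.0, 0.0, 0.0])


# Run the tests explicitly so the script works when executed directly.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(RescaleTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

With the regression test in place, the body of `rescale` can be rewritten freely: if the constant-input case breaks again, the test run reports it immediately.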

Documenting code is hard work, and it’s certainly something I haven’t
mastered. But documentation is absolutely crucial if you want people
to use your work. I find the main challenge is putting yourself in the
mind of the new user: what do they need to know to use the package
effectively. This is really hard to do as a package author because
you’ve internalised both the motivating problem and many of the common
solutions.

Connected to documentation is building up a community around your
work. This is important to get feedback on your package, and can be
helpful for reducing the support burden. One of the things I’m most
proud of about ggplot2 is something that I’m barely responsible for:
the ggplot2 mailing list. There are now ggplot2 experts who answer far
more questions on the list than I do. I’ve also found github to be
great: there’s an increasing community of users proficient in both R
and git who produce pull requests that fix bugs and add new features.

The flip side of building a community is that as your work becomes
more popular you need to be more careful when releasing new versions.
The last major release of ggplot2 (0.9.0) broke over 40 (!!) CRAN
packages, and forced me to rethink my release process. Now I advertise
releases a month in advance, and run `R CMD check` on all downstream
dependencies (`devtools::revdep_check` in the development version), so
I can pick up potential problems and give other maintainers time to
fix any issues.

Do you feel that the academic culture has caught up with and supports
non-traditional academic contributions (e.g. R packages instead of
papers)?

It’s hard to tell. I think it’s getting better, but it’s still hard to
get recognition that software development is an intellectual activity
in the same way that developing a new mathematical theorem is. I try
to hedge my bets by publishing papers to accompany my major packages.
I’ve also found the peer-review process very useful for improving the
quality of my software. Reviewers from both the R Journal and the
Journal of Statistical Software have provided excellent suggestions
for enhancements to my code.

You have given presentations at several start-up and tech companies.
Do the corporate users of your software have different interests than
the academic users?

By and large, no. Everyone, regardless of domain, is struggling to
understand ever larger datasets. Across both industry and academia,
practitioners are worried about reproducible research and thinking
about how to apply the principles of software engineering to data
analysis.

You gave one of my favorite presentations called Tidy Data/Tidy Tools
at the NYC Open Statistical Computing Meetup. What are the key
elements of tidy data that all applied statisticians should know?

Thanks! Basically, make sure you store your data in a consistent
format, and pick (or develop) tools that work with that data format.
The more time you spend munging data in the middle of an analysis, the
less time you have to discover interesting things in your data. I’ve
tried to develop a consistent philosophy of data that means when you
use my packages (particularly plyr and ggplot2), you can focus on the
data analysis, not on the details of the data format. The principles
of tidy data that I adhere to are that every column should be a
variable, every row an observation, and different types of data should
live in different data frames. (If you’re familiar with database
normalisation, this should sound pretty familiar!) I expound these
principles in depth in my in-progress [paper on the
topic].
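
As a rough illustration of "every column a variable, every row an observation", the sketch below reshapes a small untidy table into a tidy one. It is not from the interview: it uses plain Python rather than R or plyr, the data are made up, and the helper `to_tidy` is a hypothetical name introduced only for this example.

```python
# Untidy: one row per person, with the values of a variable (the year)
# stored in the column names. All data here are made up.
wide = [
    {"name": "alice", "2010": 5, "2011": 7},
    {"name": "bob", "2010": 3, "2011": 4},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Return one dict per observation, with keys: id, variable, value."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the id is repeated on every output row instead
            tidy_rows.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: value,
            })
    return tidy_rows


tidy = to_tidy(wide, id_column="name", variable_name="year", value_name="count")
for observation in tidy:
    print(observation)
```

In the tidy form, `name`, `year` and `count` are each a column and each row is a single observation, so grouping, modelling or plotting by year needs no special handling of column names.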

How do you decide what project to work on next? Is your work inspired
by a particular application or more general problems you are trying to
tackle?

Very broadly, I’m interested in the whole process of data analysis:
the process that takes raw data and converts it into understanding,
knowledge and insight. I’ve identified three families of tools
(manipulation, modelling and visualisation) that are used in every
data analysis, and I’m interested both in developing better individual
tools, but also smoothing the transition between them. In every good
data analysis, you must iterate multiple times between manipulation,
modelling and visualisation, and anything you can do to make that
iteration faster yields qualitative improvements to the final analysis
(that was one of the driving reasons I’ve been working on tidy data).

Another factor that motivates a lot of my work is teaching. I hate
having to teach a topic that’s just a collection of special cases,
with no underlying theme or theory. That drive led to [stringr] (for
string manipulation) and [lubridate] (with Garrett Grolemund, for working
with dates). I recently released the [httr] package, which aims to do a similar thing for HTTP requests. I think this is particularly important as more and more data starts living on the web and must be accessed through an API.

What do you see as the biggest open challenges in data visualization
right now? Do you see interactive graphics becoming more commonplace?

I think one of the biggest challenges for data visualisation is just
communicating what we know about good graphics. The first article
decrying 3d bar charts was published in 1951! Many plots still use
rainbow scales or red-green colour contrasts, even though we’ve known
for decades that those are bad. How can we ensure that people
producing graphics know enough to do a good job, without making them
read hundreds of papers? It’s a really hard problem.

Another big challenge is balancing the tension between exploration and
presentation. For exploratory graphics, you want to spend five seconds
(or less) to create a plot that helps you understand the data, while you might spend
five hours on a plot that’s persuasive to an audience who
isn’t as intimately familiar with the data as you. To date, we have
great interactive graphics solutions at either end of the spectrum
(e.g. ggobi/iplots/manet vs d3) but not much that transitions from one
end of the spectrum to the other. This summer I’ll be spending some
time thinking about what ggplot2 + [d3] might
equal, and how we can design something like an interactive grammar of
graphics that lets you explore data in R, while making it easy to
publish interactive presentation graphics on the web.

10
May

What are the products of data analysis?

Thanks to everyone for the feedback on my post on knowing when someone is good at data analysis. A couple of people suggested I take a look here for a few people who have proven they’re good at data analysis. I think that’s a great idea and a good place to start.

But I also think that while demonstrating an ability to build good prediction models is impressive and definitely shows an understanding of the data, not all important problems can be easily posed as prediction problems. Most of my work does not involve prediction at all and the problems I face (e.g., estimating very small effects in the presence of large unmeasured confounding factors) would be difficult to formulate as a prediction challenge (at least, I can’t think of an easy way). In fact, part of my and my colleagues’ research involves showing how statistical methods designed for prediction problems can fail miserably when applied to non-prediction settings.

The general question I have is what is a useful product that you can produce from a data analysis that demonstrates the quality of that analysis? So, a very small mean squared error from a prediction model would be one product (especially if it were smaller than everyone else’s). Maybe a cool graph with a story behind it? 

If I were hiring a musician for an orchestra, I wouldn’t have to meet that person to have strong evidence that he/she was good. I could just listen to some recordings of that person playing and that would be a pretty good predictor of how that person would perform in the orchestra. In fact, some major orchestras do completely blind auditions so that although the person is present in the room, all you hear is the sound of the playing.

What seems to be true, with music at least, is that even though the final performance doesn’t specifically reveal the important decisions that were made along the way to craft the interpretation of the music, somehow one is still able to appreciate the fact that all those decisions were made and they benefitted the performance. To me, it seems unlikely that one could arrive at a sublime performance either by chance or by some route that didn’t involve talent and hard work. Maybe it could happen once, but to produce a great performance over and over requires more than just luck.

What products could you send to someone to convince them you were good at data analysis? I raise this question primarily because when I look around at the products that I make (research papers, software, books, blogs), even if they are very good, I don’t think they necessarily convey any useful information about my ability to analyze data.

What’s the data analysis equivalent of a musician’s performance?

07
May

How do you know if someone is great at data analysis?

Consider this exercise. Come up with a list of the top 5 people that you think are really good at data analysis.

There’s one catch: They have to be people that you’ve never met and have never had any sort of personal interaction with (e.g. email, chat, etc.). So basically people who have written papers/books you’ve read or have given talks you’ve seen or that you know through other publicly available information. Who comes to mind? It’s okay to include people who are no longer living.

The other day I was thinking about the people who I think are really good at data analysis and it occurred to me that they were all people I knew. So I started thinking about people that I don’t know (and there are many) but who are equally good at data analysis. This turned out to be much harder than I thought. And I’m sure it’s not because they don’t exist; it’s just because I think good data analysis chops are hard to evaluate from afar using the standard methods by which we evaluate people.

I think there are a few reasons. First, people who are great at data analysis are likely not publishing papers or being productive in a manner that I, an outsider, would be able to observe. If they’re at a pharmaceutical company working on a new drug or at some fancy new startup company, there’s no way I’m ever going to know about it unless I’m directly involved.

Another reason is that even for people who are well-known scientists or statisticians, the products they produce don’t really highlight the difficulties overcome in data analysis. For example, many good papers in the statistics literature will describe a new method with brief reference to the data that inspired the method’s development. In those cases, the data analysis usually appears obvious, as most things do after they’ve been done. Furthermore, papers usually exclude all the painful details about merging, cleaning, and inspecting the data as well as all the other things you tried that didn’t work. Papers in the substantive literature have a similar problem, which is that they focus on a scientific problem of interest and the analysis of the data is secondary.

As skills in data analysis become more important, it seems odd to me that we don’t have a great way to evaluate a person’s ability to do it as we do in other areas.

05
May

UCLA Data Fest 2012

The very, very cool UCLA Data Fest is going on as we speak. This is a statistical analysis marathon where teams of undergrads work through the night (and day) to address an important problem through data analysis. Last year they looked at crime data from the Los Angeles Police Department. I’m looking forward to seeing how this year goes.

Great work by Rob Gould and the Department of Statistics there.

04
May

New National Academy of Sciences Members

The National Academy of Sciences elected new members a few days ago. Among them are statistician Robert Tibshirani and sociologist Stephen Raudenbush. Obviously well-deserved!

(Thanks to Karl Broman.)