Simply Statistics


What makes a good data scientist?

Apparently, New Year's Eve is not a popular day to come to the office as it seems I'm the only one here. No matter, it just means I can blast Mahler 3 (Bernstein, NY Phil, 1980s recording) louder than I normally would.

Today's post is inspired by this latest article in the NYT about big data. The article for the most part describes a conference that happened at MIT recently on the topic of big data. Towards the end of the article, it is noted that one of the participants (Rachel Schutt) was asked what makes a good data scientist.

Obviously, she replied, the requirements include computer science and math skills, but you also want someone who has a deep, wide-ranging curiosity, is innovative and is guided by experience as well as data.

“I don’t worship the machine,” she said.

I think I agree, but I would have put it a different way. Mostly, I think what makes a good data scientist is the same thing that makes you a good [insert field here] scientist. In other words, a good data scientist is a good scientist.


Sunday data/statistics link roundup (12/30/12)

  1. An interesting new app called 100plus, which looks like it uses public data to help determine how little decisions (walking more, one more glass of wine, etc.) lead to more or less health. Here's a post describing it on the blog. As far as I can tell, the app is still in beta, so only the folks who have a code can download it.
  2. Data on mass shootings from the Mother Jones investigation.
  3. A post by Hilary M. on "Getting Started with Data Science". I really like the suggestion of just picking a project and doing something, getting it out there. One thing I'd add to the list is that I would spend a little time learning about an area you are interested in. With all the free data out there, it is easy to just "do something", without putting in the requisite work to know why what you are doing is good/bad. So when you are doing something, make sure you take the time to "know something".
  4. An analysis of various measures of citation impact (also via Hilary M.). I'm not sure I follow the reasoning behind all of the analyses performed (seems a little like throwing everything at the problem and hoping something sticks) but one interesting point is how citation/usage are far apart from each other on the PCA plot. This is likely just because the measures cluster into two big categories, but it makes me wonder. Is it better to have a lot of people read your paper (broad impact) or cite your paper (deep impact)?
  5. An interesting conversation on Twitter about how big data does not mean you can ignore the scientific method. We have talked a little bit about this before, in terms of how one should motivate statistical projects.

Sunday data/statistics link roundup (12/23/12)

  1. A cool data visualization for blood glucose levels for diabetic individuals. This kind of interactive visualization can help people see where/when major health issues arise for chronic diseases. This was a class project by Jeff Heer's Stanford CS448B students Ben Rudolph and Reno Bowen (twitter @RenoBowen). Speaking of interactive visualizations, I also got this link from Patrick M. It looks like a way to build interactive graphics and my understanding is it is compatible with R data frames, worth checking out (plus, Dex is a good name).
  2. Here is an interesting review of Nate Silver's book. The interesting thing about the review is that it doesn't criticize the statistical content, but criticizes the belief that people only use data analysis for good. This is an interesting theme we've seen before. Gelman also reviews the review.
  3. It's a little late now, but this tool seems useful for folks who want to know whatdoineedonmyfinal?
  4. A list of the best open data releases of 2012. I particularly like the rat sightings in New York and the Baltimore fixed speed cameras (which I have a habit of running afoul of).
  5. A map of data scientists on Twitter.  Unfortunately, since we don't have "data scientist" in our Twitter description, Simply Statistics does not appear. I'm sure we would have been central....
  6. Here is an interesting paper where some investigators developed a technology that directly reads out a bar chart of the relevant quantities. They mention this means there is no need for statistical analysis. I wonder if the technology also reads out error bars.

The NIH peer review system is still the best at identifying innovative biomedical investigators

This recent Nature paper makes the controversial claim that the most innovative (interpreted as best) scientists are not being funded by NIH. Not surprisingly, it is getting a lot of attention in the popular media. The title and introduction make it sound like there is a pervasive problem biasing the funding enterprise against innovative scientists. To me this appears counterintuitive given how much innovation, relative to other funding agencies around the world, comes out of NIH-funded researchers (here is a recent example) and how many of the best biomedical investigators in the world elect to work for NIH-funded institutions. The authors use data to justify their conclusions, but I do not find the data very convincing.

First, the paper defines innovative/non-conformist scientists as those with a first/last/single author paper with 1000+ citations in the years 2002-2012. Obvious problems with this definition are already pointed out in the comments on the original paper, but for argument's sake I will accept it as a useful quantification. The key data point the authors use is that only 2/5 of people with a first/last/single author 1000+ citation paper are principal investigators on NIH grants. I would need to see the complete 2x2 table for people who actually applied for grants (1000+ citations or not x got NIH grant or not) to be convinced. The reported ratio is meaningful only if most people with 1000+ citation papers are applying for grants, but the authors don't report how many are retired, are still postdocs, went into industry, or are one-hit wonders. Given that the payline is about 8%-15%, the 40% number may actually imply that NIH is in fact funding innovative people at a high rate.
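To make the denominator issue concrete, here is a minimal sketch in Python. The counts and the fractions of authors who applied are made up for illustration and do not come from the Nature paper; only the 2/5 ratio is taken from the discussion above. The function name `funded_rate_among_applicants` is hypothetical.

```python
# Hypothetical illustration: none of these counts come from the Nature paper.
def funded_rate_among_applicants(n_authors, n_funded, frac_applied):
    """Funding rate among the authors who actually applied for an NIH grant."""
    n_applied = n_authors * frac_applied
    if n_funded > n_applied:
        raise ValueError("more funded authors than applicants")
    return n_funded / n_applied


n_authors = 1000  # assumed number of authors with a 1000+ citation paper
n_funded = 400    # 2/5 of them are NIH principal investigators

# The same 2/5 looks very different depending on how many authors applied.
for frac_applied in (1.0, 0.8, 0.6, 0.5):
    rate = funded_rate_among_applicants(n_authors, n_funded, frac_applied)
    print(f"{frac_applied:.0%} applied -> {rate:.0%} of applicants funded")
```

Without knowing the applied/did-not-apply margin of the 2x2 table, the 2/5 figure is only a lower bound on the funding rate among those who applied.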

The paper also implies that many of the undeserving funding recipients are connected individuals who serve on study sections. The evidence for this is that they are funded at a much higher rate than individuals with 1000+ citation papers. But as the authors themselves point out, study section members are often recruited from the subset of individuals who have NIH grants (it's a way to give back to NIH). This does not suggest bias in the process; it just suggests that if you recruit funded people to be on a panel, that panel will have a higher rate of funded people.
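The selection effect can be sketched with a toy simulation. Everything here is an assumption chosen for illustration (the population size, the 15% funding rate, the panel size, and the 90% share of panel members recruited from funded researchers), and `simulate_panel` is a hypothetical name. Funding is assigned completely at random, so by construction there is no bias in favor of panel members.

```python
import random


def simulate_panel(n_researchers=10_000, p_funded=0.15, panel_size=200,
                   frac_recruited_from_funded=0.9, seed=0):
    """Return (overall funded rate, funded rate on the panel)."""
    rng = random.Random(seed)
    # Funding is a coin flip for everyone: no one gets an advantage.
    funded = [rng.random() < p_funded for _ in range(n_researchers)]
    funded_ids = [i for i, f in enumerate(funded) if f]
    unfunded_ids = [i for i, f in enumerate(funded) if not f]

    # The panel is recruited mostly from people who already hold grants.
    n_from_funded = round(panel_size * frac_recruited_from_funded)
    panel = (rng.sample(funded_ids, n_from_funded)
             + rng.sample(unfunded_ids, panel_size - n_from_funded))

    overall_rate = sum(funded) / n_researchers
    panel_rate = sum(funded[i] for i in panel) / panel_size
    return overall_rate, panel_rate


overall, panel = simulate_panel()
print(f"funded rate overall: {overall:.1%}, on the panel: {panel:.1%}")
```

The panel's funded rate is far above the population rate purely because of how the panel was recruited, which is the point made above.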

NIH's peer review system is far from perfect but it somehow manages to produce the best biomedical research in the world. How does this happen? Well, I think it's because NIH is currently funding some of the most innovative biomedical researchers in the world. The current system can certainly improve, but perhaps we should focus on concrete proposals with hard evidence that they will actually make things better.

Disclaimers: I am a regular member of an NIH study section. I am PI on NIH grants. I am on several papers with more than 1000 citations.


Rafa interviewed about statistical genomics

He talks about the problems created by the rapid growth of data sizes in molecular biology, how genomics is hugely driven by data analysis/statistics, how Bioconductor is an example of bottom-up science, and how new data are going to lead to new modeling/statistical challenges. Simply Statistics gets a shout out, and he also gives an ode to boxplots. It's worth watching the whole thing...


The value of re-analysis

I just saw this really nice post over on John Cook's blog. He talks about how it is a valuable exercise to re-type code for examples you find in a book or on a blog. I completely agree that this is a good way to learn through osmosis, learn about debugging, and often pick up the reasons for particular coding tricks (this is how I learned about vectorized calculations in Matlab, by re-typing and running my advisor's code back in my youth).

In a more statistical version of this idea, Gary King has proposed reproducing the analysis in a published paper as a way to get a paper of your own. You can figure out the parts that a person did well and the parts that you would do differently, maybe finding enough insight to come up with your own new paper. But I think this level of replication actually involves two levels of thinking:

  1. Can you actually reproduce the code used to perform the analysis?
  2. Can you solve the "paper as puzzle" exercise proposed by Ethan Perlstein over at his site? Given the results in the paper, can you come up with the story?

Both of these things require a bit more "higher level thinking" than just re-running the analysis if you have the code. But I think even the seemingly "low-level" task of just retyping and running the code that is used to perform a data analysis can be very enlightening. The problem is that this code, in many cases, does not exist. But that is starting to change. If you check out Rpubs or RunMyCode or even the right parts of Figshare you can find data analyses you can run through and reproduce.

The only downside is that there is currently no measure of quality for these published analyses. It would be great if people could focus their time on re-typing only good data analyses, rather than ones picked at random. Or, as a guy once (almost) said, "Data analysis practice doesn't make perfect, perfect data analysis practice makes perfect."


Should the Cox Proportional Hazards model get the Nobel Prize in Medicine?

I'm not the first one to suggest that Biostatistics has been undervalued in the scientific community, and some of the shortcomings of epidemiology and biostatistics have been noted elsewhere. But this previous work focuses primarily on the contributions of statistics/biostatistics at the purely scientific level.

The Cox Proportional Hazards model is one of the most widely used statistical models in the analysis of data from clinical trials and other medical studies. The corresponding paper has been cited over 32,000 times, and that count dramatically underestimates the number of times the model has been used. It is one of "those methods" that doesn't even require a reference to the original methods paper anymore.
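For readers who have not seen the model, it is usually written as below. The notation is the standard textbook one and is an addition here, not something taken from the original post: h_0(t) is an unspecified baseline hazard, x is a vector of covariates, and beta is the vector of log hazard ratios.

```latex
h(t \mid x) = h_0(t) \exp\left(\beta^\top x\right)
```

The "proportional hazards" name comes from the fact that the ratio of hazards for two covariate values, exp(beta^T (x_1 - x_2)), does not depend on t.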

Many of the most influential medical studies, including major studies like the Women's Health Initiative, have used these methods to answer some of our most pressing medical questions. Despite the incredible impact of this statistical technique on the world of medicine and public health, it has not received the Nobel Prize. This isn't an aberration: statistical methods are not traditionally considered for Nobel Prizes in Medicine. The prizes primarily focus on biochemical, genetic, or public health discoveries.

In contrast, many economics Nobel Prizes have been awarded primarily for the discovery of a new statistical or mathematical concept. One example is the ARCH model. The Nobel Prize in Economics in 2003 was awarded to Robert Engle, the person who proposed the original ARCH model. The model has gone on to have a major impact on financial analysis, much like the Cox model has had a major impact on medicine.

So why aren't Nobel Prizes in medicine awarded to statisticians more often? Other methods such as ANOVA, P-values, etc. have also had an incredibly large impact on the way we measure and evaluate medical procedures. Maybe as medicine becomes increasingly driven by data, we will start to see more statisticians recognized for their incredible discoveries and the huge contributions they make to medical research and practice.



Sunday data/statistics link roundup (12/16/12)

  1. A directory of open access journals. Very cool idea to aggregate them. Here is a blog post from one of my favorite statistics bloggers about why open-access journals are so cool. Just like in a lot of other areas, open access journals can be thought of as an open data initiative.
  2. Here is a website that displays data on the relative wealth of neighborhoods, broken down by census tract. It's pretty fascinating to take a look and see what the income changes are, even in regions pretty close to each other.
  3. More citizen science goodness. Zooniverse has a new project where you can look through a bunch of pictures in the Serengeti and see if you can find animals.
  4. Nate Silver talking about his new book with Hal Varian. (via). I have skimmed the book and found that the parts about baseball/politics are awesome and the other parts seem a little light. But maybe that's just my pre-conceived bias? I'd love to hear what other people thought...

Computing for Data Analysis Returns

I'm happy to announce that my course Computing for Data Analysis will return to Coursera on January 2nd, 2013. While I had previously announced that the course would be presented again right here, it made more sense to do it again on Coursera where it is (still) free and the platform there is much richer. For those of you who missed it the last time around, this is your chance to take it and learn a little R.

I've gotten a number of emails from people who were interested in watching the videos for the course. If you just want to sit around and watch videos of me talking, I've created a set of four YouTube playlists based on the four weeks of the course:

The content in the YouTube playlists reflects the first iteration of the course and will not reflect any new material I add to the second iteration (at least not for a little while).

I encourage everyone who is interested to enroll in the course on Coursera because there you'll have the benefit of in-video quizzes and other forms of assessment and will be able to interact with all of the great students who are also enrolled in the class. Also, if you're interested in signing up for Jeff Leek's Data Analysis course (starts on January 22, 2013) and are not very familiar with R, I encourage you to check out Computing for Data Analysis first to get yourself up to speed.

I look forward to seeing you there!