Simply Statistics


An instructor's thoughts on peer-review for data analysis in Coursera

I used peer review for the data analysis course I just finished. As I mentioned in the post-mortem podcast, I knew in advance that it was likely to be the most controversial component of the class. So it wasn't surprising that, based on feedback in the discussion boards and on this blog, the peer review process was by far the thing students were most concerned about.

But to evaluate complete data analysis projects at scale, there is no other economically feasible alternative. To give you an idea, I have our local students perform 3 data analyses in an 8-week term here at Johns Hopkins. There are generally 10-15 students in that class, and I estimate that I spend around an hour reading each analysis, digesting what was done, and writing up comments. That means I usually spend almost an entire weekend grading just 10-15 data analyses. If you extrapolate that out to the 5,000 or so people who turned in data analysis assignments, it is clearly not possible for me to do all the grading myself.

Another alternative would be to pay trained data analysts to grade all the assignments. Of course that would be expensive - you couldn't farm it out to the Mechanical Turk. Getting a better/more consistent grading scheme than peer review would mean hiring highly trained data analysts, and that would be very expensive. While Johns Hopkins has been incredibly supportive in terms of technical support and giving me the flexibility to pursue the class, it is definitely something I did on my own time and with a lot of my own resources. It isn't clear that it makes sense for Hopkins to pour huge resources into really high-quality grading. At the same time, I'm not sure Coursera, as a startup, could afford to do this for all of the classes where peer review is needed.

So I think that, at least for the moment, peer review is the best option for grading. This has big implications for the value of the Coursera statements of accomplishment in classes where peer review is necessary. I think it would benefit Coursera hugely to do some research on how to ensure/maintain quality in peer review (Coursera - if you are reading this and you have some $$ you want to send my way to support some students/postdocs, I have some ideas on how to do that). The good news is that the amazing Coursera platform collects so much data that it is possible to do that kind of research.



Podcast #6: Data Analysis MOOC Post-mortem

Jeff and I talk about Jeff's recently completed MOOC on Data Analysis.


Sunday data/statistics link roundup (3/24/2013)

  1. My Coursera Data Analysis class is done for now! All the lecture notes are on Github and all the videos are on Youtube. They are tagged by week with tags "Week x".
  2. After ENAR the comments on how to have better stats conferences started flowing. Check out Frazee, Xie, and Broman. My favorite cherry-picked ideas: a conference app (Frazee), giving the poster session more focus (Frazee), free and announced wifi (Broman), more social media (I loved following ENAR on Twitter but wish there had been more tweeting) (Xie), and adding some jokes to talks (Xie).
  3. A related post is this one from Hilary M. on how a talk should entertain, not teach.
  4. This is a fascinating interview I found via AL Daily. My favorite lines? "You run into this attitude, that if ordinary people cannot set their Facebook privacy settings, then they deserve what is coming to them. There is a hacker superiority complex to this." I think this is certainly something we have a lot of in statistics as well.
  5. The CIA wants to collect all the dataz. Call me when cat videos become important for national security, ok guys?
  6. Given that I just completed my class, the MOOC completion rates graph is pretty appropriate. I think my numbers are right in line with what other people report. I'm still trying to figure out how to know how many people "completed" the class.

Youtube should check its checksums

I am in the process of uploading the video lectures for Data Analysis. I am getting ready to send out the course wrap-up email and I wanted to include the link to the Youtube playlist as well.

Unfortunately, Youtube keeps reporting that a pair of the videos in week 2 are duplicates. This is true despite the two videos being different lengths (12:15 vs. 16:58), having different titles, and having dramatically different content. I found this on the forums:

YouTube uses a checksum to determine duplicates. The chances of having two different files containing different content but have the same checksum would be astronomical.

That isn't on the official Google documentation page, which is pretty sparse, but it is the only description I can find of how Youtube checks for duplicate content. A checksum is a function applied to the data from a video that, ideally, yields different values with high probability for different videos and the same value when the same video is uploaded twice. One possible checksum function would be the length of the video. Obviously that won't work in general, because many videos might be exactly 2 minutes long.
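To make the idea concrete, here is a minimal sketch in Python of a content checksum, hashing a file's raw bytes with MD5 from the standard library. YouTube's actual scheme is not public, so this is purely illustrative of what a checksum does:

```python
import hashlib

def video_checksum(path):
    """Compute an MD5 digest of a file's raw bytes.

    Different content yields different digests with overwhelming
    probability; identical content always yields the same digest.
    """
    h = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MB chunks so large video files need not fit in memory
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```

A byte-level checksum like this would easily tell apart two lectures with different lengths and content, which makes the duplicate flags all the more puzzling.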

Regardless, it looks like Youtube can't distinguish my lecture videos. I'm thinking Vimeo or something else if I can't get this figured out. Of course, if someone has a suggestion (short of re-exporting the videos from Camtasia) that would allow me to circumvent this problem I'd love to hear it!

Update: I ended up fiddling with the videos and got them to upload. Thanks to the helpful comments!



Call for papers for a special issue of Statistical Analysis and Data Mining

David Madigan sends the following. It looks like a really interesting place to submit papers for both statisticians and data scientists, so submit away!

Statistical Analysis and Data Mining, An American Statistical Association Journal

Call for Papers
Special Issue on Observational Healthcare Data
Guest Editors: Patrick Ryan, J&J and Marc Suchard, UCLA
Due date: July 1, 2013
Data sciences is the rapidly evolving field that integrates mathematical and statistical knowledge, software engineering and large-scale data management skills, and domain expertise to tackle difficult problems that typically cannot be solved by any one discipline alone. Some of the most difficult, and arguably most important, problems exist in healthcare. Knowledge about human biology has advanced exponentially in the past two decades, with exciting progress in genetics, biophysics, and pharmacology. However, substantial opportunities exist to extend the evidence base about human disease, patient health, and the effects of medical interventions, and to translate knowledge into actions that can directly impact clinical care. The emerging availability of 'big data' in healthcare, ranging from prospective research with aggregated genomics and clinical trials to observational data from administrative claims and electronic health records through social media, offers unprecedented opportunities for data scientists to contribute to advancing healthcare through the development, evaluation, and application of novel analytical solutions that explore these data to generate evidence at both the patient and population level. Statistical and computational challenges abound, and methodological progress will draw on fields such as data mining, epidemiology, medical informatics, and biostatistics, to name but a few.

This special issue of Statistical Analysis and Data Mining seeks to capture the current state of the art in healthcare data sciences. We welcome contributions that focus on methodology for healthcare data and original research that demonstrates the application of data sciences to problems in public health.

Sunday data/statistics link roundup (3/17/13)

  1. A post on the Revolutions blog about an analysis of the worldwide email traffic patterns. The corresponding paper is also pretty interesting. The best part is the whole analysis was done in R. 
  2. A bill in California that would require faculty-approved online classes to be given credit. I think this is potentially game changing if it passes - depending on who has to do the approving. If there is local control within departments it could be huge. On the other hand, as I'll discuss later this week, there is still some ground to be made up before I think MOOCs are ready for prime time credit in areas outside of the very basics.
  3. A pretty amazing blog post about a survival analysis of RuPaul's drag race. Via Hadley.
  4. If you are a statistician hiding under a rock you missed the NY Times messing up P-values.  The statistical blogosphere came out swinging. Gelman, Wasserman, Parker, etc.
  5. As a statistician who is pretty fired up about the tech community, I can get lost a bit in the hype as much as the next guy. I thought this article was pretty sobering. I think the way to make sure we keep innovating is having the will to fund long term companies and long term research. Look at how it paid off with Amazon...
  6. Broman on interactive graphics is worth a read. I agree that more of our graphics should be interactive, but there is an inherent tension/tradeoff in graphics, similar to the bias-variance tradeoff. I'm sure there is a real term for it, but I'd call it the flexibility vs. understandability tradeoff. Too much interaction and it's hard to see what is going on; not enough and you might as well have made a static graph.

Postdoctoral fellow position in reproducible research

We are looking to recruit a postdoctoral fellow to work on developing tools to make scientific research more easily reproducible. We're looking for someone who wants to work on (and solve!) real research problems in the biomedical sciences and address the growing need for reproducible research tools. The position would be in the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health and would be jointly advised by Jeff and myself.

Qualifications: PhD in statistics, biostatistics, computer science, or related field; strong programming skills in R and Perl/Python/C; excellent written and oral communication skills; serious moxie

Additional Information: Informal questions about the position can be sent to Dr. Roger Peng at rpeng @ Applications will be considered as they arrive.

To apply, send a cover letter describing your research interests and interest in the position, a CV, and the names of three references. In your application, please reference "Reproducible Research postdoctoral fellowship". Application materials should be emailed to Dr. Roger Peng at rpeng @

Applications from minority and female candidates are especially encouraged. Johns Hopkins University is an AA/EOE.


Here's my #ENAR2013 Wednesday schedule

Here are my picks for ENAR sessions today (Wednesday):

  • 8:30-10:15am: Large Data Visualization and Exploration, Grand Ballroom 4 (make sure you stay till the end to see Karl Broman); Innovative Methods in Causal Inference with Applications to Mediation, Neuroimaging, and Infectious Diseases, Grand Ballroom 8A; Next Generation Sequencing, Grand Ballroom 5
  • 10:30am-12:15pm: Statistical Information Integration of -Omics Data, Grand Ballrooms 1 & 2

Okay, so this schedule actually requires me to split myself into three separate entities. However, if you find a way to do that, the 8:30-10:15am block is full of good stuff.

Have fun!


If I were at #ENAR2013 today, here's where I'd go

This week is the annual ENAR meeting, the big biostatistics conference, in Orlando, Florida. It actually started on Sunday, but I hadn't gotten around to looking at the program until now (obviously, I'm not there right now). Flipping through the program, here's what looks good to me for Tuesday:

  • 8:30-10:15am: Functional Neuroimaging Decompositions, Grand Ballroom 3 
  • 10:30am-12:15pm: Hmm...I guess you should go to the Presidential Invited Address, Grand Ballroom 7
  • 1:45-3:30pm: JABES Showcase, Grand Ballroom 8A; Statistical Body Language: Analytical Methods for Wearable Computing, Grand Ballroom 4
  • 3:45-5:30pm: Big Data: Wearable Computing, Crowdsourcing, Space Telescopes, and Brain Imaging, Grand Ballroom 8A; Sample Size Planning for Clinical Development, Grand Ballroom 6

That's right, you can pack in two sessions on wearable computing today if you want. I'll post tomorrow for what looks good on Wednesday.


Sunday data/statistics link roundup (3/10/13)

  1. This is an outstanding follow-up analysis to our paper on the rate of false discoveries in the medical literature. I hope that the author of the blog post will consider submitting it for publication in a journal; I think it is worth having more methodology out there in this area. 
  2. If you are an academic in statistics and aren't following Karl and Thomas on Twitter, you should be. Also check out Karl's (mostly) reproducible paper.
  3. An article in the WSJ that I think I received about 40 times this week. The online version has a quote from our own B-Caffo. It is a really good read. If you are into this, the interviews with Rebecca Nugent (where we discuss growing undergrad programs) and Joe Blitzstein (where we discuss stats ed) seem relevant. I thought this quote was hugely relevant: "The bulk of the people coming out [with statistics degrees] are technically competent but they're missing the consultative and the soft skills, everything else they need to be successful." We are focusing heavily on both types of skills in the grad program here at Hopkins - so if people are looking for awesome data people, just let us know!
  4. A cool discussion of how the A's look for players with "positive residuals" - positive value missed by the evaluations of other teams. (via Rafa)
  5. The physicist and the bikini model. If you haven't read it, you must be living under a rock. (via Alex N.)
  6. An interesting article about how IBM is using Watson to come up with new recipes based on the data from old recipes. I'm a little suspicious of the Spanish crescent though - no butter?!
  7. You should vote for Steven Salzberg for the Ben Franklin award. The dude has come up huge for open software and we should come up huge for him. Gotta vote today though.
  8. The Harlem Shake has killed more than one of my lunch hours. But this one is the best. By far. How all simulation studies should be done (via StatsChat).