Monday data/statistics link roundup (2/10/14)

I'm going to try Mondays for the links. Let me know what you think.

  1. The Guardian is reading our blog. A week after Rafa posts that everyone should learn to code for career preparedness, the Guardian gets on the bandwagon.
  2. Nature Methods published a paper on a webtool for creating boxplots (via Simina B.). The nerdrage rivaled the quilt plot. I'm not opposed to papers like this being published; in fact, it is an important part of making sure we don't miss out on the good software when it comes. There are two important things to keep in mind though: (a) Nature Methods grades on a heavy "innovative" curve which makes it pretty hard to publish papers there, so publishing papers like this could cause frustration among people who would submit there and (b) if you use boxplots made with this tool you must cite the relevant software that generated them.
  3. This story about Databall (via Rafa) is great. I love the way that it talks about statisticians as the leaders on a new data type. I also enjoyed reading the paper the story is about. One interesting thing about that paper and many of the papers at the Sloan Sports Conference is that the data are proprietary (via Chris V.) so the code/data/methods are actually not available for most papers (including this one). In the short term this isn't a big deal; the papers are fun to read. In the long term, it will dramatically slow progress. It is almost always a bad long term strategy to make data private if the goal is to maximize value.
  4. The P-value curve for fixing publication bias (via Rafa). I think it is an interesting idea, very similar to our approach for the science-wise false discovery rate. People who liked our paper will like the P-value curve paper. People who hated our paper for the uniformity under the null assumption will hate that paper for the same reason (via David S.)
  5. Hopkins discovers bones are the best (via Michael R.).
  6. Awesome scientific diagrams in tex. Some of these are ridiculous.
  7. Mary Carillo goes crazy on backyard badminton. This is awesome. If you love the Olympics and the internet, you will love this (via Hilary P.)
  8. B'more Biostats has been on a tear lately. I've been reading Leo's post on uploading files to Dropbox/Google drive from R, Mandy's post explaining quantitative MRI, Yenny's post on data sciences, John's post on graduate school open houses, and Alyssa's post on vectorization. If you like Simply Stats you should be following them on Twitter here.
Posted in Uncategorized | 1 Comment

Just a thought on peer reviewing - I can't help myself.

Today I was thinking about reviewing, probably because I was handling a couple of papers as AE and doing tasks associated with reviewing several other papers. I know that this is idle thinking, but suppose peer review was just a drop down ranking with these 6 questions.

  1. How close is this paper to your area of expertise?
  2. Does the paper appear to be technically right?
  3. Does the paper use appropriate statistics/computing?
  4. Is the paper interesting to people in your area?
  5. Is the paper interesting to a broad audience?
  6. Are the appropriate data and code available?

Each question would be rated on a 1-5 star scale: 1 star = completely inadequate, 3 stars = acceptable, 5 stars = excellent. There would be an optional comments box that would only be used for major/interesting thoughts, and anything that got above 3 stars for questions 2, 3, and 6 would be published. Incidentally, you could do this for free on Github if the papers were written in markdown, which would reduce the substantial costs of open-access publishing.
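
To make the publication rule concrete, here is a minimal sketch in R. The function name `should_publish` and the example review are hypothetical, and I am reading "above 3 stars" as strictly greater than 3, which is an assumption.

```r
# Hypothetical helper illustrating the rule; not part of any real review system.
# `stars` holds the six ratings in question order (1 to 5 stars each).
should_publish <- function(stars, required = c(2, 3, 6), threshold = 3) {
  stopifnot(length(stars) == 6, all(stars %in% 1:5))
  # Publish only if every required question is rated strictly above the threshold
  all(stars[required] > threshold)
}

# A made-up review: technically sound and reproducible, but of narrow interest
review <- c(expertise = 4, technically_right = 5, stats_computing = 4,
            interesting_area = 2, interesting_broad = 1, data_code = 4)

publish_review <- should_publish(review)
print(publish_review)
```

Note that the interest questions (4 and 5) do not enter the decision at all under this rule; they would only inform readers.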

No doubt peer review would happen faster this way. I was wondering, would it be any worse?




My Online Course Development Workflow

One of the nice things about developing 9 new courses for the JHU Data Science Specialization in a short period of time is that you get to learn all kinds of cool and interesting tools. One of the ways that we were able to push out so much content in just a few months was that we did most of the work ourselves, rather than outsourcing things like video production and editing. You could argue that this results in a poorer quality final product but (a) I disagree; and (b) even if that were true, I think the content is still valuable.

The advantage of learning all the tools was that it allowed for a quick turn-around from the creation of the lecture to the final exporting of the video (often within a single day). For a hectic schedule, it's nice to be able to write slides in the morning, record some video in between two meetings in the afternoon, and then combine/edit all the video in the evening. Then if you realize something doesn't work, you can start over the next day and have another version done in less than 24 hours.

I thought it might be helpful to someone out there to detail the workflow and tools that I use to develop the content for my online courses.

  • I use Camtasia for Mac to do all my screencasting/recording. This is a nice tool and I think has more features than your average screen recorder. That said, if you just want to record your screen on your Mac, you can actually use the built-in Quicktime software. I used to do all of my video editing in Camtasia but now it's pretty much glorified screencasting software for me.
  • For talking head type videos I use my iPhone 5S mounted on a tripod. The iPhone produces surprisingly good 1080p HD 30 fps video and is definitely sufficient for my purposes (see here for a much better example of what can be done). I attach the phone to an Apogee microphone to pick up better sound. For some of the interviews that we do on Simply Statistics I use two iPhones (a 5S and a 4S, my older phone).
  • To record my primary sound (i.e. me talking), I use the Zoom H4N portable recorder. This thing is not cheap but it records very high-quality stereo sound. I can connect it to my computer via USB or it can record to a SD card.
  • For simple sound recording (no video or screen) I use Audacity.
  • All of my lecture videos are run through Final Cut Pro X on my 15-inch MacBook Pro with Retina Display. Videos from Camtasia are exported in Apple ProRes format and then imported into Final Cut. Learning FCPX is not for the faint-of-heart if you're not used to a nonlinear editor (as I was not). I bought this excellent book to help me learn it, but I still probably only use 1% of the features. In the end using a real editor was worth it because it makes merging multiple videos much easier (i.e. multicam shots for screencasts + talking head) and editing out mistakes (e.g. typos on slides) much simpler. The editor in Camtasia is pretty good but if you have more than one camera/microphone it becomes infeasible.
  • I have an 8TB Western Digital Thunderbolt drive to store the raw video for all my classes (and some backups). I also use two 1TB Thunderbolt drives to store video for individual classes (each 4-week class borders on 1TB of raw video). These smaller drives are nice because I can just throw them in my bag and edit video at home or on the weekend if I need to.
  • Finished videos are shared with a Dropbox for Business account so that Jeff, Brian, and I can all look at each other's stuff. Videos are exported to H.264/AAC and uploaded to Coursera.
  • For developing slides, Jeff, Brian, and I have standardized around using Slidify. The beauty of using slidify is that it lets you write everything in Markdown, a super simple text format. It also makes it simpler to manage all the course material in Git/GitHub because you don't have to lug around huge PowerPoint files. Everything is a light-weight text file. And thanks to Ramnath's incredible grit and moxie, we have handy tools to easily export everything to PDF and HTML slides (HTML slides hosted via GitHub Pages).

The first courses for the Data Science Specialization start on April 7th. Don't forget to sign up!


The three tables for genomics collaborations

Collaborations between biologists and statisticians are very common in genomics. For the data analysis to be fruitful, the statistician needs to understand what samples are being analyzed. For the analysis report to make sense to the biologist, it needs to be properly annotated with information such as gene names, genomic location, etc. In a recent post, Jeff laid out his guide for such collaborations; here I describe an approach that has helped me in mine.

In many of my past collaborations, sharing the experiment's key information in a way that facilitates data analysis turned out to be more time consuming than the analysis itself. One reason is that the data producers annotated samples in ways that were impossible to decipher without direct knowledge of the experiment (e.g. using lab-specific codes in the filenames, or colors in Excel files). In the early days of microarrays, a group of researchers noticed this problem and created a markup language to describe and communicate information about experiments electronically.

The Bioconductor project took a less ambitious approach and created classes specifically designed to store the minimal information needed to perform an analysis. These classes can be thought of as three tables, stored as flat text files, all of which are relatively easy for biologists to create.

The first table contains the experimental data with rows representing features (e.g. genes) and the columns representing samples.

The second table contains the sample information. This table contains a row for each column in the experimental data table. This table contains at least two columns. The first contains an identifier that can be used to match the rows of this table to the columns of the first table. The second contains the main outcome of interest, e.g. case or control, cancer or normal. Other commonly included columns are the filename of the original raw data associated with each row, the date the experiment was processed, and other information about the samples.

The third table contains the feature information. This table contains a row for each row in the experimental table. The table contains at least two columns. The first contains an identifier that can be used to match the rows of this table to the rows of the first table. The second will contain an annotation that makes sense to biologists, e.g. a gene name. For technologies that are widely used (e.g. Affymetrix gene expression arrays) these tables are readily available.
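
As a rough illustration of how the three tables fit together, here is a small sketch in base R. All identifiers, gene names, filenames, and values are made up, and it deliberately uses plain matrices and data frames rather than the Bioconductor classes.

```r
# Table 1: experimental data, features in rows and samples in columns (toy values)
expr <- matrix(c(5.1, 7.3, 2.2,  4.8, 7.9, 2.0,  9.5, 7.1, 2.4,  9.9, 7.4, 2.1),
               nrow = 3,
               dimnames = list(c("f1", "f2", "f3"), c("s1", "s2", "s3", "s4")))

# Table 2: sample information, one row per column of expr (deliberately unordered)
samples <- data.frame(sample_id = c("s3", "s1", "s4", "s2"),
                      outcome   = c("case", "control", "case", "control"),
                      filename  = c("c.txt", "a.txt", "d.txt", "b.txt"),
                      stringsAsFactors = FALSE)

# Table 3: feature information, one row per row of expr (deliberately unordered)
features <- data.frame(feature_id = c("f2", "f3", "f1"),
                       gene       = c("GENE_B", "GENE_C", "GENE_A"),
                       stringsAsFactors = FALSE)

# Use the identifier columns to put tables 2 and 3 in the order of table 1
samples  <- samples[match(colnames(expr), samples$sample_id), ]
features <- features[match(rownames(expr), features$feature_id), ]
stopifnot(identical(samples$sample_id, colnames(expr)),
          identical(features$feature_id, rownames(expr)))

# A rudimentary analysis: case minus control mean, reported by gene name
case_mean    <- rowMeans(expr[, samples$outcome == "case"])
control_mean <- rowMeans(expr[, samples$outcome == "control"])
report <- data.frame(gene = features$gene,
                     diff = unname(case_mean - control_mean),
                     stringsAsFactors = FALSE)
print(report)
```

The `match()` calls are the whole point: once each table carries an identifier column, the statistician never has to guess which sample or gene a number belongs to.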

With these three relatively simple files in place less time is spent "figuring out" the data and the statisticians can focus their energy on data analysis while the biologists can focus their energy on interpreting the results. This approach seems to have been the inspiration for the MAGE-TAB format.

Note that with newer technologies, statisticians prefer to get access to the raw data. In this case, instead of an experimental data table (table 1), they will want the original raw data files. The sample information then must contain a column with the filenames so that sample annotation can be properly matched.

NB: These three tables are not a complete description of an experiment and are not intended as an alternative to standards such as MAGE and MIAME. But in many cases, they provide the very minimum information needed to carry out a rudimentary analysis. Note that Bioconductor provides tools to import information from MAGE-ML and other related formats.


Not teaching computing and statistics in our public schools will make upward mobility even harder

In his book Average Is Over, Tyler Cowen predicts that as automatization becomes more common, modern economies will eventually be composed of two groups: 1) a highly educated minority involved in the production of  automated services and 2) a vast majority earning very little but enough to consume some of the low-priced products created by group 1.  Not everybody will agree with this view, but we can't ignore the fact that automatization has already eliminated many middle class jobs in, for example, manufacturing and the automotive industries. New technologies, such as driverless cars and online retailers, will very likely eliminate many more jobs (e.g. drivers and retail clerks) than they create (programmers and engineers).

Computer literacy is essential for working with automatized systems. Programming and learning from data are perhaps the most useful skills for creating these systems. Yet the current default curriculum includes neither computer science nor statistics. At the same time, there are plenty of resources for motivated parents with means to get their children to learn these subjects. Kids whose parents don't have the wherewithal to take advantage of these educational resources will be at an even greater disadvantage than they are today. This disadvantage is made worse by the fact that many of the aforementioned resources are free and open to the world (Codecademy, Khan Academy, EdX, and Coursera, for example), which means that a large pool of students that previously had no access to this learning material will also be competing for group 1 jobs. If we want to level the playing field we should start by updating the public school curriculum so that, in principle, everybody has the opportunity to compete.


Announcing the Release of swirl 2.0

Editor's note: This post was written by Nick Carchedi, a Master's degree student in the Department of Biostatistics at Johns Hopkins. He is working with us to develop the Data Science Specialization as well as software for interactive learning of R and statistics.

Official swirl website:

On September 27, 2013, I wrote a guest blog post on Simply Statistics to announce the creation of Statistics with Interactive R Learning (swirl), an R package for teaching and learning statistics and R simultaneously and interactively. Over the next several months, I received a tremendous amount of feedback from all over the world. Two things became clear: 1) there were many opportunities for improvement on the original design and 2) there's an incredible demand globally for new and better ways of learning statistics and R.

In the spirit of R and open source software, I shared the source code for swirl on GitHub. As a result, I quickly came in contact with several very talented individuals, without whom none of what I'm about to share with you would have been possible. Armed with invaluable feedback and encouragement from early adopters of swirl 1.0, my new team and I pursued a complete overhaul of the original design.

Today, I'm happy to announce the result of our efforts: swirl 2.0.

Like the first version of the software, swirl 2.0 guides students through interactive tutorials in the R console on a variety of topics related to statistics and R. The user selects from a menu of courses, each of which is broken up by topic into shorter lessons. Lessons, in turn, are a dialog between swirl and the user and are composed of text output, multiple choice and text-based questions, and (most importantly) questions that require the user to enter actual R code at the prompt. Responses are evaluated for correctness based on instructor-specified answer tests and appropriate feedback is given immediately to the user.

It's helpful to think of swirl as the synthesis of two separate parts: content and platform. Content is authored by instructors in R Markdown files. The platform is then responsible for delivering this content to the user and interpreting the user's responses in an interactive and engaging way.

Our primary focus for swirl 2.0 was to build a more robust and extensible platform for delivering content. Here's a (nontechnical) summary of new and revised features:

  • A library of answer tests an instructor can deploy to check user input for correctness
  • If stuck, a user can skip a question, causing swirl to enter the correct answer on their behalf
  • During a lesson, a user can pause instruction to play around or practice something they just learned, then use a special keyword to regain swirl's attention when ready to resume
  • swirl "sees" user input the same way R "sees" it, which allows swirl to understand the composition of a user's input on a much deeper level (thanks, Hadley)
  • User progress is saved between sessions
  • More readable output that adjusts to the width of the user's display (thanks again, Hadley)
  • Extensible framework allows others to easily extend swirl's functionality
  • Instructors can author content in a special flavor of R markdown

(For a more technical understanding of swirl's features and inner workings, we encourage readers to consult our GitHub repository.)
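
To give a flavor of what an answer test that "sees" input the way R does might look like, here is a minimal sketch. The functions `uses_function` and `value_matches` are hypothetical illustrations, not swirl's actual API; they only rely on base R's `parse()`, `all.names()`, and `eval()`.

```r
# Hypothetical answer tests; NOT swirl's actual implementation.

# Does the parsed user input call a particular function anywhere?
uses_function <- function(user_input, fun_name) {
  expr <- parse(text = user_input)      # parse the text exactly as R would
  fun_name %in% all.names(expr)         # names of all functions/symbols used
}

# Does the user input evaluate to the expected value?
value_matches <- function(user_input, expected) {
  val <- eval(parse(text = user_input), envir = new.env())
  isTRUE(all.equal(val, expected))
}

# Two inputs with the same value but different composition
a_uses_mean  <- uses_function("mean(c(1, 2, 3))", "mean")
a_value_ok   <- value_matches("mean(c(1, 2, 3))", 2)
b_uses_mean  <- uses_function("sum(1:3) / 3", "mean")
b_value_ok   <- value_matches("sum(1:3) / 3", 2)
print(c(a_uses_mean, a_value_ok, b_uses_mean, b_value_ok))
```

Checking the parsed expression rather than the raw text is what lets an instructor ask "did the student use `mean`?" separately from "did the student get the right number?".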

Although improving the platform was our first priority for this release, we've made some improvements to existing content and, more importantly, added the beginnings of a new course: Intro to R. Intro to R is our response to the overwhelming demand for a more accessible and interactive way to learn the R language. We've included the first three lessons of the course and plan to add many more over the coming months as our focus turns to creating more high quality content.

Our ultimate goal is to have the statistics and R communities use swirl as a platform to deliver their own content to students interactively. We've heard from many people who have an interest in creating their own content and we're working hard to make the process of creating content as simple and enjoyable as possible.

The goal of swirl is not to be flashy, but rather to provide the most authentic learning environment possible. We accomplish this by placing students directly on the R prompt, within the very same environment they'll use for data analysis when they are not using swirl. We hope you find swirl to be a valuable tool for learning and teaching statistics and R.

It's important to stress that, as with any new software, we expect there will be bugs. At this point, users should still consider themselves "early adopters". For bug reports, suggested enhancements, or to learn more about swirl, please visit our website.


Many people have contributed to this project, either directly or indirectly, since its inception. I will attempt to list them all here, in no particular order. I'm sincerely grateful to each and every one of you.

  • Bill & Gina: swirl is as much theirs as it is mine at this point. Their contributions are the only reason the project has evolved so much since the release of swirl 1.0.
  • Brian: Challenged me to turn my idea for swirl into a working prototype. Coined the "swirl" acronym. swirl would still be an idea in my head without his encouragement.
  • Jeff: Pushes me to think big picture and provides endless encouragement. Reminds me that a great platform is worthless without great content.
  • Roger: Encouraged me to separate platform and content, a key paradigm that allowed swirl to mature from a messy prototype to something of real value. Introduced me to Git and GitHub.
  • Lauren & Ethan: Helped with development of the earliest instructional content.
  • Ramnath: Provided a model for content authoring via slidify "flavor" of R Markdown.
  • Hadley: Made key suggestions for improvement and provided an important proof of concept. His work has had a profound influence on swirl's development.
  • Peter: Our discussions led to a better understanding of some key ideas behind swirl 2.0.
  • Sally & Liz: Beta testers and victims of my endless rants during stats tutoring sessions.
  • Kelly: Most talented graphic designer I know and mastermind behind the swirl logo. First line of defense against bad ideas, poor design, and crappy websites. Visit her website.
  • Mom & Dad: Beta testers and my #1 fans overall.

Marie Curie says stop hating on quilt plots already.

"There are sadistic scientists who hurry to hunt down error instead of establishing the truth." -Marie Curie

Thanks to Kasper H. for that quote. I think it is a perfect fit for today's culture of academic put down as academic contribution. One perfect example is the explosion of hate against the quilt plot. A quilt plot is a heatmap with several parameters selected in advance; that's it. This simplification of R's heatmap function appeared in the journal PLoS One. They say (though not up front and not clearly enough for my personal taste) that they know it is just a heatmap.

Over the course of the next several weeks quilt plots went viral. Here are a few example tweets. The quilt plot was also widely trashed on people's blogs and even in The Scientist. So I did an experiment. I built a table of frequencies in R like this and applied the heatmap function in R, then the quilt.plot function in fields, then the function written by the authors of the paper with as little tweaking as possible.

# 5 x 5 matrix of binomial counts, converted to proportions of the grand total
x = matrix(rbinom(25,size=4,prob=0.5),nrow=5)
pt = prop.table(x)

Here are the results:







It is clear that out of the box and with no tinkering, the new plot makes something nicer/more interpretable. The columns/rows are where I expect and the scale is there and nicely labeled. Everyone who has ever made heatmaps in R has some bit of code that looks like this:
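
The original snippet is missing from this copy of the post. As a stand-in, here is a minimal sketch of the kind of tweaks usually involved, assuming the goal is a plain heatmap laid out like the printed matrix: no reordering, no dendrograms, no row scaling, and row 1 at the top. The seed, labels, and color choices are mine, not the paper's.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable
x  <- matrix(rbinom(25, size = 4, prob = 0.5), nrow = 5)
pt <- prop.table(x)
dimnames(pt) <- list(paste0("row", 1:5), paste0("col", 1:5))

# heatmap() draws the first row at the bottom, so flip the rows first
pt_flipped <- pt[nrow(pt):1, ]

heatmap(pt_flipped,
        Rowv = NA, Colv = NA,                    # no clustering or dendrograms
        scale = "none",                          # plot raw proportions
        col = gray(seq(1, 0, length.out = 16)),  # white = low, black = high
        margins = c(5, 5))
```

Even after all that there is still no legend for the color scale, which has to be added by hand.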


To hack together a heatmap in R that looks like you expect. It is a total pain. Obviously the quilt plot paper has a few flaws:

  1. It tries to introduce the quilt plot as a new idea.
  2. It doesn't just come out and say it is a hack of the heatmap function, but tries to dance around it.
  3. It produces code, but only as images in word files. I had to retype the code to make my plot.

That being said here are a couple of other true things about the paper:

  1. The code works if you type it out and apply it.
  2. They produced code.
  3. The paper is open access.
  4. The paper is correct technically.
  5. The hack is useful for users with few R skills.

So why exactly isn't it a paper? It smacks of academic elitism to claim that this isn't good enough because it isn't a "new idea". Not every paper discovers radium. Some papers are better than others and that is ok. I think the quilt plot being published isn't a problem, maybe I don't like the way it is written exactly, but they do acknowledge the heat map, they do produce correct, relevant code, and it does solve a problem people actually have. That is better than a lot of papers that appear in more prestigious journals. Arsenic life anyone?

I think it is useful to have a forum where people can post correct, useful, but not necessarily ground breaking results and get credit for them, even if the credit is modest. Otherwise we might miss out on useful bits of code. Frank Harrell has a bunch of functions that tons of people use but he doesn't get citations; you have probably heard of the Hmisc package if you use R.

But did you know Karl Broman has a bunch of really useful functions in his personal R package? qqline2 is great. I know Rafa has a bunch of functions he has never published because they seem "too trivial" but I use them all the time. Every scientist who touches code has a personal library like this. I'm not saying the quilt plot is in that category. But I am saying that it is stupid not to have a public forum for making these functions available to other scientists. But that won't happen if the "quilt plot backlash" is what people see when they try to get published credit for simple code that solves real problems.

Hacks like the quilt plot can help people who aren't comfortable with R write reproducible scripts without having to figure out every plotting parameter. Keeping in mind that the vast majority of data analysis is not done by statisticians, it seems like these little hacks are an important part of science. If you believe in figshare, github, open science, and shareable code, you shouldn't be making fun of the quilt plotters.

Marie Curie says so.


The Johns Hopkins Data Science Specialization on Coursera

We are very proud to announce the Johns Hopkins Data Science Specialization on Coursera. You can see the official announcement from the Coursera folks here. This is the main reason Simply Statistics has been a little quiet lately.

The three of us (Brian Caffo, Roger Peng, and Jeff Leek) along with a couple of incredibly hard working graduate students (Nick Carchedi of swirl fame and Sean Kross) have put together nine new one-month classes to run on the Coursera platform. The classes are:

  1. The Data Scientist's Toolbox  - A basic introduction to data and data science and a  basic guide to R/Rstudio/Github/Command Line Interface.
  2. R Programming  - Introduction to R programming, from installing R to types, to functions, to control structures.
  3. Getting and Cleaning Data - An introduction to getting data from the web, from images, from APIs, and from databases. The course also covers how to go from raw data to tidy data.
  4. Exploratory Data Analysis - This course covers plotting in base graphics, lattice, ggplot2 and clustering and other exploratory techniques. It also covers how to think about exploring data you haven't seen.
  5. Reproducible Research - This is one of the courses unique to our sequence. It covers how to think about reproducible research, evidence based data analysis, reproducible research checklists and knitr, markdown, R markdown, etc.
  6. Statistical Inference  - This course covers the fundamentals of statistical inference from a practical perspective. The course covers both the technical details and important ideas like confounding.
  7. Regression Models  - This course covers the fundamentals of linear and generalized linear regression modeling. It also serves as an introduction to how to "think about" relating variables to each other quantitatively.
  8. Practical Machine Learning  - This course will cover the basic conceptual ideas in machine learning like in/out of sample errors, cross validation, and training and test sets. It will also cover a range of machine learning algorithms and their practical implementation.
  9. Developing Data Products  - This course will cover how to develop tools for communicating data, methods, and analyses with other people. It will cover building R packages, Shiny, and Slidify, among other things.

There will also be a specialization project - consisting of a 10th class where students will work on projects conducted with industry, government, and academic partners.

The classes represent some of the content we have previously covered in our popular Coursera classes and a ton of brand new content for this specialization. Here are some things that I think make our program stand out:

  • We will roll out 3 classes at a time starting in April. Once a class is running, it will run every single month concurrently.
  • The specialization offers a bunch of unique content, particularly in the courses Getting and Cleaning Data, Reproducible Research, and Developing Data Products.
  • All of the content is being developed open source and open-access on Github. You are welcome to check it out as we develop it and contribute!
  • You can take the first 9 courses of the specialization entirely for free.
  • You can choose to pay a very modest fee to get "Signature Track" certification in every course.

I have also created a little page that summarizes some of the unique aspects of our program. Scroll through it and you'll find sharing links at the bottom. Please share with your friends, we think this is pretty cool:


Sunday data/statistics link roundup (1/19/2014)

  1. Tesla is hiring a data scientist. That is all.
  2. I'm not sure I buy the idea that Python is taking over for R among people who actually do regular data science.  I think it is still context dependent. A huge fraction of genomics happens in R and there is a steady stream of new packages that allow R users to push farther and farther back into the processing pipeline. On the other hand, I think language diversity is clearly a plus for someone who works with data. Not that I'd know...
  3. This is an awesome talk on why to pursue a Ph.D. It gives a really level headed and measured discussion, specifically focused on computational programs (I think I got to it via Alyssa F.'s blog).
  4. En Español - A blog post about a study of genetic risk factors among Hispanic/Latino populations (via Rafa).
  5. Where have all the tenured women gone? This is a major issue and deserves much more press than it gets (via Sherri R.).
  6. Not related to statistics really, but these image captures from Google streetview are wild. 

Missing not at random data makes some Facebook users feel sad

This article, published last week, explained how "some younger users of Facebook say that using the site often leaves them feeling sad, lonely and inadequate". Being a statistician gives you an advantage here because we know that naive estimates from missing not at random (MNAR) data can be very biased. The posts you see on Facebook are not a random sample from your friends' lives. We see pictures of their vacations, abnormally flattering pictures of themselves, reports on their major achievements, etc., but no view of the mundane, typical daily occurrences. Here is a simple cartoon explanation of how MNAR data can give you a biased view of what's really going on. Suppose your life occurrences are rated from 1 (worst) to 5 (best); this table compares what you see to what is really going on after 15 occurrences:

[Image: table comparing the occurrences you see on Facebook with what is really going on]
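
The same cartoon can be sketched in a few lines of R. The 15 ratings below are made up for illustration (they are not the values from the original table), and the posting rule, that only occurrences rated 4 or higher get shared, is an assumption.

```r
# 15 made-up life occurrences rated from 1 (worst) to 5 (best)
ratings <- c(3, 2, 3, 5, 3, 1, 3, 4, 3, 2, 3, 5, 3, 2, 4)

# Assumed MNAR mechanism: only the good occurrences (4 or 5) are posted
posted <- ratings >= 4
seen   <- ratings[posted]

true_mean  <- mean(ratings)  # what is really going on
naive_mean <- mean(seen)     # what your friends see

print(round(c(true = true_mean, naive = naive_mean), 2))
# true = 3.07, naive = 4.5
```

Because whether an occurrence is observed depends on its own value, the naive average of what you see overstates how good your friends' lives are.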
