Simply Statistics


Announcing the JHU Data Science Hackathon 2015

We are pleased to announce that the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health will be hosting the first ever JHU Data Science Hackathon (DaSH) on September 21-23, 2015 at the Baltimore Marriott Waterfront.

This event will be an opportunity for data scientists and data scientists-in-training to get together and hack on real-world problems collaboratively and to learn from each other. The DaSH will feature data scientists from government, academia, and industry presenting problems and describing challenges in their respective areas. There will also be a number of networking opportunities where attendees can get to know each other. We think this will be  fun event and we encourage people from all areas, including students (graduate and undergraduate), to attend.

To get more details and to sign up for the hackathon, you can go to the DaSH web site. We will be posting more information as the event nears.


  • Jeff Leek
  • Brian Caffo
  • Roger Peng
  • Leah Jager


  • National Institutes of Health
  • Johns Hopkins University



stringsAsFactors: An unauthorized biography

Recently, I was listening in on the conversation of some colleagues who were discussing a bug in their R code. The bug was ultimately traced back to the well-known phenomenon that functions like 'read.table()' and 'read.csv()' in R convert columns that are detected to be character/strings to be factor variables. This lead to the spontaneous outcry from one colleague of

Why does stringsAsFactors not default to FALSE????

The argument 'stringsAsFactors' is an argument to the 'data.frame()' function in R. It is a logical that indicates whether strings in a data frame should be treated as factor variables or as just plain strings. The argument also appears in 'read.table()' and related functions because of the role these functions play in reading in table data and converting them to data frames. By default, 'stringsAsFactors' is set to TRUE.

This argument dates back to May 20, 2006 when it was originally introduced into R as the 'charToFactor' argument to 'data.frame()'. Soon afterwards, on May 24, 2006, it was changed to 'stringsAsFactors' to be compatible with S-PLUS by request from Bill Dunlap.

Most people I talk to today who use R are completely befuddled by the fact that 'stringsAsFactors' is set to TRUE by default. First of all, it should be noted that before the 'stringsAsFactors' argument even existed, the behavior of R was to coerce all character strings to be factors in a data frame. If you didn't want this behavior, you had to manually coerce each column to be character.

So here's the story:

In the old days, when R was primarily being used by statisticians and statistical types, this setting strings to be factors made total sense. In most tabular data, if there were a column of the table that was non-numeric, it almost certainly encoded a categorical variable. Think sex (male/female), country (U.S./other), region (east/west), etc. In R, categorical variables are represented by 'factor' vectors and so character columns got converted factor.

Why do we need factor variables to begin with? Because of modeling functions like 'lm()' and 'glm()'. Modeling functions need to treat expand categorical variables into individual dummy variables, so that a categorical variable with 5 levels will be expanded into 4 different columns in your modeling matrix. There's no way for R to know it should do this unless it has some extra information in the form of the factor class. From this point of view, setting 'stringsAsFactors = TRUE' when reading in tabular data makes total sense. If the data is just going to go into a regression model, then R is doing the right thing.

There's also a more obscure reason. Factor variables are encoded as integers in their underlying representation. So a variable like "disease" and "non-disease" will be encoded as 1 and 2 in the underlying representation. Roughly speaking, since integers only require 4 bytes on most systems, the conversion from string to integer actually saved some space for long strings. All that had to be stored was the integer levels and the labels. That way you didn't have to repeat the strings "disease" and "non-disease" for as many observations that you had, which would have been wasteful.

Around June of 2007, R introduced hashing of CHARSXP elements in the underlying C code thanks to Seth Falcon. What this meant was that effectively, character strings were hashed to an integer representation and stored in a global table in R. Anytime a given string was needed in R, it could be referenced by its underlying integer. This effectively put in place, globally, the factor encoding behavior of strings from before. Once this was implemented, there was little to be gained from an efficiency standpoint by encoding character variables as factor. Of course, you still needed to use 'factors' for the modeling functions.

The difference nowadays is that R is being used a by a very wide variety of people doing all kinds of things the creators of R never envisioned. This is, of course, wonderful, but it introduces lots of use cases that were not originally planned for. I find that most often, the people complaining about 'stringsAsFactors' not being FALSE are people who are doing things that are not the traditional statistical modeling things (things that old-time statisticians like me used to do). In fact, I would argue that if you're upset about 'stringsAsFactors = TRUE', then it's a pretty good indicator that you're either not a statistician by training, or you're doing non-traditional statistical things.

For example, in genomics, you might have the names of the genes in one column of data. It really doesn't make sense to encode these as factors because they won't be used in any modeling function. They're just labels, essentially. And because of CHARSXP hashing, you don't gain anything from an efficiency standpoint by converting them to factors either.

But of course, given the long-standing behavior of R, many people depend on the default conversion of characters to factors when reading in tabular data. Changing this default would likely result in an equal number of people complaining about 'stringsAsFactors'.

I fully expect that this blog post will now make all R users happy. If you think I've missed something from this unauthorized biography, please let me know on Twitter (@rdpeng).


The Mozilla Fellowship for Science

This looks like an interesting opportunity for grad students, postdocs, and early career researchers:

We're looking for researchers with a passion for open source and data sharing, already working to shift research practice to be more collaborative, iterative and open. Fellows will spend 10 months starting September 2015 as community catalysts at their institutions, mentoring the next generation of open data practitioners and researchers and building lasting change in the global open science community.

Throughout their fellowship year, chosen fellows will receive training and support from Mozilla to hone their skills around open source and data sharing. They will also craft code, curriculum and other learning resources that help their local communities learn open data practices, and teach forward to their peers.

Here's what you get:

Fellows will receive:

  • A stipend of $60,000 USD, paid in 10 monthly installments.
  • One-time health insurance supplement for Fellows and their families, ranging from $3,500 for single Fellows to $7,000 for a couple with two or more children.
  • One-time childcare allotment for families with children of up to $6,000.
  • Allowance of up to $3,000 towards the purchase of laptop computer, digital cameras, recorders and computer software; fees for continuing studies or other courses, research fees or payments, to the extent related to the fellowship.
  • All approved fellowship trips – domestic and international – are covered in full.

Deadline is August 14.


JHU, UMD researchers are getting a really big Big Data center

From Baltimore:

A nondescript, 3,700-square-foot building on Johns Hopkins’ Bayview campus will house a new data storage and computing center for university researchers. The $30 million Maryland Advanced Research Computing Center (MARCC) will be available to faculty from JHU and the University of Maryland, College Park.

The web site has a pretty cool time-lapse video of the construction of the computing center. There's also a bit more detail at the JHU Hub site.


The Massive Future of Statistics Education

NOTE: This post was written as a chapter for the not-yet-released Handbook on Statistics Education. 

Data are eating the world, but our collective ability to analyze data is going on a starvation diet.

Everywhere you turn, data are being generated somehow. By the time you read this piece, you’ll probably have collected some data. (For example this piece has 2,072 words). You can’t avoid data—it’s coming from all directions.

So what do we do with it? For the most part, nothing. There’s just too much data being spewed about. But for the data that we are interested in, we need to know the appropriate methods for thinking about and analyzing them. And by “we”, I mean pretty much everyone.

In the future, everyone will need some data analysis skills. People are constantly confronted with data and the need to make choices and decisions from the raw data they receive. Phones deliver information about traffic, we have ratings about restaurants or books, and even rankings of hospitals. High school students can obtain complex and rich information about the colleges to which they’re applying while admissions committees can get real-time data on applicants’ interest in the college.

Many people already have heuristic algorithms to deal with the data influx—and these algorithms may serve them well—but real statistical thinking will be needed for situations beyond choosing which restaurant to try for dinner tonight.

Limited Capacity

The McKinsey Global Institute, in a highly cited report, predicted that there would be a shortage of “data geeks” and that by 2018 there would be between 140,000 and 190,000 unfilled positions in data science. In addition, there will be an estimated 1.5 million people in managerial positions who will need to be trained to manage data scientists and to understand the output of data analysis. If history is any guide, it’s likely that these positions will get filled by people, regardless of whether they are properly trained. The potential consequences are disastrous as untrained analysts interpret complex big data coming from myriad sources of varying quality.

Who will provide the necessary training for all these unfilled positions? The field of statistics’ current system of training people and providing them with master’s degrees and PhDs is woefully inadequate to the task. In 2013, the top 10 largest statistics master’s degree programs in the U.S. graduated a total of 730 people. At this rate we will never train the people needed. While statisticians have greatly benefited from the sudden and rapid increase in the amount of data flowing around the world, our capacity for scaling up the needed training for analyzing those data is essentially nonexistent.

On top of all this, I believe that the McKinsey report is a gross underestimation of how many people will need to be trained in some data analysis skills in the future. Given how much data is being generated every day, and how critical it is for everyone to be able to intelligently interpret these data, I would argue that it’s necessary for everyone to have some data analysis skills. Needless to say, it’s foolish to suggest that everyone go get a master’s or even bachelor’s degrees in statistics. We need an alternate approach that is both high-quality and scalable to a large population over a short period of time.

Enter the MOOCs

In April of 2014, Jeff Leek, Brian Caffo, and I launched the Johns Hopkins Data Science Specialization on the Coursera platform. This is a sequence of nine courses that intends to provide a “soup-to-nuts” training in data science for people who are highly motivated and have some basic mathematical and computing background. The sequence of the nine courses follow what we believe is the essential “data science process”, which is

  1. Formulating a question that can be answered with data
  2. Assembling, cleaning, tidying data relevant to a question
  3. Exploring data, checking, eliminating hypotheses
  4. Developing a statistical model
  5. Making statistical inference
  6. Communicating findings
  7. Making the work reproducible

We took these basic steps and designed courses around each one of them.

Each course is provided in a massive open online format, which means that many thousands of people typically enroll in each course every time it is offered. The learners in the courses do homework assignments, take quizzes, and peer assess the work of others in the class. All grading and assessment is handled automatically so that the process can scale to arbitrarily large enrollments. As an example, the April 2015 session of the R Programming course had nearly 45,000 learners enrolled. Each class is exactly 4 weeks long and every class runs every month.

We developed this sequence of courses in part to address the growing demand for data science training and education across the globe. Our background as biostatisticians was very closely aligned with the training needs of people interested in data science because, essentially, data science is what we do every single day. Indeed, one curriculum rule that we had was that we couldn’t include something if we didn’t in fact use it in our own work.

The sequence has a substantial amount of standard statistics content, such as probability and inference, linear models, and machine learning. It also has non-standard content, such as git, GitHub, R programming, Shiny, and Markdown. Together, the sequence covers the full spectrum of tools that we believe will be needed by the practicing data scientist.

For those who complete the nine courses, there is a capstone project at the end, that involves taking all of the skills in the course and developing a data product. For our first capstone project we partnered with SwiftKey, a predictive text analytics company, to develop a project where learners had to build a statistical model for predicting words in a sentence. This project involves taking unstructured, messy data, processing it into an analyzable form, developing a statistical model while making tradeoffs for efficiency and accuracy, and creating a Shiny app to show off their model to the public.

Degree Alternatives

The Data Science Specialization is not a formal degree program offered by Johns Hopkins University—learners who complete the sequence do not get any Johns Hopkins University credit—and so one might wonder what the learners get out of the program (besides, of course, the knowledge itself). To begin with, the sequence is completely portfolio based, so learners complete projects that are immediately viewable by others. This allows others to evaluate a learner’s ability on the spot with real code or data analysis.

All of the lecture content is openly available and hosted on GitHub, so outsiders can view the content and see for themselves what is being taught. This give outsiders an opportunity to evaluate the program directly rather than have to rely on the sterling reputation of the institution teaching the courses.

Each learner who completes a course using Coursera’s “Signature Track” (which currently costs $49 per course) can get a badge on their LinkedIn profile, which shows that they completed the course. This can often be as valuable as a degree or other certification as recruiters scouring LinkedIn for data scientist positions will be able to see our completers’ certifications in various data science courses.

Finally, the scale and reach of our specialization immediately creates a large alumni social network that learners can take advantage of. As of March 2015, there were approximately 700,000 people who had taken at least one course in the specialization. These 700,000 people have a shared experience that, while not quite at the level of a college education, still is useful for forging connections between people, especially when people are searching around for jobs.

Early Numbers

So far, the sequence has been wildly successful. It averaged 182,507 enrollees a month for the first year in existence. The overall course completion rate was about 6% and the completion rate amongst those in the “Signature Track” (i.e. paid enrollees) was 67%. In October of 2014, barely 7 months since the start of the specialization, we had 663 learners enroll in the capstone project.

Some Early Lessons

From running the Data Science Specialization for over a year now, we have learned a number of lessons, some of which were unexpected. Here, I summarize the highlights of what we’ve learned.

Data Science as Art and Science. Ironically, although the word “Science” appears in the name “Data Science”, there’s actually quite a bit about the practice of data science that doesn’t really resemble science at all. Much of what statisticians do in the act of data analysis is intuitive and ad hoc, with each data analysis being viewed as a unique flower.

When attempting to design data analysis assignments that could be graded at scale with tens of thousands of people, we discovered that designing the rubrics for grading these assignments was not trivial. The reason is because our understanding of what makes a “good” analysis different from a bad one is not well-articulated. We could not identify any community-wide understanding of what are the components of a good analysis. What are the “correct” methods to use in a given data analysis situation? What is definitely the “wrong” approach?

Although each one of us had been doing data analysis for the better part of a decade, none of us could succinctly write down what the process was and how to recognize when it was being done wrong. To paraphrase Daryl Pregibon from his 1991 talk at the National Academies of Science, we had a process that we regularly espoused but barely understood.

Content vs. Curation. Much of the content that we put online is available elsewhere. With YouTube, you can find high-quality videos on almost any topic, and our videos are not really that much better. Furthermore, the subject matter that we were teaching was in now way proprietary. The linear models that we teach are the same linear models taught everywhere else. So what exactly was the value we were providing?

Searching on YouTube requires that you know what you are looking for. This is a problem for people who are just getting into an area. Effectively, what we provided was a curation of all the knowledge that’s out there on the topic of data science (we also added our own quirky spin). Curation is hard, because you need to make definitive choices between what is and is not a core element of a field. But curation is essential for learning a field for the uninitiated.

Skill sets vs. Certification. Because we knew that we were not developing a true degree program, we knew we had to develop the program in a manner so that the learners could quickly see for themselves the value they were getting out of it. This lead us to taking a portfolio approach where learners produced things that could be viewed publicly.

In part because of the self-selection of the population seeking to learn data science skills, our learners were more interested in being able to demonstrate the skills taught in the course rather than an abstract (but official) certification as might be gotten in a degree program. This is not unlike going to a music conservatory, where the output is your ability to play an instrument rather than the piece of paper you receive upon graduation. We feel that giving people the ability to demonstrate skills and skill sets is perhaps more important than official degrees in some instances because it gives employers a concrete sense of what a person is capable of doing.


As of April 2015, we had a total of 1,158 learners complete the entire specialization, including the capstone project. Given these numbers and our rate of completion for the specialization as a whole, we believe we are on our way to achieving our goal of creating a highly scalable program for training people in data science skills. Of course, this program alone will not be sufficient for all of the data science training needs of society. But we believe that the approach that we’ve taken, using non-standard MOOC channels, focusing on skill sets instead of certification, and emphasizing our role in curation, is a rich opportunity for the field of statistics to explore in order to educate the masses about our important work.


Looks like this R thing might be for real

Not sure how I missed this, but the Linux Foundation just announced the R Consortium for supporting the "world’s most popular language for analytics and data science and support the rapid growth of the R user community". From the Linux Foundation:

The R language is used by statisticians, analysts and data scientists to unlock value from data. It is a free and open source programming language for statistical computing and provides an interactive environment for data analysis, modeling and visualization. The R Consortium will complement the work of the R Foundation, a nonprofit organization based in Austria that maintains the language. The R Consortium will focus on user outreach and other projects designed to assist the R user and developer communities.

Founding companies and organizations of the R Consortium include The R Foundation, Platinum members Microsoft and RStudio; Gold member TIBCO Software Inc.; and Silver members Alteryx, Google, HP, Mango Solutions, Ketchum Trading and Oracle.


How Airbnb built a data science team

From Venturebeat:

Back then we knew so little about the business that any insight was groundbreaking; data infrastructure was fast, stable, and real-time (I was querying our production MySQL database); the company was so small that everyone was in the loop about every decision; and the data team (me) was aligned around a singular set of metrics and methodologies.

But five years and 43,000 percent growth later, things have gotten a bit more complicated. I’m happy to say that we’re also more sophisticated in the way we leverage data, and there’s now a lot more of it. The trick has been to manage scale in a way that brings together the magic of those early days with the growing needs of the present — a challenge that I know we aren’t alone in facing.


Interview at Leanpub

A few weeks ago I sat down with Len Epp over at Leanpub to talk about my recently published book R Programming for Data Science. So far, I've only published one book through Leanpub but I'm a huge fan. They've developed a system that is, in my opinion, perfect for academic publishing. The book's written in Markdown and they compile it into PDF, ePub, and mobi formats automatically.

The full interview transcript is over at the Leanpub blog. If you want to listen to the audio of the interview, you can subscribe to the Leanpub podcast on iTunes.

R Programming for Data Science is available at Leanpub for a suggested price of $15 (but you can get it for free if you want). R code files, datasets, and video lectures are available through the various add-on packages. Thanks to all of you who've already bought a copy!


Interview with Class Central

Recently I sat down with Class Central to do an interview about the Johns Hopkins Data Science Specialization. We talked about the motivation for designing the sequence and and the capstone project. With the demand for data science skills greater than ever, the importance of the specialization is only increasing.

See the full interview at the Class Central site. Below is short excerpt.


Figuring Out Learning Objectives the Hard Way

When building the Genomic Data Science Specialization (which starts in June!) we had to figure out the learning objectives for each course. We initially set our ambitions high, but as you can see in this video below, Steven Salzberg brought us back to Earth.