Simply Statistics


Teaser trailer for the Genomic Data Science Specialization on Coursera


We have been hard at work in the studio putting together our next specialization to launch on Coursera. It will be called the "Genomic Data Science Specialization" and includes a spectacular line up of instructors: Steven Salzberg, Ela Pertea, James Taylor, Liliana Florea, Kasper Hansen, and me. The specialization will cover command line tools, statistics, Galaxy, Bioconductor, and Python. There will be a capstone course at the end of the sequence featuring an in-depth genomic analysis. If you are a grad student, postdoc, or principal investigator in a group that does genomics this specialization is for you. If you are a person looking to transition into one of the hottest areas of research with the new precision medicine initiative this is for you. Get pumped and share the teaser-trailer with your friends!


A surprisingly tricky issue when using genomic signatures for personalized medicine

My student Prasad Patil has a really nice paper that just came out in Bioinformatics (preprint in case paywalled). The paper is about a surprisingly tricky normalization issue with genomic signatures. Genomic signatures are basically statistical/machine learning functions applied to the measurements for a set of genes to predict how long patients will survive, or how they will respond to therapy. The issue is that usually when building and applying these signatures, people normalize across samples in the training and testing set.

An example of this normalization is to mean-center the measurements for each gene in the testing/application stage, then apply the prediction rule. The problem is that if you use a different set of samples when calculating the mean you can get a totally different prediction function. The basic problem is illustrated in this graphic.


Screen Shot 2015-03-19 at 12.58.03 PM


This seems like a pretty esoteric statistical issue, but it turns out that this one simple normalization problem can dramatically change the results of the predictions. In particular, we show that the predictions for the same patient, with the exact same data, can change dramatically if you just change the subpopulations of patients within the testing set. In this plot, Prasad made predictions for the exact same set of patients two times when the patient population varied in ER status composition. As many as 30% of the predictions were different for the same patient with the same data if you just varied who they were being predicted with.

Screen Shot 2015-03-19 at 1.02.25 PM


This paper highlights how tricky statistical issues can slow down the process of translating ostensibly really useful genomic signatures into clinical practice and lends even more weight to the idea that precision medicine is a statistical field.


A simple (and fair) way all statistics journals could drive up their impact factor.


If every method in every stats journal was implemented in a corresponding R package (easy), was required to have a  companion document that was a tutorial on how to use the software (easy), included a reference to how to cite the paper if you used the software (easy) and the paper/tutorial was posted to the relevant message boards for the communities of interest (easy) that journal would see a dramatic bump in its impact factor.


Data science done well looks easy - and that is a big problem for data scientists

Data science has a ton of different definitions. For the purposes of this post I'm going to use the definition of data science we used when creating our Data Science program online. Data science is:

Data science is the process of formulating a quantitative question that can be answered with data, collecting and cleaning the data, analyzing the data, and communicating the answer to the question to a relevant audience.

In general the data science process is iterative and the different components blend together a little bit. But for simplicity lets discretize the tasks into the following 7 steps:

  1. Define the question of interest
  2. Get the data
  3. Clean the data
  4. Explore the data
  5. Fit statistical models
  6. Communicate the results
  7. Make your analysis reproducible

A good data science project answers a real scientific or business analytics question. In almost all of these experiments the vast majority of the analyst's time is spent on getting and cleaning the data (steps 2-3) and communication and reproducibility (6-7). In most cases, if the data scientist has done her job right the statistical models don't need to be incredibly complicated to identify the important relationships the project is trying to find. In fact, if a complicated statistical model seems necessary, it often means that you don't have the right data to answer the question you really want to answer. One option is to spend a huge amount of time trying to tune a statistical model to try to answer the question but serious data scientist's usually instead try to go back and get the right data.

The result of this process is that most well executed and successful data science projects don't (a) use super complicated tools or (b) fit super complicated statistical models. The characteristics of the most successful data science projects I've evaluated or been a part of are: (a) a laser focus on solving the scientific problem, (b) careful and thoughtful consideration of whether the data is the right data and whether there are any lurking confounders or biases and (c) relatively simple statistical models applied and interpreted skeptically.

It turns out doing those three things is actually surprisingly hard and very, very time consuming. It is my experience that data science projects take a solid 2-3 times as long to complete as a project in theoretical statistics. The reason is that inevitably the data are a mess and you have to clean them up, then you find out the data aren't quite what you wanted to answer the question, so you go find a new data set and clean it up, etc. After a ton of work like that, you have a nice set of data to which you fit simple statistical models and then it looks super easy to someone who either doesn't know about the data collection and cleaning process or doesn't care.

This poses a major public relations problem for serious data scientists. When you show someone a good data science project they almost invariably think "oh that is easy" or "that is just a trivial statistical/machine learning model" and don't see all of the work that goes into solving the real problems in data science. A concrete example of this is in academic statistics. It is customary for people to show theorems in their talks and maybe even some of the proof. This gives people working on theoretical projects an opportunity to "show their stuff" and demonstrate how good they are. The equivalent for a data scientist would be showing how they found and cleaned multiple data sets, merged them together, checked for biases, and arrived at a simplified data set. Showing the "proof" would be equivalent to showing how they matched IDs. These things often don't look nearly as impressive in talks, particularly if the audience doesn't have experience with how incredibly delicate real data analysis is. I imagine versions of this problem play out in industry as well (candidate X did a good analysis but it wasn't anything special, candidate Y used Hadoop to do BIG DATA!).

The really tricky twist is that bad data science looks easy too. You can scrape a data set off the web and slap a machine learning algorithm on it no problem. So how do you judge whether a data science project is really "hard" and whether the data scientist is an expert? Just like with anything, there is no easy shortcut to evaluating data science projects. You have to ask questions about the details of how the data were collected, what kind of biases might exist, why they picked one data set over another, etc.  In the meantime, don't be fooled by what looks like simple data science - it can often be pretty effective.


Editor's note: If you like this post, you might like my pay-what-you-want book Elements of Data Analytic Style:



De-weaponizing reproducibility

A couple of weeks ago Roger and I went to a conference on statistical reproducibility held at the National Academy of Sciences. The discussion was pretty wide ranging and I love that the thinking about reproducibility is coming back to statistics. There was pretty widespread support for the idea that prevention is the right way to approach reproducibility.
It turns out I was the last speaker of the whole conference. This is an unenviable position to be in with so many bright folks speaking first as they covered a huge amount of what I wanted to say. My talk focused on three key points:
  1. The tools for reproducibility already exist, the barrier isn't tools
  2. We need to de-weaponize reproducibility
  3. Prevention is the right approach to reproducibility


In terms of the first point, tools like iPython, knitr, and Galaxy can be used to all but the absolutely largest analysis reproducible right now.  Our group does this all the time with our papers and so do many others. The problem isn't a lack of tools.

Speaking to point two, I think many people would agree that part of the issue is culture change. One issue that is increasingly concerning to me is the "weaponization" of reproducibility.  I have been noticing is that some of us (like me, my students, other folks at JHU, and lots of particularly junior computational people elsewhere) are trying really hard to be reproducible. Most of the time this results in really positive reactions from the community. But when a co-author of mine and I wrote that paper about the science-wise false discovery rate, one of the discussants used our code (great), improved on it (great), identified a bug (great), and then did his level best to humiliate us both in front of the editor and the general public because of that bug (not so great).

I have seen this happen several times. Most of the time if a paper is reproducible the authors get a pat on the back and their code is either ignored, or used in a positive way. But for high-profile and important problems, people  largely use eproducibility to:
  1.  Impose regulatory hurdles in the short term while people transition to reproducibility. One clear example of this is the Secret Science Reform Act which is a bill that imposes strict reproducibility conditions on all science before it can be used as evidence for regulation.
  2. Humiliate people who aren't good coders or who make mistakes in their code. This is what happened in my paper when I produced reproducible code for my analysis, but has also happened to other people.
  3. Take advantage of people's code to plagiarize/straight up steal work. I have stories about this I'd rather not put on the internet


Of the three, I feel like (1) and (2) are the most common. Plagiarism and scooping by theft I think are actually relatively rare based on my own anecdotal experience. But I think that the "weaponization" of reproducibility to block regulation or to humiliate folks who are new to computational sciences is more common than I'd like it to be. Until reproducibility is the standard for everyone - which I think is possible now and will happen as the culture changes -  the people who are the early adopters are at risk of being bludgeoned with their own reproducibility. As a community, if we want widespread reproducibility adoption we have to be ferocious about not allowing this to happen.


The elements of data analytic style - so much for a soft launch

Editor's note: I wrote a book called Elements of Data Analytic Style. Buy it on Leanpub or Amazon! If you buy it on Leanpub, you get all updates (there are likely to be some) for free and you can pay what you want (including zero) but the author would be appreciative if you'd throw a little scratch his way. 

So uh, I was going to soft launch my new book The Elements of Data Analytic Style yesterday. I figured I'd just quietly email my Coursera courses to let them know I created a new reference. It turns out that that wasn't very quiet. First this happened:


and sure enough the website was down:


Screen Shot 2015-03-02 at 2.14.05 PM



then overnight it did something like 6,000+ units:





So lesson learned, there is no soft open with Coursera. Here is the post I was going to write though:


### Post I was gonna write

I have been doing data analysis for something like 10 years now (gulp!) and teaching data analysis in person for 6+ years. One of the things we do in my data analysis class at Hopkins is to perform a complete data analysis (from raw data to written report) every couple of weeks. Then I grade each assignment for everything from data cleaning to the written report and reproducibility. I've noticed over the course of teaching this class (and classes online) that there are many common elements of data analytic style that I don't often see in textbooks, or when I do, I see them spread across multiple books.

I've posted on some of these issues in some open source guides I've posted to Github like:

But I decided that it might be useful to have a more complete guide to the "art" part of data analysis. One goal is to summarize in a succinct way the most common difficulties encountered by practicing data analysts. It may be a useful guide for peer reviewers who could refer to section numbers when evaluating manuscripts, for instructors who have to grade data analyses, as a supplementary text for a data analysis class, or just as a useful reference. It is modeled loosely in format and aim on the Elements of Style by William Strunk. Just as with the EoS, both the checklist and my book cover a small fraction of the field of data analysis, but my experience is that once these elements are mastered, data analysts benefit most from hands on experience in their own discipline of application, and that many principles may be non-transferable beyond the basics. But just as with writing, new analysts would do better to follow the rules until they know them well enough to violate them.

The book includes a basic checklist that may be useful as a guide for beginning data analysts or as a rubric for evaluating data analyses. I'm reproducing it here so you can comment/hate/enjoy on it.


The data analysis checklist

This checklist provides a condensed look at the information in this book. It can be used as a guide during the process of a data analysis, as a rubric for grading data analysis projects, or as a way to evaluate the quality of a reported data analysis.
I Answering the question

1. Did you specify the type of data analytic question (e.g. exploration, assocation causality) before touching the data?
2. Did you define the metric for success before beginning?
3. Did you understand the context for the question and the scientific or business application?
4. Did you record the experimental design?
5. Did you consider whether the question could be answered with the available data?

II Checking the data

1. Did you plot univariate and multivariate summaries of the data?
2. Did you check for outliers?
3. Did you identify the missing data code?

III Tidying the data

1. Is each variable one column?
2. Is each observation one row?
3. Do different data types appear in each table?
4. Did you record the recipe for moving from raw to tidy data?
5. Did you create a code book?
6. Did you record all parameters, units, and functions applied to the data?

IV Exploratory analysis

1. Did you identify missing values?
2. Did you make univariate plots (histograms, density plots, boxplots)?
3. Did you consider correlations between variables (scatterplots)?
4. Did you check the units of all data points to make sure they are in the right range?
5. Did you try to identify any errors or miscoding of variables?
6. Did you consider plotting on a log scale?
7. Would a scatterplot be more informative?

V Inference

1. Did you identify what large population you are trying to describe?
2. Did you clearly identify the quantities of interest in your model?
3. Did you consider potential confounders?
4. Did you identify and model potential sources of correlation such as measurements over time or space?
5. Did you calculate a measure of uncertainty for each estimate on the scientific scale?

VI Prediction

1. Did you identify in advance your error measure?
2. Did you immediately split your data into training and validation?
3. Did you use cross validation, resampling, or bootstrapping only on the training data?
4. Did you create features using only the training data?
5. Did you estimate parameters only on the training data?
6. Did you fix all features, parameters, and models before applying to the validation data?
7. Did you apply only one final model to the validation data and report the error rate?

VII Causality

1. Did you identify whether your study was randomized?
2. Did you identify potential reasons that causality may not be appropriate such as confounders, missing data, non-ignorable dropout, or unblinded experiments?
2. If not, did you avoid using language that would imply cause and effect?

VIII Written analyses

1. Did you describe the question of interest?
2. Did you describe the data set, experimental design, and question you are answering?
3. Did you specify the type of data analytic question you are answering?
4. Did you specify in clear notation the exact model you are fitting?
5. Did you explain on the scale of interest what each estimate and measure of uncertainty means?
6. Did you report a measure of uncertainty for each estimate on the scientific scale?

IX Figures

1. Does each figure communicate an important piece of information or address a question of interest?
2. Do all your figures include plain language axis labels?
3. Is the font size large enough to read?
4. Does every figure have a detailed caption that explains all axes, legends, and trends in the figure?

X Presentations

1. Did you lead with a brief, understandable to everyone statement of your problem?
2. Did you explain the data, measurement technology, and experimental design before you explained your model?
3. Did you explain the features you will use to model data before you explain the model?
4. Did you make sure all legends and axes were legible from the back of the room?

XI Reproducibility

1. Did you avoid doing calculations manually?
2. Did you create a script that reproduces all your analyses?
3. Did you save the raw and processed versions of your data?
4. Did you record all versions of the software you used to process the data?
5. Did you try to have someone else run your analysis code to confirm they got the same answers?

XI R packages

1. Did you make your package name "Googleable"
2. Did you write unit tests for your functions?
3. Did you write help files for all functions?
4. Did you write a vignette?
5. Did you try to reduce dependencies to actively maintained packages?
6. Have you eliminated all errors and warnings from R CMD CHECK?



Navigating Big Data Careers with a Statistics PhD

Editor's note: This is a guest post by Sherri Rose. She is an Assistant Professor of Biostatistics in the Department of Health Care Policy at Harvard Medical School. Her work focuses on nonparametric estimation, causal inference, and machine learning in health settings. Dr. Rose received her BS in statistics from The George Washington University and her PhD in biostatistics from the University of California, Berkeley, where she coauthored a book on Targeted Learning. She tweets @sherrirose.

A quick scan of the science and technology headlines often yields two words: big data. The amount of information we collect has continued to increase, and this data can be found in varied sectors, ranging from social media to genomics. Claims are made that big data will solve an array of problems, from understanding devastating diseases to predicting political outcomes. There is substantial “big data” hype in the press, as well as business and academic communities, but how do upcoming, current, and recent statistical science PhDs handle the array of training opportunities and career paths in this new era? Undergraduate interest in statistics degrees is exploding, bringing new talent to graduate programs and the post-PhD job pipeline.  Statistics training is diversifying, with students focusing on theory, methods, computation, and applications, or a blending of these areas. A few years ago, Rafa outlined the academic career options for statistics PhDs in two posts, which cover great background material I do not repeat here. The landscape for statistics PhD careers is also changing quickly, with a variety of companies attracting top statistics students in new roles.  As a new faculty member at the intersection of machine learning, causal inference, and health care policy, I've already found myself frequently giving career advice to trainees.  The choices have become much more nuanced than just academia vs. industry vs. government.

So, you find yourself inspired by big data problems and fascinated by statistics. While you are a student, figuring out what you enjoy working on is crucial. This exploration could involve engaging in internship opportunities or collaborating with multiple faculty on different types of projects. Both positive and negative experiences can help you identify your preferences.

Undergraduates may wish to spend a couple months at a Summer Institute for Training in Biostatistics or National Science Foundation Research Experience for Undergraduates. There are also many MOOC options to get a taste of different areas ofstatistics. Selecting a graduate program for PhD study can be a difficult choice, especially when your interests within statistics have yet to be identified, as is often the case for undergraduates. However, if you know that you have interests in software and programming, it can be easy to sort which statistical science PhD programs have a curricular or research focus in this area by looking at department websites. Similarly, if you know you want to work in epidemiologic methods, genomics, or imaging, specific programs are going to jump right to the top as good fits. Getting advice from faculty in your department will be important. Competition for admissions into statistics and biostatistics PhD programs has continued to increase, and most faculty advise applying to as many relevant programs as is reasonable given the demands on your time and finances. If you end up sitting on multiple (funded) offers come April, talking to current students, student alums, and looking at alumni placement can be helpful. Don't hesitate to contact these people, selectively. Most PhD programs genuinely do want you to end up in the place that is best for you, even if it is not with them.

Once you're in a PhD program, internship opportunities for graduate students are listed each year by the American Statistical Association. Your home department may also have ties with local research organizations and companies with openings. Internships can help you identify future positions and the types of environments where you will flourish in your career. Lauren Kunz, a recent PhD graduate in biostatistics from Harvard University, is currently a Statistician at the National Heart, Lung, and Blood Institute (NHLBI) of the National Institutes of Health. Dr. Kunz said, "As a previous summer intern at the NHLBI, I was able to get a feel for the day to day life of a biostatistician at the NHLBI. I found the NHLBI Office of Biostatistical Research to be a collegial, welcoming environment, and I soon learned that NHLBI biostatisticians have the opportunity to work on a variety of projects, very often collaborating with scientists and clinicians. Due to the nature of these collaborations, the biostatisticians are frequently presented with scientifically interesting and important statistical problems. This work often motivates methodological research which in turn has immediate, practical applications. These factors matched well with my interest in collaborative research that is both methodological and applied."

Industry is also enticing to statistics PhDs, particularly those with an applied or computational focus, like Stephanie Sapp and Alyssa Frazee. Dr. Sapp has a PhD in statistics from the University of California, Berkeley, and is currently a Quantitative Analyst at Google. She also completed an internship there the summer before she graduated. In commenting about her choice to join Google, Dr. Sapp said,  "I really enjoy both academic research and seeing my work used in practice.  Working at Google allows me to continue pursuing new and interesting research topics, as well as see my results drive more immediate impact."  Dr. Frazee just finished her PhD in biostatistics at Johns Hopkins University and previously spent a summer exploring her interests in Hacker School.  While she applied to both academic and industry positions, receiving multiple offers, she ultimately chose to go into industry and work for Stripe: "I accepted a tech company's offer for many reasons, one of them being that I really like programming and writing code. There are tons of opportunities to grow as a programmer/engineer at a tech company, but building an academic career on that foundation would be more of a challenge. I'm also excited about seeing my statistical work have more immediate impact. At smaller companies, much of the work done there has visible/tangible bearing on the product. Academic research in statistics is operating a lot closer to the boundaries of what we know and discovering a lot of cool stuff, which means researchers get to try out original ideas more often, but the impact is less immediately tangible. A new method or estimator has to go through a lengthy peer review/publication process and be integrated into the community's body of knowledge, which could take several years, before its impact can be fully observed."  One of Dr. Frazee, Dr. Sapp, and Dr. Kunz's considerations in choosing a job reflects many of those in the early career statistics community: having an impact.

Interest in both developing methods and translating statistical advances into practice is a common theme in the big data statistics world, but not one that always leads to an industry or government career. There are also academic opportunities in statistics, biostatistics, and interdisciplinary departments like my own where your work can have an impact on current science.  The Department of Health Care Policy (HCP) at Harvard Medical School has 5 tenure-track/tenured statistics faculty members, including myself, among a total of about 20 core faculty members. The statistics faculty work on a range of theoretical and methodological problems while collaborating with HCP faculty (health economists, clinician researchers, and sociologists) and leading our own substantive projects in health care policy (e.g., Mass-DAC). I find it to be a unique and exciting combination of roles, and love that the science truly informs my statistical research, giving it broader impact. Since joining the department a year and a half ago, I've worked in many new areas, such as plan payment risk adjustment methodology. I have also applied some of my previous work in machine learning to predicting adverse health outcomes in large datasets. Here, I immediately saw a need for new avenues of statistical research to make the optimal approach based on statistical theory align with an optimal approach in practice. My current research portfolio is diverse; example projects include the development of a double robust estimator for the study of chronic disease, leading an evaluation of a new state-wide health plan initiative, and collaborating with department colleagues on statistical issues in all-payer claims databases, physician prescribing intensification behavior, and predicting readmissions. The larger statistics community at Harvard also affords many opportunities to interact with statistics faculty across the campus, and university-wide junior faculty events have connected me with professors in computer science and engineering. I feel an immense sense of research freedom to pursue my interests at HCP, which was a top priority when I was comparing job offers.

Hadley Wickam, of ggplot2 and Advanced R fame, took on a new role as Chief Scientist at RStudio in 2013. Freedom was also a key component in his choice to move sectors: "For me, the driving motivation is freedom: I know what I want to work on, I just need the freedom (and support) to work on it. It's pretty unusual to find an industry job that has more freedom than academia, but I've been noticeably more productive at RStudio because I don't have any meetings, and I can spend large chunks of time devoted to thinking about hard problems. It's not possible for everyone to get that sort of job, but everyone should be thinking about how they can negotiate the freedom to do what makes them happy. I really like the thesis of Cal Newport's book So Good They Can't Ignore You - the better you are at your job, the greater your ability to negotiate for what you want."

There continues to be a strong emphasis in the work force on the vaguely defined field of “data science,” which incorporates the collection, storage, analysis, and interpretation of big data.  Statisticians not only work in and lead teams with other scientists (e.g., clinicians, biologists, computer scientists) to attack big data challenges, but with each other. Your time as a statistics trainee is an amazing opportunity to explore your strengths and preferences, and which sectors and jobs appeal to you. Do your due diligence to figure out which employers are interested in and supportive of the type of career you want to create for yourself. Think about how you want to spend your time, and remember that you're the only person who has to live your life once you get that job. Other people's opinions are great, but your values and instincts matter too. Your definition of "best" doesn't have to match someone else's. Ask questions! Try new things! The potential for breakthroughs with novel flexible methods is strong. Statistical science training has progressed to the point where trainees are armed with thorough knowledge in design, methodology, theory, and, increasingly, data collection, applications, and computation.  Statisticians working in data science are poised to continue making important contributions in all sectors for years to come. Now, you just need to decide where you fit.

The trouble with evaluating anything

It is very hard to evaluate people's productivity or work in any meaningful way. This problem is the source of:

  1. Consternation about peer review
  2. The reason why post publication peer review doesn't work
  3. Consternation about faculty evaluation
  4. Major problems at companies like Yahoo and Microsoft.

Roger and I were just talking about this problem in the context of evaluating the impact of software as a faculty member and Roger suggested the problem is that:

Evaluating people requires real work and so people are always looking for shortcuts

To evaluate a person's work or their productivity requires three things:

  1. To be an expert in what they do
  2. To have absolutely no reason to care whether they succeed or not
  3. To have time available to evaluate them

These three fundamental things are at the heart of why it is so hard to get good evaluations of people and why peer review and other systems are under such fire. The main source of the problem is the conflict between 1 and 2. The group of people in any organization or on any scale that is truly world class at any given topic from software engineering to history is small. It has to be by definition. This group of people inevitably has some reason to care about the success of the other people in that same group. Either they work with the other world class people and want them to succeed or they  either intentionally or unintentionally are competing with them.

The conflict between being and expert and having no say wouldn't be such a problem if it wasn't for issue number 3: the time to evaluate people. To truly get good evaluations what you need is for someone who isn't an expert in a field and so has no stake to take the time to become an expert and then evaluate the person/software. But this requires a huge amount of effort on the part of a reviewer who has to become expert in a new field. Given that reviewing is often considered the least important task in people's workflow, evidenced by the value we put on people acting as peer reviewers for journals, or the value people get for doing a good job in people's evaluation for promotion in companies, it is no wonder people don't take the time to become experts.

I actually think that tenure review committees at forward thinking places may be the best at this (Rafa said the same thing about NIH study section). They at least attempt to get outside reviews from people who are unbiased about the work that a faculty member is doing before they are promoted. This system, of course, has large and well-document problems, but I think it is better than having a person's direct supervisor - who clearly has a stake - being the only person evaluating them.It is also better than only using the quantifiable metrics like number of papers and impact factor of the corresponding journals. I also think that most senior faculty who evaluate people take the job very seriously despite the only incentive being good citizenship.

Since real evaluation requires hard work and expertise, most of the time people are looking for a short cut. These short cuts typically take the form of quantifiable metrics. In the academic world these shortcuts are things like:

  1. Number of papers
  2. Citations to academic papers
  3. The impact factor of a journal
  4. Downloads to a person's software

I think all of these things are associated with quality but none define quality. You could try to model the relationship, but it is very hard to come up with a universal definition for the outcome you are trying to model. In academics, some people have suggested that open review or post-publication review solves the problem. But this is only true for a very small subset of cases that violate rule number 2. The only papers that get serious post-publication review are where people have an incentive for the paper to go one way or the other. This means that papers in Science will be post-pub reviewed much much more often than equally important papers in discipline specific journals - just because people care more about Science. This will leave the vast majority of papers unreviewed - as evidenced by the relatively modest number of papers reviewed by PubPeer or Pubmed Commons.

I'm beginning to think that the only way to do evaluation well is to hire people whose only job is to evaluate something well. In other words, peer reviewers who are paid to review papers full time and are only measured by how often those papers are retracted or proved false. Or tenure reviewers who are paid exclusively to evaluate tenure cases and are measured by how well the post-tenure process goes for the people they evaluate and whether there is any measurable bias in their reviews.

The trouble with evaluating anything is that it is hard work and right now we aren't paying anyone to do it.



Early data on knowledge units - atoms of statistical education

Yesterday I posted about atomizing statistical education into knowledge units. You can try out the first knowledge unit here: The early data is in and it is consistent with many of our hypotheses about the future of online education.


  1. Completion rates are high when segments are shorter
  2. You can learn something about statistics in a short amount of time (2 minutes to complete, many people got all questions right)
  3. People will consume educational material on tablets/smartphones more and more.

Screen Shot 2015-02-05 at 9.34.51 AM



Knowledge units - the atoms of statistical education

Editor's note: This idea is Brian's idea and based on conversations with him and Roger, but I just executed it.

The length of academic courses has traditionally ranged between a few days for a short course to a few months for a semester-long course.  Lectures are typically either 30 minutes or one hour. Term and lecture lengths have been dictated by tradition and the relative inconvenience of coordinating schedules of the instructors and students for shorter periods of time. As classes have moved online the barrier of inconvenience to varying the length of an academic course has been removed. Despite this flexibilty, most academic online courses adhere to the traditional semester-long format. For example, the first massive online open courses were simply semester-long courses directly recorded and offered online.

Data collected from massive online open courses suggest that shrinking both the length of recorded lectures and the length of courses leads to higher student retention. These results line up with data on other online activities such as Youtube video watching or form completion, which also show that shorter activities lead to higher completion rates.

We have  some of the earliest and most highly subscribed massive online open courses through the Coursera platform: Data Analysis, Computing for Data Analysis, and Mathematical Biostatistics Bootcamp. Our original courses were translated from courses we offered locally and were therefore closer to semester long with longer lectures ranging from 15-30 minutes. Based on feedback from our students and the data we observed about completion rates, we made the decision to break our courses down into smaller, one-month courses with no more than two hours of lecture material per week. Since then, we have enrolled more than a million students in our MOOCs.

The data suggest that the shorter you can make an academic unit online, the higher the completion percentage. The question then becomes “How short can you make an online course?” To answer this question requires a definition of a course. For our purposes we will define a course as an educational unit consisting of the following three components:


  • Knowledge delivery - the distribution of educational material through lectures, audiovisual materials, and course notes.
  • Knowledge evaluation - the evaluation of how much of the knowledge delivered to a student is retained.
  • Knowledge certification - an independent claim or representation that a student has learned some set of knowledge.


A typical university class delivers 36 hours = 12 weeks x 3 hours/week of content knowledge, evaluates that knowledge based on the order of 10 homework assignments and 2 tests, and results in a certification equivalent to 3 university credits.With this definition, what is the smallest possible unit that satisfies all three definitions of a course? We will call this smallest possible unit one knowledge unit. The smallest knowledge unit that satisfies all three definitions is a course that:

  • Delivers a single unit of content - We will define a single unit of content as a text, image, or video describing a single concept.
  • Evaluates that single unit of content -  The smallest unit of evaluation possible is a single question to evaluate a student’s knowledge.
  • Certifies knowlege - Provides the student with a statement of successful evaluation of the knowledge in the knowledge unit.

An example of a knowledge unit appears here: The knowledge unit consists of a short (less than 2 minute) video and 3 quiz questions. When completed, the unit sends the completer an email verifying that the quiz has been completed. Just as an atom is the smallest unit of mass that defines a chemical element, the knowledge unit is the smallest unit of education that defines a course.

Shrinking the units down to this scale opens up some ideas about how you can connect them together into courses and credentials. I'll leave that for a future post.