Simply Statistics


Knowledge units - the atoms of statistical education


Editor's note: This idea is Brian's idea and based on conversations with him and Roger, but I just executed it.

The length of academic courses has traditionally ranged from a few days for a short course to a few months for a semester-long course. Lectures are typically either 30 minutes or one hour. Term and lecture lengths have been dictated by tradition and the relative inconvenience of coordinating schedules of the instructors and students for shorter periods of time. As classes have moved online, the barrier of inconvenience to varying the length of an academic course has been removed. Despite this flexibility, most academic online courses adhere to the traditional semester-long format. For example, the first massive online open courses were simply semester-long courses directly recorded and offered online.

Data collected from massive online open courses suggest that shrinking both the length of recorded lectures and the length of courses leads to higher student retention. These results line up with data on other online activities such as Youtube video watching or form completion, which also show that shorter activities lead to higher completion rates.

We have  some of the earliest and most highly subscribed massive online open courses through the Coursera platform: Data Analysis, Computing for Data Analysis, and Mathematical Biostatistics Bootcamp. Our original courses were translated from courses we offered locally and were therefore closer to semester long with longer lectures ranging from 15-30 minutes. Based on feedback from our students and the data we observed about completion rates, we made the decision to break our courses down into smaller, one-month courses with no more than two hours of lecture material per week. Since then, we have enrolled more than a million students in our MOOCs.

The data suggest that the shorter you can make an academic unit online, the higher the completion percentage. The question then becomes “How short can you make an online course?” To answer this question requires a definition of a course. For our purposes we will define a course as an educational unit consisting of the following three components:


  • Knowledge delivery - the distribution of educational material through lectures, audiovisual materials, and course notes.
  • Knowledge evaluation - the evaluation of how much of the knowledge delivered to a student is retained.
  • Knowledge certification - an independent claim or representation that a student has learned some set of knowledge.


A typical university class delivers 36 hours = 12 weeks x 3 hours/week of content knowledge, evaluates that knowledge through on the order of 10 homework assignments and 2 tests, and results in a certification equivalent to 3 university credits. With this definition, what is the smallest possible unit that satisfies all three definitions of a course? We will call this smallest possible unit one knowledge unit. The smallest knowledge unit that satisfies all three definitions is a course that:

  • Delivers a single unit of content - We will define a single unit of content as a text, image, or video describing a single concept.
  • Evaluates that single unit of content -  The smallest unit of evaluation possible is a single question to evaluate a student’s knowledge.
  • Certifies knowledge - Provides the student with a statement of successful evaluation of the knowledge in the knowledge unit.

An example of a knowledge unit appears here: The knowledge unit consists of a short (less than 2 minute) video and 3 quiz questions. When completed, the unit sends the completer an email verifying that the quiz has been completed. Just as an atom is the smallest unit of mass that defines a chemical element, the knowledge unit is the smallest unit of education that defines a course.

Shrinking the units down to this scale opens up some ideas about how you can connect them together into courses and credentials. I'll leave that for a future post.


Precision medicine may never be very precise - but it may be good for public health


Editor's note: This post was originally titled: Personalized medicine is primarily a population health intervention. It has been updated with the graph of odds ratios/betas from GWAS studies.

There has been a lot of discussion of personalized medicine, individualized health, and precision medicine in the news and in the medical research community, and President Obama just announced a brand new initiative in precision medicine. Despite this recent attention, it is clear that healthcare has always been personalized to some extent. For example, men are rarely pregnant and heart attacks occur more often among older patients. In these cases, easily collected variables such as sex and age can be used to predict health outcomes and therefore used to "personalize" healthcare for those individuals.

So why the recent excitement around personalized medicine? The reason is that it is increasingly cheap and easy to collect more precise measurements about patients that might be able to predict their health outcomes. An example that has recently been in the news is the measurement of mutations in the BRCA genes. Angelina Jolie made the decision to undergo a prophylactic double mastectomy based on her family history of breast cancer and measurements of mutations in her BRCA genes. Based on these measurements, previous studies had suggested she might have a lifetime risk as high as 80% of developing breast cancer.

This kind of scenario will become increasingly common as newer and more accurate genomic screening and predictive tests are used in medical practice. When I read these stories there are two points I think of that sometimes get obscured by the obviously fraught emotional, physical, and economic considerations involved with making decisions on the basis of new measurement technologies:

  1. In individualized health/personalized medicine the "treatment" is information about risk. In some cases treatment will be personalized based on assays. But in many other cases, we still do not (and likely will not) have perfect predictors of therapeutic response. In those cases, the healthcare will be "personalized" in the sense that the patient will get more precise estimates of their likelihood of survival, recurrence etc. This means that patients and physicians will increasingly need to think about/make decisions with/act on information about risks. But communicating and acting on risk is a notoriously challenging problem; personalized medicine will dramatically raise the importance of understanding uncertainty.
  2. Individualized health/personalized medicine is a population-level treatment. Assuming that the 80% lifetime risk estimate was correct for Angelina Jolie, it still means there is a 1 in 5 chance she was never going to develop breast cancer. If that had been her case, then the surgery was unnecessary. So while her decision was based on personal information, there is still uncertainty in that decision for her. So the "personal" decision may not always be the "best" decision for any specific individual. It may however, be the best thing to do for everyone in a population with the same characteristics.

The first point bears serious consideration in light of President Obama's new proposal. We have already collected a massive amount of genetic data about a large number of common diseases. In almost all cases, the amount of predictive information that we can glean from genetic studies is modest. One paper pointed this issue out in a rather snarky way by comparing two approaches to predicting people's heights: (1) averaging their parents' heights - an approach from the Victorian era and (2) combining the latest information on the best genetic markers at the time. It turns out, all the genetic information we gathered isn't as good as averaging parents' heights. Another way to see this is to download data on all genetic variants associated with disease from the GWAS catalog that have a P-value less than 1 x 10^-8. If you do that and look at the distribution of effect sizes, you see that 95% have an odds ratio or beta coefficient less than about 4. Here is a histogram of the effect sizes:





This means that nearly all identified genetic effects are small. The ones that are really large (effect size greater than 100) are not for common disease outcomes; they are for Birdshot chorioretinopathy and hippocampal volume. You can really see this if you zoom the plot on the x-axis and look at the bulk of the distribution of effect sizes, which are mostly less than 2:





These effect sizes translate into very limited predictive capacity for most identified genetic biomarkers.  The implication is that personalized medicine, at least for common diseases, is highly likely to be inaccurate for any individual person. But if we can take advantage of the population-level improvements in health from precision medicine by increasing risk literacy, improving our use of uncertain markers, and understanding that precision medicine isn't precise for any one person, it could be a really big deal.
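The filter-and-summarize step described above (keep associations below a P-value threshold, then look at an upper quantile of the effect sizes) can be sketched in Python. This is only an illustration: the records below are made-up values, not GWAS catalog data, and the field names `p_value` and `effect` are hypothetical.

```python
import math

# Made-up illustrative records (NOT real GWAS catalog entries).
# "effect" stands in for an odds ratio or beta coefficient.
records = [
    {"p_value": 3e-12, "effect": 1.15},
    {"p_value": 5e-9, "effect": 1.30},
    {"p_value": 2e-7, "effect": 2.50},  # fails the threshold
    {"p_value": 1e-20, "effect": 1.08},
    {"p_value": 4e-10, "effect": 3.90},
    {"p_value": 6e-6, "effect": 9.00},  # fails the threshold
]

P_THRESHOLD = 1e-8  # significance cutoff used for the filter


def significant_effects(rows, threshold=P_THRESHOLD):
    """Return the sorted effect sizes of rows with p_value below threshold."""
    return sorted(r["effect"] for r in rows if r["p_value"] < threshold)


def quantile(sorted_values, q):
    """Nearest-rank quantile of an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("need at least one value")
    rank = max(1, math.ceil(q * len(sorted_values)))
    return sorted_values[rank - 1]


effects = significant_effects(records)
print("effects passing the filter:", effects)
print("95th percentile:", quantile(effects, 0.95))
```

With real catalog data the same two steps apply; only the loading code and the column names would differ.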


Reproducible Research Course Companion


I'm happy to announce that you can now get a copy of the Reproducible Research Course Companion from the Apple iBookstore. The purpose of this e-book is pretty simple. The book provides all of the key video lectures from my Reproducible Research course offered on Coursera, in a simple offline e-book format. The book can be viewed on a Mac, iPad, or iPad mini.

If you're interested in taking my Reproducible Research course on Coursera and would like a flavor of what the course will be like, then you can view the lectures through the book (the free sample contains three lectures). On the other hand, if you already took the course and would like access to the lecture material afterwards, then this might be a useful add-on. If you are currently enrolled in the course, then this could be a handy way for you to take the lectures on the road with you.

Please note that all of the lectures are still available for free on YouTube via my YouTube channel. Also, the book provides content only. If you wish to actually complete the course, you must take it through the Coursera web site.


Data as an antidote to aggressive overconfidence


A recent NY Times op-ed reminded us of the many biases faced by women at work. A followup op-ed gave specific recommendations for how to conduct ourselves in meetings. In general, I found these very insightful, but I don't necessarily agree with the recommendation that women should "Practice Assertive Body Language". Instead, we should make an effort to judge ideas by their content and not be impressed by body language. More generally, it is a problem that many of the characteristics that help advance careers contribute nothing to intellectual output. One of these is what I call aggressive overconfidence.

Here is an example (based on a true story). A data scientist finds a major flaw with the data analysis performed by a prominent data-producing scientist's lab. Both are part of a large collaborative project. A meeting is held among the project leaders to discuss the disagreement. The data producer is very self-confident in defending his approach. The data scientist, who is not nearly as aggressive, is interrupted so much that she barely gets her point across. The project leaders decide that this seems to be simply a difference of opinion and, for all practical purposes, ignore the data scientist. I imagine this story sounds familiar to many. While in many situations this story ends here, when the results are data driven we can actually fact check opinions that are pronounced as fact. In this example, the data is public and anybody with the right expertise can download the data and corroborate the flaw in the analysis. This is typically quite tedious, but it can be done. Because the key flaws are rather complex, the project leaders, lacking expertise in data analysis, can't make this determination. But eventually, a chorus of fellow data analysts will be too loud to ignore.

That aggressive overconfidence is generally rewarded in academia is a problem. And if this trait is highly correlated with being male, then a manifestation of this is a worsened gender gap. My experience (including reading internet discussions among scientists on controversial topics) has convinced me that this trait is in fact correlated with gender. But the solution is not to help women become more aggressively overconfident. Instead we should continue to strive to judge work based on content rather than style. I am optimistic that more and more, data, rather than who sounds more sure of themselves, will help us decide who wins a debate.



Gorging ourselves on "free" health care: Harvard's dilemma


Editor's note: This is a guest post by Laura Hatfield. Laura is an Assistant Professor of Health Care Policy at Harvard Medical School, with a specialty in Biostatistics. Her work focuses on understanding trade-offs and relationships among health outcomes. Dr. Hatfield received her BS in genetics from Iowa State University and her PhD in biostatistics from the University of Minnesota. She tweets @bioannie

I didn’t imagine when I joined Harvard’s Department of Health Care Policy that the New York Times would be writing about my benefits package. Then a vocal and aggrieved group of faculty rebelled against health benefits changes for 2015, and commentators responded by gleefully skewering entitled-sounding Harvard professors. But I’m a statistician, so I want to talk data.

Health care spending is tremendously right-skewed. The figure below shows the annual spending distribution among people with any spending (~80% of the total population) in two data sources on people covered by employer-sponsored insurance, such as the Harvard faculty. Notice that the y axis is on the log scale. More than half of people spend $1000 or less, but a few very unfortunate folks top out near half a million.


Source: Measuring health care costs of individuals with employer-sponsored health insurance in the US: A comparison of survey and claims data. A. Aizcorbe, E. Liebman, S. Pack, D.M. Cutler, M.E. Chernew, A.B. Rosen. BEA working paper. WP2010-06. June 2010.

If, instead of contributing to my premiums, Harvard gave me the $1000/month premium contribution in the form of wages, I would be on the hook for my own health care expenses. If I stay healthy, I pocket the money, minus income taxes. If I get sick, I have the extra money available to cover the expenses…provided I’m not one of the unlucky 10% of people spending more than $12,000/year. In that case, the additional wages would be insufficient to cover my health care expenses. This “every woman for herself” system lacks the key benefit of insurance: risk pooling. The sickest among us would be bankrupted by health costs. Another good reason for an employer to give me benefits is that I do not pay taxes on this part of my compensation (more on that later).

At the opposite end of the spectrum is the Harvard faculty health insurance plan. Last year, the university paid ~$1030/month toward my premium and I put in ~$425 (tax-free). In exchange for this ~$17,000 of premiums, my family got first-dollar insurance coverage with very low co-pays. Faculty contributions to our collective health care expenses were distributed fairly evenly among all of us, with only minimal cost sharing to reflect how much care each person consumed. The sickest among us were in no financial peril. My family didn’t use much care and thus didn’t get our (or Harvard’s) money’s worth for all that coverage, but I’m ok with it. I still prefer risk pooling.
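The arithmetic behind these two extremes can be checked with a short Python sketch. The dollar figures are the approximate ones quoted above; the function names are hypothetical, and income taxes on the extra wages are ignored as a simplifying assumption.

```python
EMPLOYER_MONTHLY = 1030  # approximate university contribution quoted above
EMPLOYEE_MONTHLY = 425   # approximate faculty contribution quoted above


def annual_premium(employer_monthly, employee_monthly):
    """Total yearly premium paid by employer and employee together."""
    return 12 * (employer_monthly + employee_monthly)


def wage_shortfall(annual_spending, extra_monthly_wages=1000):
    """Spending left uncovered if premiums were paid out as wages instead.

    Assumption: taxes are ignored, so all extra wages can go to health care.
    """
    return max(0, annual_spending - 12 * extra_monthly_wages)


print(annual_premium(EMPLOYER_MONTHLY, EMPLOYEE_MONTHLY))  # roughly $17,000
print(wage_shortfall(1_000))    # a typical spender keeps most of the wages
print(wage_shortfall(500_000))  # a catastrophic year is almost entirely uncovered
```

The second function is the "every woman for herself" scenario: anyone spending more than $12,000 in a year comes up short, and the shortfall grows without limit.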

Here’s the problem: moral hazard. It’s a word I learned when I started hanging out with health economists. It describes the tendency of people to over-consume goods that feel free, such as health care paid through premiums or desserts at an all-you-can-eat buffet. Just look at this array—how much cake do *you* want to eat for $9.99?




One way to mitigate moral hazard is to expose people to more of their cost of care at the point of service instead of through premiums. You might think twice about that fifth tiny cake if you were paying per morsel. This is what the new Harvard faculty plans do: our premiums actually go down, but now we face a modest deductible, $250 per person or $750 max for a family. This is meant to encourage faculty to use their health care more efficiently, but it still affords good protection against catastrophic costs. The out-of-pocket max remains low at $1500 per individual or $4500 per family, with recent announcements to further protect individuals who pay more than 3% of salary in out-of-pocket health costs through a reimbursement program.
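How a deductible and an out-of-pocket maximum interact can be sketched as follows. The $250 deductible and $1500 individual maximum come from the paragraph above; the 10% coinsurance rate is a hypothetical value chosen only for illustration, and the salary-based reimbursement program is not modeled.

```python
def out_of_pocket(spending, deductible=250, coinsurance=0.10, oop_max=1500):
    """Individual's yearly cost under a simple deductible-plus-coinsurance plan.

    coinsurance=0.10 is an illustrative assumption, not a figure from the post.
    """
    if spending < 0:
        raise ValueError("spending must be non-negative")
    # The patient pays everything up to the deductible...
    below_deductible = min(spending, deductible)
    # ...then a share of the rest...
    above_deductible = max(0, spending - deductible) * coinsurance
    # ...but never more than the out-of-pocket maximum.
    return min(below_deductible + above_deductible, oop_max)


for amount in (100, 1_250, 500_000):
    print(amount, out_of_pocket(amount))
```

Low spenders now pay their full cost at the point of service, while the cap keeps a catastrophic year from costing more than the out-of-pocket maximum.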

The allocation of individuals’ contributions between premiums and point-of-service costs is partly a question of how we cross-subsidize each other. If Harvard’s total contribution remains the same and health care costs do not grow faster than wages (ha!), then increased cost sharing decreases the amount by which people who use less care subsidize those who use more. How you feel about the “right” level of cost sharing may depend on whether you’re paying or receiving a subsidy from your fellow employees. And maybe your political leanings.

What about the argument that it is better for an employer to “pay” workers by health insurance premium contributions rather than wages because of the tax benefits? While we might prefer to get our compensation in the form of tax-free health benefits vs taxed wages, the university, like all employers, is looking ahead to the Cadillac tax provision of the ACA. So they have to do some re-balancing of our overall compensation. If Harvard reduces its health insurance contributions to avoid the tax, we might reasonably expect to make up that difference in higher wages. The empirical evidence is complicated and suggests that employers may not immediately return savings on health benefits dollar-for-dollar in the form of wages.

As far as I can tell, Harvard is contributing roughly the same amount as last year toward my health benefits, but exact numbers are difficult to find. I switched plan types (into a high-deductible plan, but that’s a topic for another post!), so I can’t find and directly compare Harvard’s contributions in the same plan type this year and last. Peter Ubel argues that if the faculty *had* seen these figures, we might not have revolted. The actuarial value of our plans remains very high (91%, just a bit better than the expensive Platinum plans on the exchanges) and Harvard’s spending on health care has grown from 8% to 12% of the university’s budget over the past few years. Would these data have been sufficient to quell the insurrection? Good question.


If you were going to write a paper about the false discovery rate you should have done it in 2002


People often talk about academic superstars as people who have written highly cited papers. Some of that has to do with people's genius, or ability, or whatever. But one factor that I think sometimes gets lost is luck and timing. So I wrote a little script to get the first 30 papers that appear when you search Google Scholar for the terms:

  • empirical processes
  • proportional hazards model
  • generalized linear model
  • semiparametric
  • generalized estimating equation
  • false discovery rate
  • microarray statistics
  • lasso shrinkage
  • rna-seq statistics

Google Scholar sorts by relevance, but that relevance is driven to a large degree by citations. For example, if you look at the first 10 papers you get when searching for false discovery rate, you get:

  • Controlling the false discovery rate: a practical and powerful approach to multiple testing
  • Thresholding of statistical maps in functional neuroimaging using the false discovery rate
  • The control of the false discovery rate in multiple testing under dependency
  • Controlling the false discovery rate in behavior genetics research
  • Identifying differentially expressed genes using false discovery rate controlling procedures
  • The positive false discovery rate: A Bayesian interpretation and the q-value
  • On the adaptive control of the false discovery rate in multiple testing with independent statistics
  • Implementing false discovery rate control: increasing your power
  • Operating characteristics and extensions of the false discovery rate procedure
  • Adaptive linear step-up procedures that control the false discovery rate

People who work in this area will recognize that many of these papers are the most important/most cited in the field.

Now we can make a plot that shows for each term when these 30 highest ranked papers appear. There are some missing values, because of the way the data are scraped, but this plot gives you some idea of when the most cited papers on these topics were published:



You can see from the plot that the median publication year of the top 30 hits for "empirical processes" was 1990 and for "RNA-seq statistics" was 2010. The medians for the other topics were:

  • Emp. Proc. 1990.241
  • Prop. Haz. 1990.929
  • GLM 1994.433
  • Semi-param. 1994.433
  • GEE 2000.379
  • FDR 2002.760
  • microarray 2003.600
  • lasso 2004.900
  • rna-seq 2010.765

I think this pretty much matches up with the intuition most people have about the relative timing of fields, with a few exceptions (GEE in particular seems a bit late). There are a bunch of reasons this analysis isn't perfect, but it does suggest that luck and timing in choosing a problem can play a major role in the "success" of academic work as measured by citations. It also suggests a reason for success in science other than individual brilliance. Given the potentially negative consequences the expectation of brilliance has on certain subgroups, it is important to recognize the importance of timing and luck. The median publication year of the most cited "false discovery rate" papers was 2002, but almost none of the 30 top hits were published after about 2008.

The code for my analysis is here. It is super hacky so have mercy.
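For readers who just want the gist of the summary step, here is a minimal Python sketch of taking the median publication year per search term while skipping missing values. It is not the script linked above, and the years below are made-up placeholders rather than the scraped results.

```python
from statistics import median

# Hypothetical scraped publication years; None marks a missing value.
years_by_term = {
    "false discovery rate": [1995, 2002, 2001, None, 2003, 2006],
    "rna-seq statistics": [2010, 2009, None, 2012, 2011],
}


def median_year(years):
    """Median of the known years, or None if every value is missing."""
    known = [y for y in years if y is not None]
    if not known:
        return None
    return median(known)


medians = {term: median_year(years) for term, years in years_by_term.items()}
for term, value in medians.items():
    print(term, value)
```

Dropping missing values before taking the median is one simple choice; it assumes the years that failed to scrape are not systematically older or newer than the rest.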


How to find the science paper behind a headline when the link is missing


I just saw a pretty wild statistic on Twitter that less than 60% of university news releases link to the papers they are describing.


Before you believe anything you read about science in the news, you need to go and find the original article. When the article isn't linked in the press release, sometimes you need to do a bit of sleuthing. Here is an example of how I do it for a news article. In general the press-release approach is very similar, but you skip the first step I describe below.

Here is the news article (link):





Step 1: Look for a link to the article

Usually it will be linked near the top or the bottom of the article. In this case, the article links to the press release about the paper. This is not the original research article. If you don't get to a scientific journal you aren't finished. In this case, the press release actually gives the full title of the article, but that will happen less than 60% of the time according to the statistic above.


Step 2: Look for names of the authors, scientific key words and journal name if available

You are going to use these terms to search in a minute. In this case the only two things we have are the journal name:


And some key words:




Step 3: Use Google Scholar

You could just google those words and sometimes you get the real paper, but often you just end up back at the press release/news article. So instead the best way to find the article is to go to Google Scholar then click on the little triangle next to the search box.





Fill in information while you can. Fill in the same year as the press release, information about the journal, university and key words.




Step 4: Victory

Often this will come up with the article you are looking for:



Unfortunately, the article may be paywalled, so if you don't work at a university or institute with a subscription, you can always tweet the article name with the hashtag #icanhazpdf and your contact info. Then you just have to hope that someone will send it to you (they often do).




Statistics and R for the Life Sciences: New HarvardX course starts January 19


The first course of our Biomedical Data Science online curriculum starts next week. You can sign up here. Instead of relying on mathematical formulas to teach statistical concepts, students can program along as we show computer code for simulations that illustrate the main ideas of exploratory data analysis and statistical inference (p-values, confidence intervals and power calculations, for example). By doing this, students will learn Statistics and R simultaneously and will not be bogged down by having to memorize formulas. We have three types of learning modules: lectures (see picture below), screencasts and assessments. After each video students will have the opportunity to assess their understanding through homeworks involving coding in R. A big improvement over the first version is that we have added dozens of assessments.

Note that this course is the first in an eight-part series on Data Analysis for Genomics. Updates will be provided via Twitter @rafalab.
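To give a flavor of what teaching inference by simulation can look like, here is a minimal sketch that estimates how often a 95% confidence interval covers the true mean. The course itself uses R; this example is in Python, is not taken from the course materials, and its sample size, number of trials, and normal-approximation interval are all illustrative assumptions.

```python
import random
from statistics import NormalDist, mean, stdev


def ci_covers(rng, true_mean=0.0, sd=1.0, n=50, level=0.95):
    """Draw one normal sample and report whether its CI covers true_mean."""
    sample = [rng.gauss(true_mean, sd) for _ in range(n)]
    # Normal-approximation interval: mean +/- z * standard error.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(sample) / n ** 0.5
    center = mean(sample)
    return center - half_width <= true_mean <= center + half_width


def coverage(trials=2000, seed=1):
    """Fraction of simulated intervals that contain the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(ci_covers(rng) for _ in range(trials))
    return hits / trials


print("estimated coverage:", coverage())
```

The estimated coverage should land near the nominal 95%, which makes the meaning of a confidence interval concrete without any formula memorization.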




Beast mode parenting as shown by my Fitbit data


This weekend was one of those hardcore parenting weekends that any parent of little kids will understand. We were up and actively taking care of kids for a huge fraction of the weekend. (Un)fortunately I was wearing my Fitbit, so I can quantify exactly how little we were sleeping over the weekend.

Here is Saturday:




There you can see that I rocked about midnight-4am without running around chasing a kid or bouncing one to sleep. But Sunday was the real winner:



Check that out. I was totally asleep from like 4am-6am there. Nice.

Stay tuned for much more from my Fitbit data over the next few weeks.




Sunday data/statistics link roundup (1/4/15)

  1. I am digging this visualization of your life in weeks. I might have to go so far as to actually make one for myself.
  2. I'm very excited about the new podcast TalkingMachines and what an awesome name! I wish someone would do that same thing for applied statistics (Roger?)
  3. I love that they call Ben Goldacre the anti-Dr. Oz in this piece, especially given how often Dr. Oz is telling the truth.
  4. If you haven't read it yet, this piece in the Economist on statisticians during the war is really good.
  5. The arXiv celebrated its 1M paper upload. It costs less to run than the top 2 executives at PLoS make. It is too darn expensive to publish open access right now.