Simply Statistics


Data as an antidote to aggressive overconfidence

A recent NY Times op-ed reminded us of the many biases faced by women at work. A followup op-ed  gave specific recommendations for how to conduct ourselves in meetingsIn general, I found these very insightful, but don't necessarily agree with the recommendations that women should "Practice Assertive Body Language".  Instead, we should make an effort to judge ideas by their content and not be impressed by body language. More generally, it is a problem that many of the characteristics that help advance careers contribute nothing to intellectual output. One of these is what I call aggressive overconfidence.

Here is an example (based on a true story). A data scientist finds a major flaw with the data analysis performed by a prominent data-producing scientist's lab. Both are part of a large collaborative project. A meeting is held among the project leaders to discuss the disagreement. The data producer is very self-confident in defending his approach. The data scientist, who in not nearly as aggressive, is interrupted so much that she barely gets her point across. The project leaders decide that this seems to be simply a difference of opinion and, for all practical purposes, ignore the data scientist. I imagine this story sounds familiar to many. While in many situations this story ends here, when the results are data driven we can actually fact check opinions that are pronounced as fact. In this example, the data is public and anybody with the right expertise can download the data and corroborate the flaw in the analysis. This is typically quite tedious, but it can be done. Because the key flaws are rather complex, the project leaders, lacking expertise in data analysis, can't make this determination. But eventually, a chorus of fellow data analysts will be too loud to ignore.

That aggressive overconfidence is generally rewarded in academia is a problem. And if this trait is highly correlated with being male, then a manifestation of this is a worsened gender gap. My experience (including reading internet discussions among scientists on controversial topics) has convinced me that this trait is in fact correlated with gender. But the solution is not to help women become more aggressively overconfident. Instead we should continue to strive to judge work based on content rather than style. I am optimistic that more and more, data, rather than who sounds more sure of themselves, will help us decide who wins a debate.



Gorging ourselves on "free" health care: Harvard's dilemma

Editor's note: This is a guest post by Laura Hatfield. Laura is an Assistant Professor of Health Care Policy at Harvard Medical School, with a specialty in Biostatistics. Her work focuses on understanding trade-offs and relationships among health outcomes. Dr. Hatfield received her BS in genetics from Iowa State University and her PhD in biostatistics from the University of Minnesota. She tweets @bioannie

I didn’t imagine when I joined Harvard’s Department of Health Care Policy that the New York Times would be writing about my benefits package. Then a vocal and aggrieved group of faculty rebelled against health benefits changes for 2015, and commentators responded by gleefully skewering entitled-sounding Harvard professors. But I’m a statistician, so I want to talk data.

Health care spending is tremendously right-skewed. The figure below shows the annual spending distribution among people with any spending (~80% of the total population) in two data sources on people covered by employer-sponsored insurance, such as the Harvard faculty. Notice that the y axis is on the log scale. More than half of people spend $1000 or less, but a few very unfortunate folks top out near half a million.


Source: Measuring health care costs of individuals with employer-sponsored health insurance in the US: A comparison of survey and claims data. A. Aizcorbe, E. Liebman, S. Pack, D.M. Cutler, M.E. Chernew, A.B. Rosen. BEA working paper. WP2010-06. June 2010.

If instead of contributing to my premiums, Harvard instead gave me the $1000/month premium contribution in the form of wages, I would be on the hook for my own health care expenses. If I stay healthy, I pocket the money, minus income taxes. If I get sick, I have the extra money available to cover the expenses…provided I’m not one of the unlucky 10% of people spending more than $12,000/year. In that case, the additional wages would be insufficient to cover my health care expenses. This “every woman for herself” system lacks the key benefit of insurance: risk pooling. The sickest among us would be bankrupted by health costs. Another good reason for an employer to give me benefits is that I do not pay taxes on this part of my compensation (more on that later).

At the opposite end of the spectrum is the Harvard faculty health insurance plan. Last year, the university paid ~$1030/month toward my premium and I put in ~$425 (tax-free). In exchange for this ~$17,000 of premiums, my family got first-dollar insurance coverage with very low co-pays. Faculty contributions to our collective expenses health care were distributed fairly evenly among all of us, with only minimal cost sharing to reflect how much care each person consumed. The sickest among us were in no financial peril. My family didn’t use much care and thus didn’t get our (or Harvard’s) money’s worth for all that coverage, but I’m ok with it. I still prefer risk pooling.

Here’s the problem: moral hazard. It’s a word I learned when I started hanging out with health economists. It describes the tendency of people to over-consume goods that feel free, such as health care paid through premiums or desserts at an all-you-can-eat buffet. Just look at this array—how much cake do *you* want to eat for $9.99?




One way to mitigate moral hazard is to expose people to more of their cost of care at the point of service instead of through premiums. You might think twice about that fifth tiny cake if you were paying per morsel. This is what the new Harvard faculty plans do: our premiums actually go down, but now we face a modest deductible, $250 per person or $750 max for a family. This is meant to encourage faculty to use their health care more efficiently, but it still affords good protection against catastrophic costs. The out-of-pocket max remains low at $1500 per individual or $4500 per family, with recent announcements to further protect individuals who pay more than 3% of salary in out-of-pocket health costs through a reimbursement program.

The allocation of individuals’ contributions between premiums and point-of-service costs is partly a question of how we cross-subsidize each other. If Harvard’s total contribution remains the same and health care costs do not grow faster than wages (ha!), then increased cost sharing decreases the amount by which people who use less care subsidize those who use more. How you feel about the “right” level of cost sharing may depend on whether you’re paying or receiving a subsidy from your fellow employees. And maybe your political leanings.

What about the argument that it is better for an employer to “pay” workers by health insurance premium contributions rather than wages because of the tax benefits? While we might prefer to get our compensation in the form of tax-free health benefits vs taxed wages, the university, like all employers, is looking ahead to the Cadillac tax provision of the ACA. So they have to do some re-balancing of our overall compensation. If Harvard reduces its health insurance contributions to avoid the tax, we might reasonably expect to make up that difference in higher wages. The empirical evidence is complicated and suggests that employers may not immediately return savings on health benefits dollar-for-dollar in the form of wages.

As far as I can tell, Harvard is contributing roughly the same amount as last year toward my health benefits, but exact numbers are difficult to find. I switched plan types\footnote{into a high-deductible plan, but that’s a topic for another post!}, so I can’t find and directly compare Harvard’s contributions in the same plan type this year and last. Peter Ubel argues that if the faculty *had* seen these figures, we might not have revolted. The actuarial value of our plans remains very high (91%, just a bit better than the expensive Platinum plans on the exchanges) and Harvard’s spending on health care has grown from 8% to 12% of the university’s budget over the past few years. Would these data have been sufficient to quell the insurrection? Good question.


Statistics and R for the Life Sciences: New HarvardX course starts January 19

The first course of our Biomedical Data Science online curriculum
starts next week. You can sign up here. Instead of relying on
mathematical formulas to teach statistical concepts, students can
program along as we show computer code for simulations that illustrate
the main ideas of exploratory data analysis and statistical inference
(p-values, confidence intervals and power calculations for example).
By doing this, students will learn Statistics and R simultaneously and
will not be bogged down by having to memorize formulas. We have three types of learning modules: lectures (see picture below), screencasts and assessments. After each
video students will have the opportunity to assess their understanding
through homeworks involving coding in R. A big improvement over the
first version is that we have added dozens of assessment.

Note that this course is the first in an eight part series on Data Analysis for Genomics. Updates will be provided via twitter @rafalab.




On how meetings and conference calls are disruptive to a data scientist

Editor's note: The week of Xmas eve is usually my most productive of the year. This is because there is reduced emails and 0 meetings (I do take a break, but after this great week for work). Here is a repost of one of our first entries explaining how meetings and conference calls are particularly disruptive in data science. 

In this TED talk Jason Fried explains why work doesn't happen at work. He describes the evils of meetings. Meetings are particularly disruptive for applied statisticians, especially for those of us that hack data files, explore data for systematic errors, get inspiration from visual inspection, and thoroughly test our code. Why? Before I become productive I go through a ramp-up/boot-up stage. Scripts need to be found, data loaded into memory, and most importantly, my brains needs to re-familiarize itself with the data and the essence of the problem at hand. I need a similar ramp up for writing as well. It usually takes me between 15 to 60 minutes before I am in full-productivity mode. But once I am in “the zone”, I become very focused and I can stay in this mode for hours. There is nothing worse than interrupting this state of mind to go to a meeting. I lose much more than the hour I spend at the meeting. A short way to explain this is that having 10 separate hours to work is basically nothing, while having 10 hours in the zone is when I get stuff done.

Of course not all meetings are a waste of time. Academic leaders and administrators need to consult and get advice before making important decisions. I find lab meetings very stimulating and, generally, productive: we unstick the stuck and realign the derailed. But before you go and set up a standing meeting consider this calculation: a weekly one hour meeting with 20 people translates into 1 hour x 20 people x 52 weeks/year = 1040 person hours of potentially lost production per year. Assuming 40 hour weeks, that translates into six months. How many grants, papers, and lectures can we produce in six months? And this does not take into account the non-linear effect described above. Jason Fried suggest you cancel your next meeting, notice that nothing bad happens and enjoy the extra hour of work.

I know many others that are like me in this regard and for you I have these recommendations: 1- avoid unnecessary meetings, especially if you are already in full-productivity mode. Don’t be afraid to use this as an excuse to cancel.  If you are in a soft $ institution, remember who pays your salary.  2- Try to bunch all the necessary meetings all together into one day. 3- Separate at least one day a week to stay home and work for 10 hours straight. Jason Fried also recommends that every work place declare a day in which no one talks. No meetings, no chit-chat, no friendly banter, etc… No talk Thursdays anyone?


Kobe, data says stop blaming your teammates

This year, Kobe leads the league in missed shots (by a lot), has an abysmal FG% of 39 and his team plays better when he is on the bench. Yet he blames his teammates for the Lakers' 6-16 record. Below is a plot showing that 2014 is not the first time the Lakers are mediocre during Kobe's tenure. It shows the percentage points above .500 per season with the Shaq and twin towers eras highlighted. I include the same plot for Lebron as a control.


So stop blaming your teammates!

And here is my hastily written code (don't judge me!).




Genéticamente, no hay tal cosa como la raza puertorriqueña

Editor's note: Last week the Latin American media picked up a blog post with the eye-catching title "The perfect human is Puerto Rican". More attention appears to have been given to the title than the post itself. The coverage and comments on social media have demonstrated the need for scientific education on the topic of genetics and race. Here I will try to explain, in layman's terms, why the interpretations I read in the main Puerto Rican paper is scientifically incorrect and somewhat concerning. The post is in Spanish.

En un artículo reciente titulado “Ser humano perfecto sería puertorriqueño", El Nuevo Día resumió una entrada en el blog (erróneamente llamado un estudio) del matemático Lior Pachter. El autor del blog, intentando ridiculizar comentarios racistas que escuchó decir a James Watson, describe un experimento mental en el cual encuentra que el humano “perfecto” (las comilla son importantes), de existir, pertenecería a un grupo genéticamente mezclado. De las personas estudiadas,  la más genéticamente cercana a su humano “perfecto” resultó ser una mujer puertorriqueña. La motivación de este ejercicio era ridiculizar la idea de que una raza puede ser superior a otra. El Nuevo Día parece no captar este punto y nos dice que “el experto concluyó que en todo caso no es de sorprenderse que la persona más cercana a tal perfección sería una puertorriqueña, debido a la combinación de buenos genes que tiene la raza puertorriqueña.” Aquí describo por qué esta interpretación es científicamente errada.

¿Qué es el genoma?
El genoma humano codifica (en moléculas de ADN) la información genética necesaria para nuestro desarrollo biológico. Podemos pensar en el genoma como dos series de 3,000,000,000 letras (A, T, C o G) concatenadas. Una la recibimos de nuestro padre y la otra de nuestra madre. Distintos pedazos (los genes) codifican proteínas necesarias para las miles de funciones que cumplen nuestras células y que conllevan a algunas de nuestras características físicas. Con unas pocas excepciones, todas las células en nuestro cuerpo contienen una copia exacta de estas dos series de letras. El esperma y el huevo tienen sólo una serie de letras, una mezcla de las otras dos. Cuando se unen el esperma y el huevo, la nueva célula, el cigoto, une las dos series y es así que heredamos características de cada progenitor.

¿Qué es la variación genética?
Si todos venimos del primer humano,¿cómo entonces es que somos diferentes? Aunque es muy raro, estas letras a veces mutan aleatoriamente. Por ejemplo, una C puede cambiar a una T. A través de cientos de miles de años suficientes mutaciones han ocurrido para crear variación entre los humanos. La teoría de selección natural nos dice que si esta mutación confiere una ventaja para la supervivencia, el que la posee tiene más probabilidad de pasarla a sus descendientes. Por ejemplo, en Europa la piel clara es más ventajosa, por su habilidad de absorber vitamina D cuando hay poco sol, que en África Occidental donde la melanina en la piel oscura protege del sol intenso. Se estima que las diferencias entre los humanos se pueden encontrar en por lo menos 10 millones de las 3 mil millones de letras (noten que es menos de 1%).

Genéticamente, ¿qué es una “raza” ?
Esta es un pregunta controversial. Lo que no es controversial es que si comparamos la serie de letras de los europeos del norte con los africanos occidentales o con los indígenas de las Américas, encontramos pedazos del código que son únicos a cada región. Si estudiamos las partes del código que cambian entre humanos, fácilmente podemos distinguir los tres grupos. Esto no nos debe sorprender dado que, por ejemplo, la diferencia en el color de ojos y la pigmentación de la piel se codifica con distintas letras en los genes asociados con estas características. En este sentido podríamos crear una definición genética de “raza” basada en las letras que distinguen a estos grupos. Ahora bien, ¿podemos hacer lo mismo para distinguir un puertorriqueño de un dominicano? ¿Podemos crear una definición genética que incluye a Carlos Delgado y a Mónica Puig, pero no a Robinson Canó y Juan Luis Guerra? La literatura científica nos dice que no.


En una serie de artículos , el genético Carlos Bustamante y sus colegas han estudiado los genomas de personas de varios grupos étnicos. Ellos definen una distancia genética que resumen con dos dimensiones en la gráfica arriba. Cada punto es una persona y el color presenta a su grupo. Noten los tres extremos de la gráfica con muchos puntos del mismo color amontonados. Estos son los europeos blancos (puntos rojo), africanos occidentales (verde) e indígenas americanos (azul). Los puntos más regados en el medio son las poblaciones mezcladas. Entre los europeos y los indígenas vemos a los mexicanos y entre los europeos y africanos a los afroamericanos. Los puertorriqueños son los puntos anaranjados. He resaltado con números a tres de ellos. El 1 está cerca del supuesto humano “perfecto”. El 2 es indistinguible de un europeo y el 3 es indistinguible de un afroamericano. Los demás cubrimos un espectro amplio. También resalto con el número 4 a un dominicano que está tan cerca a la “perfección” como la puertorriqueña. La observación principal es que hay mucha variación genética entre los puertorriqueños. En los que Bustamante estudió, la ascendencia africana varía de 5-60%, la europea de 35-95% y la taína de 0-20%. ¿Cómo entonces podemos hablar de una "raza" puertorriqueña cuando nuestros genomas abarcan un espacio tan grande que puede incluir, entre otros, europeos, afroamericanos y dominicanos  ?

¿Qué son los genes “buenos”?
Algunas mutaciones son letales. Otras resultan en cambios a proteínas que causan enfermedades como la fibrosis quística y requieren que ambos padres tengan la mutación. Por lo tanto la mezcla de genomas diferentes disminuye las probabilidades de estas enfermedades. Recientemente una serie de estudios ha encontrado ventajas de algunas combinaciones de letras relacionadas a enfermedades comunes como la hipertensión. Una mezcla genética que evita tener dos copias de estos genes con más riesgo puede ser ventajosa. Pero las supuestas ventajas son pequeñísimas y específicas a enfermedades, no a otras características que asociamos con la “perfección”. El concepto de “genes buenos” es un vestigio de la eugenesia.

A pesar de nuestros problemas sociales y económicos actuales, Puerto Rico tiene mucho de lo cual estar orgulloso. En particular, producimos buenísimos ingenieros, atletas y músicos. Atribuir su éxito a “genes buenos” de nuestra “raza” no sólo es un disparate científico, sino una falta de respeto a estos individuos que a través del trabajo duro, la disciplina y el esmero han logrado lo que han logrado. Si quieren saber si Puerto Rico tuvo algo que ver con el éxito de estos individuos, pregúntenle a un historiador, un antropólogo o un sociólogo y no a un genetista. Ahora, si quieren aprender del potencial de estudiar genomas para mejorar tratamientos médicos y la importancia de estudiar una diversidad de individuos, un genetista tendrá mucho que compartir.


Thinking Like a Statistician: Social Media and the ‘Spiral of Silence’

A few months ago the Pew Research Internet Project published a paper on social media and the ‘spiral of silence’. Their main finding is that people are less likely to discuss a controversial topic on social media than in person. Unlike others, I  did not find this result surprising, perhaps because I think like a statistician.

Shares or retweets of published opinions on controversial political topics - religion, abortion rights, gender inequality, immigration, income inequality, race relations, the role of government, foreign policy, education, climate change - are ubiquitous in social media. These are usually accompanied by passionate statements of strong support or outraged disagreement. Because these are posted by people we elect to follow, we generally agree with what we see on our feeds. Here is a statistical explanation for why many keep silent when they disagree.

We will summarize the political view of an individual as their opinions on the 10 topics listed above. For simplicity I will assume these opinions can be quantified with a left (liberal) to right (conservative) scale. Every individual can therefore be defined by a point in a 10 dimensional space. Once quantified in this way, we can define a political distance between any pair of individuals. In the American landscape there are two clear clusters which I will call the Fox News and MSNBC clusters. As seen in the illustration below, the cluster centers are very far from each other and individuals within the clusters are very close. Each cluster has a very low opinion of the other. A glance through a social media feed will quickly reveal individuals squarely inside one of these clusters. Members of the clusters fearlessly post their opinions on controversial topics as this behavior is rewarded by likes, retweets or supportive comments from others in their cluster. Based on the uniformity of opinion inferred from the comments, one would think that everybody is in one of these two groups. But this is obviously not the case.


In the illustration above I include an example of an individual (the green dot) that is outside the two clusters. Although not shown, there are many of these independent thinkers. In our example, this individual is very close to the MSNBC cluster, but not in it. The controversial topic posts in this person's feed are mostly posted by those in the cluster of closest proximity, and the spiral of silence is due in part to the fact that independent thinkers are uniformly adverse to disagreeing publicly. For the mathematical explanation of why, we introduce the concept of a projection.

In mathematics, a projection can map a multidimensional point to a smaller, simpler, subset. In our illustration, the independent thinker is very close to the MSNBC cluster on all dimensions except one. To use education as an example, let's say this person supports school choice. As seen in the illustration, in the projection to the education dimension, that mostly liberal person is squarely in the Fox News cluster. Now imagine that a friend shares an article on The Corporate Takeover of Public Education along with a passionate statement of approval. Independent thinkers have a feeling that by voicing their dissent, dozens, perhaps hundreds, of strangers on social media (friends of friends for example) will judge them solely on this projection. To make matters worse, public shaming of the independent thinker, for supposedly being a member of the Fox News cluster, will then be rewarded by increased social standing among the MSNBC cluster as evidenced by retweets, likes and supportive comments. In a worse case scenario for this person, and best case scenario for the critics, this public shaming goes viral. While the short term rewards for preaching to the echo chamber are clear, there are no apparent incentives for dissent.

The superficial and fast paced nature of social media is not amenable to nuances and subtleties. Disagreement with the groupthink on one specific topic can therefore get a person labeled as a "neoliberal corporate shill" by the MSNBC cluster or a "godless liberal" by the Fox News one. The irony is that in social media, those politically closest to you, will be the ones attaching the unwanted label.


HarvardX Biomedical Data Science Open Online Training Curriculum launches on January 19

We recently received funding from the NIH BD2K initiative to develop MOOCs for biomedical data science. Our first offering will be version 2 of my Data Analysis for Genomics course which will launch on January 19. In this version, the course will be turned into an 8 course series and you can get a certificate in each one of them. The motivation for doing this is to go more in-depth into the different topics and to provide different entry points for students with different levels of expertise. We provide four courses on concepts and skills and four case-study based course. We basically broke the original class into the following eight parts:

    1. Statistics and R for the Life Sciences
    2. Introduction to Linear Models and Matrix Algebra
    3. Advanced Statistics for the Life Sciences
    4. Introduction to Bioconductor
    5. Case study: RNA-seq data analysis
    6. Case study: Variant Discovery and Genotyping
    7. Case study: ChIP-seq data analysis
    8. Case study: DNA methylation data analysis

You can follow the links to enroll. While not required, some familiarity with R and Rstudio will serve you well so consider taking Roger's R course and Jeff's Toolbox course before delving into this class.

In years 2 and 3 we plan to introduce several other courses covering topics such as python for data analysis, probability, software engineering, and data visualization which will be taught by a collaboration between the departments of Biostatistics, Statistics and Computer Science at Harvard.

Announcements will be made here and on twitter: @rafalab



Data Science Students Predict the Midterm Election Results

As explained in an earlier post, one of the homework assignments of my CS109 class was to predict the results of the midterm election. We created a competition in which 49 students entered. The most interesting challenge was to provide intervals for the republican - democrat difference in each of the 35 senate races. Anybody missing more than 2 was eliminated. The average size of the intervals was the tie breaker.

The main teaching objective here was to get students thinking about how to evaluate prediction strategies when chance is involved. To a naive observer, a biased strategy that favored democrats and correctly called, say, Virginia may look good in comparison to strategies that called it a toss-up.  However, a look at the other 34 states would reveal the weakness of this biased strategy. I wanted students to think of procedures that can help distinguish lucky guesses from strategies that universally perform well.

One of the concepts we discussed in class was the systematic bias of polls which we modeled as a random effect. One can't infer this bias from polls until after the election passes. By studying previous elections students were able to estimate the SE of this random effect and incorporate it into the calculation of intervals. The realization of this random effect was very large in these elections (about +4 for the democrats) which clearly showed the importance of modeling this source of variability. Strategies that restricted standard error measures to sample estimates from this year's polls did very poorly. The 90% credible intervals provided by 538, which I believe does incorporate this, missed 8 of the 35 races (23%). This suggests that they underestimated the variance.  Several of our students compared favorably to 538:

name avg bias MSE avg interval size # missed
Manuel Andere -3.9 6.9 24.1 3
Richard Lopez -5.0 7.4 26.9 3
Daniel Sokol -4.5 6.4 24.1 4
Isabella Chiu -5.3 9.6 26.9 6
Denver Mosigisi Ogaro -3.2 6.6 18.9 7
Yu Jiang -5.6 9.6 22.6 7
David Dowey -3.5 6.2 16.3 8
Nate Silver -4.2 6.6 16.4 8
Filip Piasevoli -3.5 7.4 22.1 8
Yapeng Lu -6.5 8.2 16.5 10
David Jacob Lieb -3.7 7.2 17.1 10
Vincent Nguyen -3.8 5.9 11.1 14

It is important to note that 538 would have probably increased their interval size had they actively participated in a competition requiring 95% of the intervals to cover. But all in all, students did very well. The majority correctly predicted the republican take over. The median mean square error across all 49 participantes was 8.2 which was not much worse that 538's 6.6. Other example of strategies that I think helped some of these students perform well was the use of creative weighting schemes (based on previous elections) to average poll and the use of splines to estimate trends, which in this particular election were going in the republican's favor.

Here are some plots showing results from two of our top performers:

Rplot Rplot01

I hope this exercise helped students realize that data science can be both fun and useful. I can't wait to do this again in 2016.





538 election forecasts made simple

Nate Silver does a great job of explaining his forecast model to laypeople. However, as a statistician I've always wanted to know more details. After preparing a "predict the midterm electionshomework for my data science class I have a better idea of what is going on.

Here is my best attempt at explaining the ideas of 538 using formulas and data. And here is the R markdown.