Simply Statistics

12 Dec

Kobe, data says stop blaming your teammates

This year, Kobe leads the league in missed shots (by a lot), has an abysmal FG% of 39, and his team plays better when he is on the bench. Yet he blames his teammates for the Lakers' 6-16 record. Below is a plot showing that 2014 is not the first time the Lakers have been mediocre during Kobe's tenure. It shows the percentage points above .500 per season, with the Shaq and twin-towers eras highlighted. I include the same plot for LeBron as a control.

[Figure: percentage points above .500 per season for the Lakers during Kobe's tenure, with the Shaq and twin-towers eras highlighted, and the same plot for LeBron's teams as a control]

So stop blaming your teammates!

And here is my hastily written code (don't judge me!).
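The code itself is no longer shown here, so what follows is a minimal sketch of how such a plot could be drawn, not the original code. The `records` data frame, the `plot_team` helper, and the era boundaries are all assumptions; real season records would have to be assembled first, for example from basketball-reference.com.

```r
## Minimal sketch (not the original code). Assumes a data frame
## `records` with columns season, wins and losses for one team.
pct_above_500 <- function(wins, losses) 100 * (wins / (wins + losses) - 0.5)

plot_team <- function(records, team, era_start = NULL, era_end = NULL) {
  y <- pct_above_500(records$wins, records$losses)
  plot(records$season, y, type = "b", pch = 16,
       xlab = "Season", ylab = "Percentage points above .500", main = team)
  abline(h = 0, lty = 2)
  ## shade the highlighted era (e.g. the Shaq years)
  if (!is.null(era_start))
    rect(era_start, min(y) - 5, era_end, max(y) + 5,
         col = rgb(0, 0, 1, 0.15), border = NA)
}
```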

08 Dec

Genetically, there is no such thing as a Puerto Rican race

Editor's note: Last week the Latin American media picked up a blog post with the eye-catching title "The perfect human is Puerto Rican". More attention appears to have been given to the title than to the post itself. The coverage and the comments on social media have demonstrated the need for scientific education on the topic of genetics and race. Here I will try to explain, in layman's terms, why the interpretations I read in the main Puerto Rican paper are scientifically incorrect and somewhat concerning. The post was originally published in Spanish.

In a recent article titled "Ser humano perfecto sería puertorriqueño" ("The perfect human being would be Puerto Rican"), El Nuevo Día summarized a blog post (erroneously calling it a study) by the mathematician Lior Pachter. The author of the post, setting out to ridicule racist remarks he had heard James Watson make, describes a thought experiment in which he finds that the "perfect" human (the quotes are important), if such a person existed, would belong to a genetically admixed group. Of the people studied, the one genetically closest to his "perfect" human turned out to be a Puerto Rican woman. The point of the exercise was to ridicule the idea that one race can be superior to another. El Nuevo Día seems to have missed this point and tells us that "the expert concluded that in any case it is no surprise that the person closest to such perfection would be a Puerto Rican woman, given the combination of good genes that the Puerto Rican race has." Here I describe why this interpretation is scientifically wrong.

What is the genome?
The human genome encodes (in DNA molecules) the genetic information needed for our biological development. We can think of the genome as two concatenated series of 3,000,000,000 letters (A, T, C or G). We receive one from our father and the other from our mother. Different pieces (the genes) encode the proteins necessary for the thousands of functions our cells carry out, which in turn give rise to some of our physical characteristics. With a few exceptions, every cell in our body contains an exact copy of these two series of letters. Sperm and egg cells carry only one series of letters, a mix of the other two. When sperm and egg unite, the new cell, the zygote, joins the two series, and this is how we inherit characteristics from each parent.

What is genetic variation?
If we all come from the first humans, how then is it that we are different? Although it happens very rarely, these letters sometimes mutate at random. For example, a C can change into a T. Over hundreds of thousands of years, enough mutations have occurred to create variation among humans. The theory of natural selection tells us that if a mutation confers a survival advantage, whoever carries it is more likely to pass it on to their descendants. For example, light skin is more advantageous in Europe, because of its ability to absorb vitamin D when there is little sun, than in West Africa, where the melanin in dark skin protects against the intense sun. It is estimated that the differences among humans can be found in at least 10 million of the 3 billion letters (note that this is less than 1%).

Genetically, what is a "race"?
This is a controversial question. What is not controversial is that if we compare the series of letters of northern Europeans with those of West Africans or of the indigenous peoples of the Americas, we find pieces of the code that are unique to each region. If we study the parts of the code that vary among humans, we can easily distinguish the three groups. This should not surprise us given that, for example, differences in eye color and in skin pigmentation are encoded with different letters in the genes associated with these characteristics. In this sense we could create a genetic definition of "race" based on the letters that distinguish these groups. Now then, can we do the same to distinguish a Puerto Rican from a Dominican? Can we create a genetic definition that includes Carlos Delgado and Mónica Puig but not Robinson Canó and Juan Luis Guerra? The scientific literature tells us we cannot.

[Figure: two-dimensional summary of genetic distance; each point is a person, colored by ethnic group]

In a series of articles, the geneticist Carlos Bustamante and his colleagues have studied the genomes of people from several ethnic groups. They define a genetic distance that they summarize with the two dimensions in the plot above. Each point is a person and the color represents their group. Note the three corners of the plot where many points of the same color bunch together. These are the white Europeans (red points), West Africans (green), and indigenous Americans (blue). The more scattered points in the middle are the admixed populations. Between the Europeans and the indigenous Americans we see the Mexicans, and between the Europeans and the Africans, the African Americans. The Puerto Ricans are the orange points. I have highlighted three of them with numbers. Number 1 is close to the supposed "perfect" human. Number 2 is indistinguishable from a European, and number 3 is indistinguishable from an African American. The rest of us cover a wide spectrum. I also highlight, with the number 4, a Dominican who is as close to "perfection" as the Puerto Rican woman. The main observation is that there is a great deal of genetic variation among Puerto Ricans. In those Bustamante studied, African ancestry ranges from 5-60%, European from 35-95%, and Taíno from 0-20%. How, then, can we speak of a Puerto Rican "race" when our genomes span a space so large that it can include, among others, Europeans, African Americans, and Dominicans?
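A standard way to produce a two-dimensional summary like the one above is principal component analysis; the original studies may use a different distance. Here is a minimal sketch in R, with a simulated genotype matrix standing in for real data:

```r
## Sketch: summarize genetic distance with two dimensions via PCA.
## `geno` is a simulated (hypothetical) matrix with one row per person
## and one column per variable site, coded as 0, 1 or 2 copies of the
## less common letter.
set.seed(1)
n <- 100; p <- 500
geno <- matrix(rbinom(n * p, size = 2, prob = 0.3), nrow = n)
pc <- prcomp(geno)
plot(pc$x[, 1], pc$x[, 2], xlab = "PC1", ylab = "PC2",
     main = "Each point is a person; proximity reflects genetic similarity")
```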

What are "good" genes?
Some mutations are lethal. Others result in changes to proteins that cause diseases, such as cystic fibrosis, which require that both parents carry the mutation. Mixing different genomes therefore lowers the probability of these diseases. Recently, a series of studies has found advantages in certain combinations of letters related to common diseases such as hypertension. A genetic mix that avoids carrying two copies of the higher-risk versions of these genes can be advantageous. But these supposed advantages are tiny and specific to diseases, not to the other characteristics we associate with "perfection". The concept of "good genes" is a vestige of eugenics.

Despite our current social and economic problems, Puerto Rico has much to be proud of. In particular, we produce excellent engineers, athletes, and musicians. Attributing their success to the "good genes" of our "race" is not only scientific nonsense but also disrespectful to these individuals, who through hard work, discipline, and dedication have achieved what they have achieved. If you want to know whether Puerto Rico had something to do with the success of these individuals, ask a historian, an anthropologist, or a sociologist, not a geneticist. Now, if you want to learn about the potential of studying genomes to improve medical treatments, and about the importance of studying a diverse set of individuals, a geneticist will have plenty to share.

02 Dec

Thinking Like a Statistician: Social Media and the ‘Spiral of Silence’

A few months ago the Pew Research Internet Project published a paper on social media and the 'spiral of silence'. Their main finding is that people are less likely to discuss a controversial topic on social media than in person. Unlike others, I did not find this result surprising, perhaps because I think like a statistician.

Shares or retweets of published opinions on controversial political topics - religion, abortion rights, gender inequality, immigration, income inequality, race relations, the role of government, foreign policy, education, climate change - are ubiquitous in social media. These are usually accompanied by passionate statements of strong support or outraged disagreement. Because these are posted by people we elect to follow, we generally agree with what we see on our feeds. Here is a statistical explanation for why many keep silent when they disagree.

We will summarize the political view of an individual as their opinions on the 10 topics listed above. For simplicity I will assume these opinions can be quantified with a left (liberal) to right (conservative) scale. Every individual can therefore be defined by a point in a 10 dimensional space. Once quantified in this way, we can define a political distance between any pair of individuals. In the American landscape there are two clear clusters which I will call the Fox News and MSNBC clusters. As seen in the illustration below, the cluster centers are very far from each other and individuals within the clusters are very close. Each cluster has a very low opinion of the other. A glance through a social media feed will quickly reveal individuals squarely inside one of these clusters. Members of the clusters fearlessly post their opinions on controversial topics as this behavior is rewarded by likes, retweets or supportive comments from others in their cluster. Based on the uniformity of opinion inferred from the comments, one would think that everybody is in one of these two groups. But this is obviously not the case.

[Figure: illustration of the Fox News and MSNBC clusters in opinion space, with an independent thinker shown as a green dot outside both]

In the illustration above I include an example of an individual (the green dot) that is outside the two clusters. Although not shown, there are many of these independent thinkers. In our example, this individual is very close to the MSNBC cluster, but not in it. The controversial posts in this person's feed come mostly from the cluster in closest proximity, and the spiral of silence is due in part to the fact that independent thinkers are uniformly averse to disagreeing publicly. For the mathematical explanation of why, we introduce the concept of a projection.

In mathematics, a projection can map a multidimensional point to a smaller, simpler subset. In our illustration, the independent thinker is very close to the MSNBC cluster on all dimensions except one. To use education as an example, let's say this person supports school choice. As seen in the illustration, in the projection to the education dimension, that mostly liberal person is squarely in the Fox News cluster. Now imagine that a friend shares an article on The Corporate Takeover of Public Education along with a passionate statement of approval. Independent thinkers have a feeling that by voicing their dissent, dozens, perhaps hundreds, of strangers on social media (friends of friends, for example) will judge them solely on this projection. To make matters worse, public shaming of the independent thinker, for supposedly being a member of the Fox News cluster, will then be rewarded by increased social standing among the MSNBC cluster, as evidenced by retweets, likes and supportive comments. In a worst-case scenario for this person, and best-case scenario for the critics, this public shaming goes viral. While the short-term rewards for preaching to the echo chamber are clear, there are no apparent incentives for dissent.
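To make the projection argument concrete, here is a small simulation in R; the cluster locations, spreads, and the number of topics are of course made up:

```r
## Two opinion clusters in 10 dimensions (negative = left, positive = right)
set.seed(1)
msnbc <- matrix(rnorm(50 * 10, mean = -2, sd = 0.5), ncol = 10)
fox   <- matrix(rnorm(50 * 10, mean =  2, sd = 0.5), ncol = 10)
## An independent thinker: near MSNBC on nine topics, but on the
## Fox News side of the education dimension (topic 10)
indep <- c(rnorm(9, mean = -2, sd = 0.5), 2)

## The full 10-dimensional distances tell the real story...
sqrt(sum((indep - colMeans(msnbc))^2))  # small: close to the MSNBC cluster
sqrt(sum((indep - colMeans(fox))^2))    # large
## ...but the projection onto education alone is misleading
abs(indep[10] - colMeans(msnbc)[10])    # large
abs(indep[10] - colMeans(fox)[10])      # small: looks like a Fox News member
```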

The superficial and fast-paced nature of social media is not amenable to nuances and subtleties. Disagreement with the groupthink on one specific topic can therefore get a person labeled as a "neoliberal corporate shill" by the MSNBC cluster or a "godless liberal" by the Fox News one. The irony is that in social media, those politically closest to you will be the ones attaching the unwanted label.

25 Nov

HarvardX Biomedical Data Science Open Online Training Curriculum launches on January 19

We recently received funding from the NIH BD2K initiative to develop MOOCs for biomedical data science. Our first offering will be version 2 of my Data Analysis for Genomics course, which will launch on January 19. In this version, the course will be turned into an eight-course series, and you can get a certificate in each one of them. The motivation for doing this is to go more in depth into the different topics and to provide different entry points for students with different levels of expertise. We provide four courses on concepts and skills and four case-study-based courses. We basically broke the original class into the following eight parts:

    1. Statistics and R for the Life Sciences
    2. Introduction to Linear Models and Matrix Algebra
    3. Advanced Statistics for the Life Sciences
    4. Introduction to Bioconductor
    5. Case study: RNA-seq data analysis
    6. Case study: Variant Discovery and Genotyping
    7. Case study: ChIP-seq data analysis
    8. Case study: DNA methylation data analysis

You can follow the links to enroll. While not required, some familiarity with R and RStudio will serve you well, so consider taking Roger's R course and Jeff's Toolbox course before delving into this class.

In years 2 and 3 we plan to introduce several other courses covering topics such as Python for data analysis, probability, software engineering, and data visualization, which will be taught through a collaboration between the departments of Biostatistics, Statistics and Computer Science at Harvard.

Announcements will be made here and on Twitter: @rafalab


12 Nov

Data Science Students Predict the Midterm Election Results

As explained in an earlier post, one of the homework assignments in my CS109 class was to predict the results of the midterm elections. We created a competition that 49 students entered. The most interesting challenge was to provide intervals for the republican-democrat difference in each of the 35 senate races. Anybody missing more than two races was eliminated. The average size of the intervals was the tiebreaker.

The main teaching objective here was to get students thinking about how to evaluate prediction strategies when chance is involved. To a naive observer, a biased strategy that favored democrats and correctly called, say, Virginia may look good in comparison to strategies that called it a toss-up.  However, a look at the other 34 states would reveal the weakness of this biased strategy. I wanted students to think of procedures that can help distinguish lucky guesses from strategies that universally perform well.

One of the concepts we discussed in class was the systematic bias of polls, which we modeled as a random effect. One can't infer this bias from polls until after the election passes. By studying previous elections, students were able to estimate the SE of this random effect and incorporate it into the calculation of intervals (a sketch of this calculation appears below). The realization of this random effect was very large in these elections (about +4 for the democrats), which clearly showed the importance of modeling this source of variability. Strategies that restricted standard error measures to sample estimates from this year's polls did very poorly. The 90% credible intervals provided by 538, which I believe does incorporate this, missed 8 of the 35 races (23%). This suggests that they underestimated the variance. Several of our students compared favorably to 538:

| name | avg bias | MSE | avg interval size | # missed |
|---|---|---|---|---|
| Manuel Andere | -3.9 | 6.9 | 24.1 | 3 |
| Richard Lopez | -5.0 | 7.4 | 26.9 | 3 |
| Daniel Sokol | -4.5 | 6.4 | 24.1 | 4 |
| Isabella Chiu | -5.3 | 9.6 | 26.9 | 6 |
| Denver Mosigisi Ogaro | -3.2 | 6.6 | 18.9 | 7 |
| Yu Jiang | -5.6 | 9.6 | 22.6 | 7 |
| David Dowey | -3.5 | 6.2 | 16.3 | 8 |
| Nate Silver | -4.2 | 6.6 | 16.4 | 8 |
| Filip Piasevoli | -3.5 | 7.4 | 22.1 | 8 |
| Yapeng Lu | -6.5 | 8.2 | 16.5 | 10 |
| David Jacob Lieb | -3.7 | 7.2 | 17.1 | 10 |
| Vincent Nguyen | -3.8 | 5.9 | 11.1 | 14 |

It is important to note that 538 would probably have increased their interval size had they actively participated in a competition requiring 95% of the intervals to cover. But all in all, students did very well. The majority correctly predicted the republican takeover. The median mean squared error across all 49 participants was 8.2, which was not much worse than 538's 6.6. Other examples of strategies that I think helped some of these students perform well were the use of creative weighting schemes (based on previous elections) to average polls, and the use of splines to estimate trends, which in this particular election were moving in the republicans' favor.
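For concreteness, here is a minimal sketch of the interval calculation described above, for a single race. The poll spreads and the bias SD (`sigma_bias`) are made-up numbers; in the homework the latter was estimated from previous elections.

```r
## Combine this year's sampling variability with the systematic,
## election-wide poll bias modeled as a random effect
polls      <- c(2.1, -0.5, 1.3, 0.8)           # hypothetical R - D spreads (%)
spread     <- mean(polls)
se_sample  <- sd(polls) / sqrt(length(polls))  # this year's sampling SE
sigma_bias <- 3                                # SD of the bias random effect
se_total   <- sqrt(se_sample^2 + sigma_bias^2)
spread + c(-1, 1) * qnorm(0.975) * se_total    # interval accounting for bias
```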

Here are some plots showing results from two of our top performers:

[Figures: prediction intervals and results from two of our top performers]

I hope this exercise helped students realize that data science can be both fun and useful. I can't wait to do this again in 2016.

 

 

 

04 Nov

538 election forecasts made simple

Nate Silver does a great job of explaining his forecast model to laypeople. However, as a statistician I've always wanted to know more details. After preparing a "predict the midterm elections" homework for my data science class, I have a better idea of what is going on.

Here is my best attempt at explaining the ideas of 538 using formulas and data. And here is the R markdown.
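To give a flavor, here is a one-line caricature of the final step (538's actual model aggregates many more ingredients, and the numbers here are made up): given an estimated spread and a total standard error that includes the systematic poll bias, the forecast probability of a republican win follows directly.

```r
spread   <- 1.5                     # estimated R - D spread in % (made up)
se_total <- sqrt(1^2 + 3^2)         # sampling SE plus bias SD (made up)
1 - pnorm(0, mean = spread, sd = se_total)  # forecast P(republican wins)
```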


28 Oct

Why I support statisticians and their resistance to hype

Despite Statistics being the most mature data-related discipline, statisticians have not fared well in terms of being selected for funding or leadership positions in the new initiatives brought about by the increasing interest in data. To give just one example (Jeff and Terry Speed give many more), the White House Big Data Partners Workshop had 19 members, of whom zero were statisticians. The statistical community is clearly worried about this predicament, and there is widespread consensus that we need to be better at marketing. Although I agree that only good can come from better communicating what we do, it is also important to continue doing one of the things we do best: resisting the hype and being realistic about data.

This week, after reading Mike Jordan's Reddit Ask Me Anything, I was reminded of exactly how much I admire this quality in statisticians. From reading the interview, one learns about instances where hype has led to confusion, and how getting past this confusion helps us better understand, and consequently appreciate, the importance of his field. For the past 30 years, Mike Jordan has been one of the most prolific academics working in the areas that today are receiving increased attention. Yet you won't find a hyped-up press release coming out of his lab. In fact, when a journalist tried to hype up Jordan's critique of hype, Jordan called out the author.

Assessing the current situation with data initiatives, it is hard not to conclude that hype is being rewarded. Many statisticians have come to the sad realization that by being cautious and skeptical, we may be losing out on funding possibilities and leadership roles. However, I remain very much upbeat about our discipline. First, being skeptical and cautious has actually led to many important contributions. An important example is how randomized controlled experiments changed the way medical procedures are evaluated. A more recent one is the concept of FDR, which helps control false discoveries in, for example, high-throughput experiments. Second, many of us continue to work at the interface with real-world applications, placing us in a good position to make relevant contributions. Third, despite the failures alluded to above, we continue to successfully find ways to fund our work. Although resisting the hype has cost us in the short term, we will continue to produce methods that will be useful in the long term, as we have been doing for decades. Our methods will still be used when today's hyped-up press releases are long forgotten.
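As an aside, for readers unfamiliar with FDR, here is a tiny illustration of the Benjamini-Hochberg adjustment in R; the simulated mix of null and signal p-values is arbitrary.

```r
## Select discoveries while controlling the expected false discovery
## rate at 5%, from mostly-null p-values plus some true signals
set.seed(1)
pvals <- c(runif(900), rbeta(100, 0.5, 10))
sum(p.adjust(pvals, method = "BH") < 0.05)
```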


20 Oct

Thinking like a statistician: don't judge a society by its internet comments

In a previous post I explained how thinking like a statistician can help you avoid feeling sad after using Facebook. The basic point was that missing-not-at-random (MNAR) data on your friends' profiles (showing only the best parts of their lives) can result in the biased view that your life is boring and uninspiring in comparison. A similar argument can be made to avoid losing faith in humanity after reading internet comments or anonymous tweets, one of the most depressing activities I have voluntarily engaged in. If you want to see proof that racism, xenophobia, sexism and homophobia are still very much alive, read the unfiltered comment sections of articles related to race, immigration, gender or gay rights. However, as a statistician, I remain optimistic about our society after realizing how extremely biased these particular MNAR data can be.

Assume we could summarize an individual's "righteousness" with a numerical index. I realize this is a gross oversimplification, but bear with me. Below is my view on the distribution of this index across all members of our society.

[Figure: hypothesized distribution of the righteousness index across society; a small "bad" tail is shown in red]

Note that the distribution is not bimodal. This means there is no gap between good and evil; instead we have a continuum. Although there is variability, and we do have some extreme outliers on both sides of the distribution, most of us are much closer to the median than we like to believe. The offending internet commenters represent a very small proportion (the "bad" tail shown in red). But in a population as large as that of internet users, this extremely small proportion can still be quite numerous, and it gives us a biased view.

There is one more level of variability here that introduces bias. Since internet comments can be anonymous, we get an unprecedentedly large glimpse into people's opinions and thoughts. We can assign a "righteousness" index to each thought and opinion and include these in the figure above. Note that this index exhibits variability within individuals: even the best people have the occasional bad thought. The points in red represent thoughts so awful that no one, not even the worst people, would ever express them publicly. The red points give us an overly pessimistic estimate of the individuals posting these comments, which exacerbates our already pessimistic view due to a non-representative sample of individuals.
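A quick simulation shows how strong this selection effect can be; the righteousness distribution and the posting mechanism below are, of course, invented for illustration.

```r
## The probability of posting an awful anonymous comment rises sharply
## as righteousness decreases -- an MNAR mechanism
set.seed(1)
righteousness <- rnorm(1e6)
p_post <- plogis(-10 - 5 * righteousness)
posters <- righteousness[runif(1e6) < p_post]
median(righteousness)  # about 0: the population is centered
median(posters)        # deep in the bad tail: the sample we actually see
```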

I hope that thinking like a statistician will help the media and social networks put the awful tweets and internet comments that represent the worst of the worst into statistical perspective. They actually provide little to no information on humanity's distribution of righteousness, which I think is moving consistently, albeit slowly, toward the good.


17 Oct

Bayes Rule in an animated gif

Say Pr(A) = 5% is the prevalence of a disease (the percentage of red dots in the top figure). Each individual is given a test with accuracy Pr(B|A) = Pr(no B|no A) = 90%. The O in the middle turns into an X when the test fails, so the rate of Xs is 1 - Pr(B|A). We want to know the probability of having the disease if you test positive: Pr(A|B). Many find it counterintuitive that this probability is much lower than 90%; this animated gif is meant to help.

The individual being tested is highlighted with a moving black circle. A proportion Pr(B) of individuals will test positive: we put these in the bottom left and the rest in the bottom right. The proportion of red points that end up in the bottom left is the proportion of red points, Pr(A), times the rate of positive tests among them, Pr(B|A), thus Pr(B|A) x Pr(A). Pr(A|B), the proportion of reds among the points in the bottom left, is therefore Pr(B|A) x Pr(A) divided by Pr(B): Pr(A|B) = Pr(B|A) x Pr(A) / Pr(B).
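Plugging the gif's numbers into the formula confirms the counterintuitive answer:

```r
prA <- 0.05                               # prevalence, Pr(A)
acc <- 0.90                               # Pr(B|A) = Pr(no B|no A)
prB <- acc * prA + (1 - acc) * (1 - prA)  # Pr(B): all positive tests
acc * prA / prB                           # Pr(A|B) is about 0.32, not 0.90
```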

ps - Is this a frequentist or Bayesian gif?

13 Oct

I declare the Bayesian vs. Frequentist debate over for data scientists

In a recent New York Times article the "Frequentists versus Bayesians" debate was brought up once again. I agree with Roger:

Because the real story (or non-story) is way too boring to sell newspapers, the author resorted to a sensationalist narrative that went something like this: "Evil and/or stupid frequentists were ready to let a fisherman die; the persecuted Bayesian heroes saved him." This piece adds to the growing number of writings blaming frequentist statistics for the so-called reproducibility crisis in science. If there is one thing Roger, Jeff and I agree on, it is that this debate is not constructive. As Rob Kass suggests, it's time to move on to pragmatism. Here I follow up on Jeff's recent post by sharing related thoughts brought about by two decades of practicing applied statistics, in the hope of putting this unhelpful debate to rest.

Applied statisticians help answer questions with data. How should I design a roulette wheel so my casino makes money? Does this fertilizer increase crop yield? Does streptomycin cure pulmonary tuberculosis? Does smoking cause cancer? What movie would this user enjoy? Which baseball player should the Red Sox give a contract to? Should this patient receive chemotherapy? Our involvement typically means analyzing data and designing experiments. To do this we use a variety of techniques that have been successfully applied in the past and that we have mathematically shown to have desirable properties. Some of these tools are frequentist, some of them are Bayesian, some could be argued to be both, and some don't even use probability. The casino will do just fine with frequentist statistics, while the baseball team might want to apply a Bayesian approach to avoid overpaying for players who have simply been lucky.

It is also important to remember that good applied statisticians also *think*. They don't apply techniques blindly or religiously. If applied statisticians, regardless of their philosophical bent, were asked whether the sun just exploded, they would not design an experiment like the one depicted in this popular XKCD cartoon.

Only someone who does not know how to think like a statistician would act like the frequentists in the cartoon. Unfortunately, we do have such people analyzing data. But their choice of technique is not the problem; it's their lack of critical thinking. However, even the most frequentist-appearing applied statistician understands Bayes rule and will adopt a Bayesian approach when appropriate. In the XKCD example above, any self-respecting applied statistician would not even bother examining the data (the dice roll), because they would assign a probability of 0 to the sun exploding (the empirical prior based on the fact that they are alive). However, superficial propositions arguing for wider adoption of Bayesian methods fail to realize that using these techniques in an actual data analysis project is very different from simply thinking like a Bayesian. To do so, we have to represent our intuition or prior knowledge (or whatever you want to call it) with mathematical formulae. When theoretical Bayesians pick these priors, they mainly have mathematical and computational considerations in mind. In practice we can't afford this luxury: a bad prior will render the analysis useless regardless of its convenient mathematical properties.

Despite these challenges, applied statisticians regularly use Bayesian techniques successfully. In one of the fields I work in, Genomics, empirical Bayes techniques are widely used. In one popular application of empirical Bayes, we use data from all genes to improve the precision of the estimates obtained for each specific gene. However, the most widely used output of the software implementation is not a posterior probability. Instead, an empirical Bayes technique is used to improve the estimate of the standard error used in a good ol' fashioned t-test. This idea has changed the way thousands of biologists search for differentially expressed genes and is, in my opinion, one of the most important contributions of Statistics to Genomics. Is this approach frequentist? Bayesian? To this applied statistician, it doesn't really matter.
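To sketch the idea (this resembles, but does not reproduce, the widely used implementation): each gene's noisy variance estimate is shrunk toward a common value before the t-statistic is formed. The prior values `s0` and `d0` below are made up; in practice they are estimated from all the genes.

```r
## Empirical Bayes moderated t: shrink gene-wise variances toward s0
set.seed(1)
G <- 1000; n <- 3                       # genes; replicates per gene
d  <- matrix(rnorm(G * n), nrow = G)    # per-gene differences (null here)
s2 <- apply(d, 1, var)                  # noisy gene-wise variance estimates
s0 <- 1; d0 <- 4                        # prior variance and its weight
s2_mod <- (d0 * s0 + (n - 1) * s2) / (d0 + n - 1)  # shrunken variances
t_mod  <- rowMeans(d) / sqrt(s2_mod / n)           # moderated t-statistics
```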

For those arguing that simply switching to a Bayesian philosophy will improve the current state of affairs, let's consider the smoking and cancer example. Today there is wide agreement that smoking causes lung cancer. Without a clear deductive biochemical/physiological argument, and without the possibility of a randomized trial, this connection was established with a series of observational studies. Most, if not all, of the associated data analyses were based on frequentist techniques. None of the reported confidence intervals established the consensus on its own. Instead, as usually happens in science, a long series of studies supporting this conclusion was needed. How exactly would this have been different with a strictly Bayesian approach? Would a single paper have been enough? Would using priors have helped, given the "expert knowledge" at the time (see below)?

And how would the Bayesian analyses performed by tobacco companies have shaped the debate? Ultimately, I think applied statisticians would have made an equally convincing case against smoking with Bayesian posteriors as with frequentist confidence intervals. Going forward, I hope applied statisticians remain free to use whatever techniques they see fit, and that critical thinking about data continues to be what distinguishes us. Imposing a Bayesian or frequentist philosophy on us would be a disaster.