Tag: statistical literacy


Statistical illiteracy may lead to parents panicking about autism.

I was just doing my morning reading of a few news sources and stumbled across this Huffington Post article talking about research correlating babies' cries with autism. It suggests that the sound of a baby's cries may predict his or her future risk for autism. As the parent of a young son, this obviously caught my attention in a very lizard-brain, caveman sort of way. I couldn't find a link to the research paper in the article, so I did some searching and found out this result is also being covered by Time, Science Daily, Medical Daily, and a bunch of other news outlets.

Now thoroughly freaked, I looked online and found the PDF of the original research article. I started looking at the statistics and took a deep breath. Based on the analysis they present in the article, there is absolutely no statistical evidence that a baby's cries can predict autism. Here are the flaws with the study:

  1. Small sample size. The authors only recruited 21 at-risk infants and 18 healthy infants. Then, because of data processing issues, they ended up analyzing only 7 high-autism-risk versus 5 low-autism-risk infants in one analysis and 10 versus 6 in another. That is nowhere near a representative sample and barely qualifies as a pilot study.
  2. Major and unavoidable confounding. The way the authors determined high autistic risk versus low risk was based on whether an older sibling had autism. Leaving aside the quality of this metric for measuring risk of autism, there is a major confounding factor: the families of the high risk children all had an older sibling with autism and the families of the low risk children did not! It would not be surprising at all if children with one autistic older sibling might get a different kind of attention and hence cry differently regardless of their potential future risk of autism.
  3. No correction for multiple testing. This is one of the oldest problems in statistical analysis, and a consistent culprit behind false positives in epidemiology studies. XKCD even did a cartoon about it! The authors tested 9 variables measuring the way babies cry, testing each one with a statistical hypothesis test, and did not correct for multiple testing. So I gathered the resulting p-values and did the correction for them. It turns out that after adjusting for multiple comparisons, nothing is significant at the usual P < 0.05 level, which would probably have prevented publication.
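To make the multiple-testing point concrete, here is a minimal sketch of the Bonferroni adjustment. The nine p-values below are hypothetical stand-ins for illustration, not the paper's actual numbers:

```python
# Sketch of a Bonferroni correction for multiple testing.
# The p-values below are hypothetical, not the paper's actual values.

def bonferroni(p_values, alpha=0.05):
    """Multiply each p-value by the number of tests (capped at 1.0)
    and flag which remain significant at level alpha."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    return adjusted, [p <= alpha for p in adjusted]

# Nine hypothetical raw p-values, one per cry variable tested
raw = [0.01, 0.04, 0.03, 0.20, 0.15, 0.08, 0.50, 0.02, 0.30]
adjusted, significant = bonferroni(raw)
print(adjusted)
print(any(significant))  # even the smallest raw p-value (0.01) adjusts to 0.09 > 0.05
```

With nine tests, a raw p-value needs to be below about 0.0056 to survive the correction at the 0.05 level, which is exactly why uncorrected testing of many variables so easily produces false positives.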

Taken together, these problems mean that the statistical analysis of these data does not show any connection between crying and autism.

The problem here exists on two levels. First, there was a failure in the statistical evaluation of this manuscript at the peer review stage. Most statistical referees would have spotted these flaws and pointed them out for such a highly controversial paper. Second, the news agencies reporting on this result, despite paying lip service to potential limitations, are not statistically literate enough to point out the major flaws in the analysis that reduce the probability of a true positive. Should journalists have some minimal training in statistics that allows them to determine whether a result is likely to be a false positive, to save us parents a lot of panic?



Sunday Data/Statistics Link Roundup (10/28/12)

  1. An important article about anti-science sentiment in the U.S. (via David S.). The politicization of scientific issues such as global warming, evolution, and healthcare (think vaccination) makes the U.S. less competitive. I think the lack of statistical literacy and training in the U.S. is one of the sources of the problem. People use/skew/mangle statistical analyses and experiments to support their view, and without a statistically well-trained public, it all looks “reasonable and scientific”. But when science seems to contradict itself, it loses credibility. Another reason to teach statistics to everyone in high school.
  2. Scientific American was loaded this last week; here is another article on cancer screening. The article covers several of the issues that make it hard to convince people that screening isn’t always good. Confusion about the predictive value of a positive test is a huge one in cancer screening right now. The author of the piece is someone worth following on Twitter: @hildabast.
  3. A bunch of data on the use of Github. Always cool to see new data sets that are worth playing with for student projects, etc. (via Hilary M.). 
  4. A really interesting post over at Stats Chat about why we study seemingly obvious things. Hint, the reason is that “obvious” things aren’t always true. 
  5. A story on “sentiment analysis” by NPR that suggests that most of the variation in a stock’s price during the day can be explained by the number of Facebook likes. Obviously, this is an interesting correlation. It would probably be more interesting for hedge funders/stock pickers if the correlation were with the change in stock price the next day. (via Dan S.)
  6. Yihui Xie visited our department this week. We had a great time chatting with him about knitr/animation and all the cool work he is doing. Here are his slides from the talk he gave. Particularly check out his idea for a fast journal. You are seeing the future of publishing.  
  7. Bonus Link: R is a trendy open source technology for big data

Sunday Data/Statistics Link Roundup (10/14/12)

  1. A fascinating article about the debate on whether to regulate sugary beverages. One of the protagonists is David Allison, a statistical geneticist, among other things. It is fascinating to see the interplay of statistical analysis and public policy. Yet another example of how statistics/data will drive some of the most important policy decisions going forward. 
  2. A related article is this one on the way risk is reported in the media. It is becoming more and more clear that to be an educated member of society now means that you absolutely have to have a basic understanding of the concepts of statistics. Both leaders and the general public are responsible for the danger that lies in misinterpreting/misleading with risk. 
  3. A press release from the Census Bureau about how the choice of college major can have a major impact on career earnings. More data breaking the results down by employment characteristics and major are here and here. These data update some of the data we have talked about before in calculating expected salaries by major. (via Scott Z.)
  4. An interesting article about Recorded Future that describes how they are using social media data, etc., to try to predict events that will happen. I think this isn’t an entirely crazy idea, but the thing that always strikes me about these sorts of projects is how hard it is to measure success. It is highly unlikely you will ever exactly predict a future event, so how do you define how close you were? For instance, if you predicted an uprising in Egypt, but missed by a month, is that a good or a bad prediction? 
  5. Seriously guys, this is getting embarrassing. An article appears in the New England Journal of Medicine “finding” an association between chocolate consumption and Nobel prize winners. This is, of course, a horrible statistical analysis, and unless it was published as a joke, it is irresponsible of the NEJM to publish it. I’ll bet any student in Stat 101 could find the huge flaws in this analysis. If the editors of the major scientific journals want to continue publishing statistical papers, they should get serious about statistical editing.

People in positions of power that don't understand statistics are a big problem for genomics

I finally got around to reading the IOM report on translational omics and it is very good. The report lays out problems with current practices and how these led to undesired results, such as the now-infamous Duke trials and the growth in retractions in the scientific literature. Specific recommendations are provided related to reproducibility and validation. I expect the report will improve things, although I think bigger improvements will come as a result of retirements.

In general, I think the field of genomics (a label that is used quite broadly) is producing great discoveries, and I strongly believe we are just getting started. But we can’t help but notice that retractions and questionable findings are particularly common in this field. In my view, most of the problems we are currently suffering stem from the fact that a substantial number of the people in positions of power do not understand statistics and have no experience with computing. Nevins’s biggest mistake was not admitting to himself that he did not understand what Baggerly and Coombes were saying. The lack of reproducibility just exacerbated the problem. The same is true for the editors that rejected the letters written by this pair in their effort to expose a serious problem - a problem that was obvious to all the statistics-savvy biologists I talked to.

Unfortunately, Nevins is not the only head of a large genomics lab that does not understand basic statistical principles and has no programming/data-management experience. So how do people without the statistical and computing skills necessary to be considered experts in genomics become leaders of the field? I think this is due to the speed at which Biology changed from a data-poor discipline to a data-intensive one. For example, before microarrays, the analysis of gene expression data amounted to spotting black dots on a piece of paper (see Figure A below). In the mid-90s this suddenly changed to sifting through tens of thousands of numbers (see Figure B).

Note that statistics is typically not a requirement of the Biology graduate programs associated with genomics. At Hopkins, neither of the two major programs (CMM and BCMB) requires it. And this is expected, since outside of genomics one can do great Biology without quantitative skills, and for most of the 20th century most Biology was like this. So when the genomics revolution first arrived, the great majority of powerful Biology lab heads had no statistical training whatsoever. Nonetheless, a few of these decided to delve into this “sexy” new field and, using their copious resources, were able to perform some of the first big experiments. Similarly, Biology journals that were not equipped to judge the data-analytic component of genomics papers were eager to publish papers in this field, a fact that further compounded the problem.

But as I mentioned above, in general, the field of genomics is producing wonderful results. Several lab heads did have statistics and computational expertise, while others formed strong partnerships with quantitative types. Here I should mention that for these partnerships to be successful, the statisticians also needed to expand their knowledge base. The quantitative half of the partnership needs to be biology- and technology-savvy, or they too can make mistakes that lead to retractions.

Nevertheless, the field is riddled with problems; enough to prompt an IOM report. But although the present is somewhat grim, I am optimistic about the future. The new generation of biologists leading the genomics field is clearly more knowledgeable about, and appreciative of, statistics and computing than the previous one. Natural selection helps, as these new investigators can’t rely on pre-genomics-revolution accomplishments, and those that do not possess these skills are simply outperformed by those that do. I am also optimistic because biology graduate programs are starting to incorporate statistics and computation into their curricula. For example, as of last year, our Human Genetics program requires our Biostats 615-616 course.


Nature is hiring a data editor...how will they make sense of the data?

It looks like the journal Nature is hiring a Chief Data Editor (link via Hilary M.). The primary purpose of this editor appears to be developing tools for collecting, curating, and distributing data, with the goal of improving reproducible research.

The main duties of the editor, as described by the ad, are: 

Nature Publishing Group is looking for a Chief Editor to develop a product aimed at making research data more available, discoverable and interpretable.

The ad also mentions having an eye for commercial potential; I wonder if this move was motivated by companies like figshare that are already providing a reproducible data service. I haven’t used figshare, but the early reports from friends who have are that it is great. 

The thing that bothered me about the ad is that there is a strong focus on data collection/storage/management but absolutely no mention of the second component of the data science problem: making sense of the data. To make sense of piles of data requires training in applied statistics (called by whatever name you like best). The ad doesn’t mention any such qualifications. 

Even if the goal of the position is just to build a competitor to figshare, it seems like a good idea for the person collecting the data to have some idea of what researchers are going to do with it. When dealing with data, those researchers will frequently be statisticians by one name or another. 

Bottom line: I’m stoked Nature is recognizing the importance of data in this very prominent way. But I wish they’d realize that a data revolution also requires a revolution in statistics. 


Statistics is not math...

Statistics depends on math, like a lot of other disciplines (physics, engineering, chemistry, computer science). But just like those other disciplines, statistics is not math; math is just a tool used to solve statistical problems. Unlike those other disciplines, statistics gets lumped in with math in headlines. Whenever people use statistical analysis to solve an interesting problem, the headline reads:

“Math can be used to solve amazing problem X”

or

“The Math of Y” 

Here are some examples:

The Mathematics of Lego - Using data on legos to estimate a distribution

The Mathematics of War - Using data on conflicts to estimate a distribution

Usain Bolt can run faster with maths (Tweet) - Turns out they analyzed data on start times to come to the conclusion

The Mathematics of Beauty - Analysis of data relating dating profile responses and photo attractiveness

These are just a few off the top of my head, but I regularly see headlines like this. I think there are a couple of reasons statistics gets grouped with math: (1) many of the founders of statistics were mathematicians first (but not all of them), (2) many statisticians still identify themselves as mathematicians, and (3) in some cases statistics and statisticians define themselves pretty narrowly. 

With respect to (3), consider the following list of disciplines:

  1. Biostatistics
  2. Data science
  3. Machine learning
  4. Natural language processing
  5. Signal processing
  6. Business analytics
  7. Econometrics
  8. Text mining
  9. Social science statistics
  10. Process control

All of these disciplines could easily be classified as “applied statistics”. But how many folks in each of those disciplines would classify themselves as statisticians? More importantly, how many would be claimed by statisticians? 


R and the little data scientist's predicament

I just read this fascinating post on _why, apparently a bit of a cult hero among enthusiasts of the Ruby programming language. One of the most interesting bits was The Little Coder’s Predicament, which, boiled down, essentially says that computer programming languages have grown too complex, so children/newbies can’t get instant gratification when they start programming. He suggested a simplified “gateway language” that would get kids fired up about programming, because with a simple line of code or two they could make the computer do things like play some music or make a video. 

I feel like there is a similar ramp up with data scientists. To be able to do anything cool/inspiring with data you need to know (a) a little statistics, (b) a little bit about a programming language, and (c) quite a bit about syntax. 

Wouldn’t it be cool if there were an R package that solved the little data scientist’s predicament? The package would have to have at least some of these properties:

  1. It would have to be easy to load data sets, one line of not complicated code. You could write an interface for RCurl/read.table/download.file for a defined set of APIs/data sets so the command would be something like: load(“education-data”) and it would load a bunch of data on education. It would handle all the messiness of scraping the web, formatting data, etc. in the background. 
  2. It would have to have a lot of really easy visualization functions. Right now, if you want to make pretty plots with ggplot(), plot(), etc. in R, you need to know all the syntax for pch, cex, col, etc. The plotting function should handle all this behind the scenes and make super pretty pictures. 
  3. It would be awesome if the functions would include some sort of dynamic graphics (with svgAnnotation or a wrapper for D3.js). Again, the syntax would have to be really accessible/not too much to learn. 

That alone would be a huge start. In just 2 lines kids could load and visualize cool data in a pretty way they could show their parents/friends. 


In the era of data what is a fact?

The Twitter universe is abuzz about this article in the New York Times. Arthur Brisbane, who responds to readers’ comments, asks 

I’m looking for reader input on whether and when New York Times news reporters should challenge “facts” that are asserted by newsmakers they write about.

He goes on to give a couple of examples of qualitative facts that reporters have used in stories without questioning the veracity of the claims. As many people pointed out in the comments, this is completely absurd. Of course reporters should check facts and report when the facts in their stories, or stated by candidates, are not correct. That is the purpose of news reporting. 

But I think the question is a little more subtle when it comes to quantitative facts and statistics. Depending on what subsets of data you look at, what summary statistics you pick, and the way you present information - you can say a lot of different things with the same data. As long as you report what you calculated, you are technically reporting a fact - but it may be deceptive. The classic example is calculating median vs. mean home prices. If Bill Gates is in your neighborhood, no matter what the other houses cost, the mean price is going to be pretty high! 
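A two-line computation shows how far apart the two summaries can get. All the prices here are invented for illustration:

```python
# Mean vs. median home prices: a single extreme value drags the mean
# far upward while the median barely moves. Prices are made up.
from statistics import mean, median

neighborhood = [250_000, 275_000, 290_000, 300_000, 310_000]
with_gates = neighborhood + [120_000_000]  # one Bill Gates-sized outlier

print(mean(neighborhood), median(neighborhood))  # mean 285,000; median 290,000
print(mean(with_gates), median(with_gates))      # mean jumps past 20 million; median barely moves
```

Both numbers are "facts" about the same neighborhood; which one gets reported determines the story the reader takes away.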

Two concrete things can be done to deal with the malleability of facts in the data age.

First, we need to require that our reporters, policy makers, politicians, and decision makers report the context of numbers they state. It is tempting to use statistics as blunt instruments, punctuating claims. Instead, we should demand that people using statistics to make a point embed them in the broader context. For example, in the case of housing prices, if a politician reports the mean home price in a neighborhood, they should be required to state that potential outliers may be driving that number up. How do we make this demand? By not believing any isolated statistics - statistics will only be believed when the source is quoted and the statistic is described.  

But this isn’t enough, since the context and statistics will be meaningless without raising overall statisteracy (statistical literacy, not to be confused with numeracy). In the U.S., literacy campaigns have been promoted by library systems. Statisteracy is becoming just as critical; the same level of social pressure and assistance should be applied to individuals who don’t know basic statistics as to those who don’t have basic reading skills. Statistical organizations, academic departments, and companies interested in analytics/data science/statistics all have a vested interest in raising the population’s statisteracy. Maybe a website dedicated to understanding the consequences of basic statistical concepts, rather than the concepts themselves?

And don’t forget to keep rating health news stories!


The Supreme Court's interpretation of statistical correlation may determine the future of personalized medicine


The Supreme Court heard oral arguments last week in the case Mayo Collaborative Services vs. Prometheus Laboratories (No. 10-1150). At issue is a patent Prometheus Laboratories holds for making decisions about the treatment of disease on the basis of a measurement of a specific, naturally occurring molecule and a corresponding calculation. The specific language at issue is a little technical, but the key claim from the patent under dispute is:

1. A method of optimizing therapeutic efficacy for treatment of an immune-mediated gastrointestinal disorder, comprising: 

(a) administering a drug providing 6-thioguanine to a subject having said immune-mediated gastrointestinal disorder; and 

(b) determining the level of 6-thioguanine in said subject having said immune-mediated gastrointestinal disorder,  

wherein the level of 6-thioguanine less than about 230 pmol per 8x10^8 red blood cells indicates a need to increase the amount of said drug subsequently administered to said subject and  

wherein the level of 6-thioguanine greater than about 400 pmol per 8x10^8 red blood cells indicates a need to decrease the amount of said drug subsequently administered to said subject.

So basically the patent is on a decision made about treatment on the basis of a statistical correlation. When the level of a specific molecule (6-thioguanine) is too high, the dose of a drug (thiopurine) should be decreased; if it is too low, the dose should be increased. Here (and throughout the post) correlation is interpreted loosely as a relationship between two variables, rather than in the strict sense of a linear relationship between two quantitative variables. 
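Stripped of the legal language, the claimed method is essentially a two-threshold decision rule. A sketch (the 230 and 400 pmol thresholds come from the claim text quoted above; the "maintain dose" branch for in-between levels is my assumption, since the claim only addresses levels below 230 and above 400):

```python
# The patent's claim, reduced to code: a dosing decision driven entirely
# by where one measurement falls relative to two thresholds
# (230 and 400 pmol per 8x10^8 red blood cells, from the claim text).

def thiopurine_dose_decision(level_pmol: float) -> str:
    """Return a dosing recommendation from a 6-thioguanine level.
    The middle 'maintain dose' branch is an assumption; the claim
    text only specifies the below-230 and above-400 cases."""
    if level_pmol < 230:
        return "increase dose"
    elif level_pmol > 400:
        return "decrease dose"
    return "maintain dose"

print(thiopurine_dose_decision(150))  # increase dose
print(thiopurine_dose_decision(300))  # maintain dose
print(thiopurine_dose_decision(450))  # decrease dose
```

That a few lines of conditional logic can express the entire claimed method is, in a sense, the heart of the dispute: is this a patentable process, or just an observation about nature plus "think about it"?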

This correlation between levels of 6-thioguanine and patient response was first reported by a group of academics in a 1996 paper. Prometheus developed a diagnostic test based on this correlation. Doctors (including those at the Mayo Clinic) would draw blood and send it to Prometheus, who would calculate the levels of 6-thioguanine and report them back. 

According to Mayo’s brief, some doctors at the Mayo who used this test decided it was possible to improve on it. So they developed their own diagnostic test, based on a different measurement of 6-thioguanine (6-TGN), and reported different information, including:

  • A blood reading greater than 235 picomoles of 6-TGN is a “target therapeutic range,” and a reading greater than 250 picomoles of 6-TGN is associated with remission in adult patients; and
  • A blood reading greater than 450 picomoles of 6-TGN indicates possible adverse health effects, but in some instances levels over 700 are associated with remission without significant toxicity, while a “clearly defined toxic level” has not been established; and
  • A blood reading greater than 5700 picomoles of 6-MMP is possibly toxic to the liver.

They subsequently created their own proprietary test and started to market it, at which point Prometheus sued the Mayo Clinic for infringement. The most recent decision on the case was made by a federal circuit court, which upheld Prometheus’ claim. A useful summary is here.

The arguments for the two sides are summarized in the briefs for each side. For Mayo:

Whether 35 U.S.C. § 101 is satisfied by a patent claim that covers observed correlations between blood test results and patient health, so that the patent effectively preempts use of the naturally occurring correlations, simply because well-known methods used to administer prescription drugs and test blood may involve “transformations” of body chemistry.

and for Prometheus:

Whether the Federal Circuit correctly held that concrete methods for improving the treatment of patients suffering from autoimmune diseases by using  individualized metabolite measurements to inform the calibration of the patient’s dosages of synthetic thiopurines are patentable processes under 35 U.S.C. §101. 

Basically, Prometheus claims that the patent covers cases where doctors observe a specific data point and make a decision about a specific drug on the basis of that data point and a known correlation with patient outcomes. Mayo, on the other hand, says that since the correlation between the data and the outcome is a naturally occurring process, it cannot be patented. 

In the oral arguments, the attorney for Mayo makes the claim that the test is only patentable if Prometheus specifies a specific level for 6-thioguanine and a specific treatment associated with that level (see pages 21-24 of the transcript). He then goes on to suggest that the Mayo would then be free to pick another level and another treatment option for their diagnostic test. Justice Breyer disagrees even with this specific option (see page 38 of the transcript and his fertilizer example). He has made this view known before in his dissent from the dismissal of the Labcorp writ of certiorari (a very similar case focusing on whether a correlation can be patented). 

Brief summary: Prometheus is trying to patent a correlation between a molecule’s level and treatment decisions. Mayo is claiming this is a natural process and can’t be patented.  

Implications for Personalized Medicine (a statistician’s perspective)

I believe this case has major potential consequences for the entire field of personalized medicine. The fundamental idea of personalized medicine is that treatment decisions for individual patients will be tailored on the basis of data collected about them and statistical calculations made on the basis of that data (i.e. correlations, or more complicated statistical functions).

According to my interpretation, if the Supreme Court rules in favor of Mayo in a broad sense, then this suggests that decisions about treatment made on the basis of data and correlation are not broadly patentable. In both the Labcorp dissent and the oral arguments for the Prometheus case, Justice Breyer argues that the process described by the patents:

…instructs the user to (1) obtain test results and (2) think about them. 

He suggests that these are natural correlations and hence cannot be patented, just the way a formula like E = mc^2 cannot be patented. The distinction seems subtle: whereas E = mc^2 is a formula that exactly describes a property of nature, the observed correlation is an empirical estimate of a parameter calculated on the basis of noisy data. 

From a statistical perspective, there is little difference between calculating a correlation and calculating something more complicated, like the Oncotype DX signature. Both return a score that can be used to determine treatment or other health care decisions. In some sense, they are both “natural phenomena” - one is just more complicated to calculate than the other. So it is not surprising that Genomic Health, the developers of Oncotype, have filed an amicus brief in favor of Prometheus. 

Once a score is calculated, regardless of the level of complication in calculating that score, the personalized decision still comes down to a decision made by a doctor on the basis of a number. So if the court broadly decides in favor of Mayo, from a statistical perspective, this would seemingly pre-empt patenting any personalized medicine decision made on the basis of observing data and making a calculation. 

Unlike traditional medical procedures like surgery, or treatment with a drug, these procedures are based on data and statistics. But in the same way, a very specific set of operations and decisions is taken with the goal of improving patient health. If these procedures are broadly ruled as simply “natural phenomena”, it suggests that the development of personalized decision making strategies is not, itself, patentable. This decision would also have implications for other companies that use data and statistics to make money, like software giant SAP, which has also filed an amicus brief in support of the federal circuit court opinion (and hence Prometheus).

A large component of medical treatment in the future will likely be made on the basis of data and statistical calculations on those data - that is the goal of personalized medicine. So the Supreme Court’s decision about the patentability of correlation has seemingly huge implications for any decision made on the basis of data and statistical calculations. Regardless of the outcome, this case lends even further weight to the idea that statistical literacy is critical, including for Supreme Court justices. 

Simply Statistics will be following this case closely; look for more in depth analysis in future blog posts. 


Citizen science makes statistical literacy critical

In today’s Wall Street Journal, Amy Marcus has a piece on the Citizen Science movement, focusing in particular on citizen science in health. I am fully in support of this enthusiasm and a big fan of citizen science - if done properly. There have already been some pretty big success stories. As more companies like Fitbit and 23andMe spring up, it is really easy to collect data about yourself (right, Chris?). At the same time, organizations like Patients Like Me make it possible for people with specific diseases or experiences to self-organize. 

But the thing that struck me the most in reading the article is the importance of statistical literacy for citizen scientists, reporters, and anyone reading these articles. For example, the article says:

The questions that most people have about their DNA—such as what health risks they face and how to prevent them—aren’t always in sync with the approach taken by pharmaceutical and academic researchers, who don’t usually share any potentially life-saving findings with the patients.

I think it’s pretty unlikely that any organization would hide life-saving findings from the public. My impression from reading the article is that this statement refers to keeping results blinded from patients/doctors during an experiment or clinical trial. Blinding is a critical component of clinical trials that reduces many potential sources of bias in the results of a study. Obviously, once the trial/study has ended (or been stopped early because a treatment is effective), the results are quickly disseminated.

Several key statistical issues are then raised in bullet-point form without discussion: 

Amateurs may not collect data rigorously, they say, and may draw conclusions from sample sizes that are too small to yield statistically reliable results. 

Having individuals collect their own data poses other issues. Patients may enter data only when they are motivated, or feeling well, rendering the data useless. In traditional studies, both doctors and patients are typically kept blind as to who is getting a drug and who is taking a placebo, so as not to skew how either group perceives the patients’ progress.

The article goes on to describe an anecdotal example of citizen science, one that suffers from a key statistical problem (small sample size):

Last year, Ms. Swan helped to run a small trial to test what type of vitamin B people with a certain gene should take to lower their levels of homocysteine, an amino acid connected to heart-disease risk. (The gene affects the body’s ability to metabolize B vitamins.)

Seven people—one in Japan and six, including herself, in her local area—paid around $300 each to buy two forms of vitamin B and Centrum, which they took in two-week periods followed by two-week “wash-out” periods with no vitamins at all.

The article points out the issue:

The scientists clapped politely at the end of Ms. Swan’s presentation, but during the question-and-answer session, one stood up and said that the data was not statistically significant—and it could be harmful if patients built their own regimens based on the results.

But it doesn’t carefully explain the importance of sample size, suggesting instead that the only reason you need more people is to “insure better accuracy”. 
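The sample-size point deserves more than "better accuracy": the standard error of a mean shrinks like 1/sqrt(n), so a seven-person trial produces confidence intervals far too wide to rule much out. A rough sketch, using an assumed (made-up) standard deviation for the homocysteine change:

```python
# Why "more people" matters: the uncertainty in an estimated mean
# shrinks like 1/sqrt(n), so small studies give wide confidence
# intervals. The standard deviation here is an assumption for
# illustration, not a value from Ms. Swan's trial.
import math

def ci_halfwidth(sd, n, z=1.96):
    """Approximate 95% CI half-width for a sample mean."""
    return z * sd / math.sqrt(n)

sd = 4.0  # assumed SD of the homocysteine change (umol/L)
for n in [7, 70, 700]:
    print(n, round(ci_halfwidth(sd, n), 2))
# half-width shrinks from ~2.96 (n=7) to ~0.30 (n=700)
```

With seven participants, any effect smaller than roughly three units is statistically indistinguishable from no effect at all, which is what the scientist in the audience was getting at.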

It strikes me that statistical literacy is critical if the citizen science movement is going to go forward. Ideas like experimental design, randomization, blinding, placebos, and sample size need to be in the toolbox of any practicing citizen scientist. 

One major drawback is that there are very few places where the general public can learn about statistics. Mostly, statistics is taught in university courses. Resources like the Khan Academy and the Cartoon Guide to Statistics exist, but they are only really useful if you are self-motivated and have some idea of math/statistics to begin with. 

Since knowledge of basic statistical concepts is quickly becoming indispensable for citizen science or even basic life choices like deciding on healthcare options, do we need “adult statistical literacy courses”? These courses could focus on the basics of experimental design and how to understand results in stories about science in the popular press. It feels like it might be time to add a basic understanding of statistics and data to reading/writing/arithmetic as critical life skills. I’m not the only one who thinks so.