Simply Statistics


Welcome to the Smog-ocalypse

Beijing fog, 2013

Recent reports of air pollution levels out of Beijing are very very disturbing. Levels of fine particulate matter (PM2.5, or PM less than 2.5 microns in diameter) have reached unprecedented levels. So high are the levels that even the official media are allowed to mention it.

Here is a photograph of downtown Beijing during the day (Thanks to Sarah E. Burton for the photograph). Hourly levels of PM2.5 hit over 900 micrograms per cubic meter in some parts of the city and 24-hour average levels (the basis for most air quality standards) reached over 500 micrograms per cubic meter. Just for reference, the US national ambient air quality standard for the 24-hour average level of PM2.5 is 35 micrograms per cubic meter.

Below is a plot of the PM2.5 data taken from the US Embassy's rooftop monitor.


The solid circles indicate the 24-hour average for the day. The red line is the median of the daily averages for the time period in the plot (about 6 weeks) and the dotted blue line is the US 24-hour national ambient air quality standard. The median for the period was about 91 micrograms per cubic meter.

First, it should be noted that a "typical" day of 91 micrograms per cubic meter is still crazy. But suppose we take 91 to be a typical day. Then in a city like Beijing, which has about 20 million people, if we assume that about 700 people die on a typical day, the last 5 days alone would have produced about 307 excess deaths from all causes. I get this using a rough estimate of a 0.3% increase in all-cause mortality per 10 microgram per cubic meter increase in PM2.5 levels (studies from China and the US tend to report risks in roughly this range). The 700 deaths per day figure is a back-of-the-envelope number that I got simply by comparison to other major cities. Numbers for things like excess hospitalizations will be higher because both the risks and the baselines are higher. For example, in the US, we estimate about a 1.28% increase in heart failure hospitalization for a 10 microgram per cubic meter increase in PM2.5.
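The arithmetic above can be sketched in a few lines. The 0.3% per 10 microgram risk and the 700 baseline deaths come from the text; the 380 microgram-per-cubic-meter five-day average is a hypothetical level I chose for illustration, not a measured value:

```python
def excess_deaths(daily_pm25, baseline_deaths=700, reference=91,
                  risk_per_10ug=0.003):
    """Excess all-cause deaths over a run of days, relative to a
    'typical' reference PM2.5 level (micrograms per cubic meter)."""
    return sum(baseline_deaths * risk_per_10ug * (level - reference) / 10
               for level in daily_pm25)

# Five hypothetical days averaging 380 micrograms per cubic meter:
print(excess_deaths([380] * 5))
```

With five-day levels around 380, this gives roughly 300 excess deaths, in the same ballpark as the ~307 quoted above.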

If you like, you can also translate current levels into numbers of cigarettes smoked. If you assume a typical adult inhales about 18-20 cubic meters of air per day, then over the last 5 days the average Beijinger smoked about 3 cigarettes just by getting out of bed in the morning.
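The cigarette conversion works the same way. The ~12 milligrams of PM2.5 delivered per cigarette used below is a commonly cited figure that I'm assuming for illustration, as is the hypothetical 380 microgram-per-cubic-meter five-day average:

```python
def cigarette_equivalents(pm25_ug_m3, days=1, air_m3_per_day=19,
                          mg_per_cigarette=12):
    """Convert a PM2.5 exposure into an equivalent number of cigarettes."""
    inhaled_mg = pm25_ug_m3 * air_m3_per_day * days / 1000  # micrograms -> mg
    return inhaled_mg / mg_per_cigarette

# Five days at a hypothetical 380 micrograms per cubic meter:
print(cigarette_equivalents(380, days=5))  # roughly 3 cigarettes
```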

Lastly, I want to point to a nice series of photos that the Guardian has collected on the (in)famous London Fog of 1952. Although the levels were quite a bit worse back then (about 2-3 times worse, if you can believe it), the photos bear a striking resemblance to today's Beijing.

At least in the US, the infamous smog episodes that occurred regularly only 60 years ago are pretty much non-existent. But in many places around the world, "crazy bad" air pollution is part of everyday life.


Sunday data/statistics link roundup (1/13/2013)

  1. These are some great talks. But definitely watch Michael Eisen's talk on E-biomed and the history of open access publication. This is particularly poignant in light of Aaron Swartz's tragic suicide. It's also worth checking out the twitter hashtag #pdftribute .
  2. An awesome flowchart shown before a talk given by the creator of R twotorials. Roger gets a shoutout (via civilstat).
  3. This blog selects a position at random on the planet earth every day and posts the picture taken closest to that point. Not much about the methodology on the blog, but totally fascinating and a clever idea.
  4. A set of data giving a "report card" for each state on how that state does in improving public education for students. I'm not sure I believe the grades, but the underlying reports look interesting.

NSF should understand that Statistics is not Mathematics

NSF has realized that the role of Statistics is growing in all areas of science and engineering and has formed a subcommittee to examine the current structure of support of the statistical sciences.  As Roger explained in August, the NSF is divided into directorates composed of divisions. Statistics is in the Division of Mathematical Sciences (DMS) within the Directorate for Mathematical and Physical Sciences. Within this division it is a Disciplinary Research Program along with Topology, Geometric Analysis, etc.. To statisticians this does not make much sense, and my first thought when asked for recommendations was that we need a proper division. But the committee is seeking out recommendations that

[do] not include renaming of the Division of Mathematical Sciences. Particularly desired are recommendations that can be implemented within the current divisional and directorate structure of NSF.

This clarification is there because former director Sastry Pantula suggested DMS change its name to "Division of Mathematical and Statistical Sciences". The NSF shot down this idea and listed this as one of the reasons:

If the name change attracts more proposals to the Division from the statistics community, this could draw funding away from other subfields

So NSF does not want to take away from the other math programs, which is understandable given the current levels of research funding for Mathematics. But this being the case, I can't really think of a recommendation other than giving Statistics its own division or giving data-related sciences their own directorate. Increasing support for the statistical sciences means increasing funding. You secure the necessary funding either by asking Congress for a bigger budget (good luck with that) or by cutting from other divisions, not just Mathematics. A new division makes sense not only in practice but also in principle, because Statistics is not Mathematics.

Statistics is analogous to other disciplines that use mathematics as a fundamental language, like Physics, Engineering, and Computer Science. But like those disciplines, Statistics contributes separate and fundamental scientific knowledge. While the field of applied mathematics tries to explain the world with deterministic equations, Statistics takes a dramatically different approach. In highly complex systems, such as the weather, mathematicians battle Laplace's demon and struggle to explain nature using mathematics derived from first principles. Statisticians accept that deterministic approaches are not always useful and instead develop and rely on random models. These two approaches are both important, as demonstrated by the improvements in meteorological predictions achieved once data-driven statistical models were used to complement deterministic mathematical models.

Although Statisticians rely heavily on theoretical/mathematical thinking, another important distinction from Mathematics is that advances in our field are almost exclusively driven by empirical work. Statistics always starts with a specific, concrete real world problem: we thrive in Pasteur's quadrant. Important theoretical work that generalizes our solutions always follows. This approach, built mostly by basic researchers, has yielded some of the most useful concepts relied upon by modern science: the p-value, randomization, analysis of variance, regression, the proportional hazards model, causal inference, Bayesian methods, and the bootstrap, just to name a few examples. These ideas were instrumental in the most important genetic discoveries, improving agriculture, the inception of the empirical social sciences, and revolutionizing medicine via randomized clinical trials. They have also fundamentally changed the way we abstract quantitative problems from real data.

The 21st century brings the era of big data, and distinguishing Statistics from Mathematics becomes more important than ever. Many areas of science are now being driven by new measurement technologies. Insights are being made by discovery-driven, as opposed to hypothesis-driven, experiments. Although testing hypotheses developed theoretically will of course remain important to science, just as Leeuwenhoek became the father of microbiology by looking through the microscope without theoretical predictions, the era of big data will enable discoveries that we have not yet even imagined. However, it is naive to think that these new datasets will be free of noise and unwanted variability. Deterministic models alone will almost certainly fail at extracting useful information from these data, just as they have failed at predicting complex systems like the weather. The advancement in science during the era of big data that the NSF wants to see will only happen if the field that specializes in making sense of data is properly recognized as separate from Mathematics and appropriately supported.

Addendum: On a related note, NIH just announced that they plan to recruit for a new senior scientific position: the Associate Director for Data Science.


The landscape of data analysis

I have been getting some questions via email, LinkedIn, and Twitter about the content of the Data Analysis class I will be teaching for Coursera. Data Analysis and Data Science mean different things to different people. So I made a video describing how Data Analysis fits into the landscape of other quantitative classes here:

Here is the corresponding presentation. I also made a tentative list of topics we will cover, subject to change at the instructor's whim. Here it is:

  • The structure of a data analysis  (steps in the process, knowing when to quit, etc.)
  • Types of data (census, designed studies, randomized trials)
  • Types of data analysis questions (exploratory, inferential, predictive, etc.)
  • How to write up a data analysis (compositional style, reproducibility, etc.)
  • Obtaining data from the web (through downloads mostly)
  • Loading data into R from different file types
  • Plotting data for exploratory purposes (boxplots, scatterplots, etc.)
  • Exploratory statistical models (clustering)
  • Statistical models for inference (linear models, basic confidence intervals/hypothesis testing)
  • Basic model checking (primarily visually)
  • The prediction process
  • Study design for prediction
  • Cross-validation
  • A couple of simple prediction models
  • Basics of simulation for evaluating models
  • Ways you can fool yourself and how to avoid them (confounding, multiple testing, etc.)
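For a taste of one topic on the list, here is a minimal sketch of k-fold cross-validation in plain Python, using a mean-only "model" so no libraries are needed (the function names and data are made up for illustration):

```python
import random

def k_fold_indices(n, k=5, seed=42):
    """Split indices 0..n-1 into k roughly equal random folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_error(y, k=5):
    """k-fold cross-validated squared error of a mean-only model."""
    errors = []
    for fold in k_fold_indices(len(y), k):
        held_out = set(fold)
        train = [y[i] for i in range(len(y)) if i not in held_out]
        prediction = sum(train) / len(train)  # the "model": the training mean
        errors += [(y[i] - prediction) ** 2 for i in fold]
    return sum(errors) / len(errors)
```

The key idea, which we will cover in class, is that each observation is predicted by a model that never saw it during fitting.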

Of course that is a ton of material for 8 weeks, so obviously we will be covering just the very basics. I think it is really important to remember that being a good data analyst is like being a good surgeon or writer. There is no such thing as a prodigy in surgery or writing, because both require long experience, trying lots of things out, and learning from mistakes. I hope to give people the basic information they need to get started and point to resources where they can learn more. I also hope to give them a chance to practice the basics a couple of times and to learn that in data analysis the first goal is to "do no harm".


By introducing competition open online education will improve teaching at top universities

It is no secret that faculty evaluations at top universities weigh research much more than teaching. This is not surprising given that, among other reasons,  global visibility comes from academic innovation (think Nobel Prizes) not classroom instruction. Come promotion time the peer review system carefully examines your publication record and ability to raise research funds. External experts within your research area are asked if you are a leader in the field. Top universities maintain their status by imposing standards that lead to a highly competitive environment in which only the most talented researchers survive.

However, the assessment of teaching excellence is much less stringent. Unless they reveal utter incompetence, teaching evaluations are practically ignored, especially if you have graduated numerous PhD students. Certainly, outside experts are not asked about your teaching. This imbalance in incentives explains why faculty use research funding to buy out of teaching and why highly recruited candidates negotiate low teaching loads.

Top researchers end up at top universities, but being good at research does not necessarily mean you are a good teacher. Furthermore, the effort required to be a competitive researcher leaves limited time for class preparation. To make matters worse, within a university, faculty have a monopoly on the classes they teach. With few incentives and practically no competition, it is hard to believe that top universities are doing the best they can when it comes to classroom instruction. By introducing competition, MOOCs might change this.

To illustrate, say you are the chair of a soft-money department in 2015. Four of your faculty receive 25% funding to teach the big Stat 101 class and your graduate program's three main classes. But despite being great researchers, these four are mediocre teachers. So why are they teaching if 1) a MOOC exists for each of these classes and 2) these professors can easily cover 100% of their salary with research funds? As chair, not only do you wonder why not let these four profs focus on what they do best, but also why your department is not creating MOOCs and getting global recognition for it. So instead of hiring 4 great researchers who are mediocre teachers, why not hire (for the same cost) 4 great researchers (fully funded by grants) and 1 great teacher (funded with tuition $)? I think in the future tenure-track positions will be divided into top researchers doing mostly research and top teachers doing mostly classroom teaching and MOOC development. Because top universities will feel the pressure to compete and develop the courses that educate the world, there will be no room for mediocre teaching.



Sunday data/statistics link roundup (1/6/2013)

  1. Not really statistics, but this is an interesting article about how rational optimization by individual actors does not always lead to an optimal solution. Relatedly, here is the coolest street sign I think I've ever seen, with a heatmap of traffic density to try to influence commuters.
  2. An interesting paper that talks about how clustering is only a really hard problem when there aren't obvious clusters. I was a little disappointed in the paper, because it defines the "obviousness" of clusters only theoretically, via a distance metric. There is very little discussion of the practical distance/visual distance metrics people use when looking at clustering dendrograms, etc.
  3. A post about the two cultures of statistical learning and a related post on how data-driven science is a failure of imagination. I think in both cases, it is worth pointing out that the only good data science is good science - i.e. it seeks to answer a real, specific question through the scientific method. However, I think for many modern scientific problems it is pretty naive to think we will be able to come to a full, mechanistic understanding complete with tidy theorems that describe all the properties of the system. I think the real failure of imagination is to think that science/statistics/mathematics won't change to tackle the realistic challenges posed in solving modern scientific problems.
  4. A graph that shows the incredibly strong correlation ( > 0.99!) between the growth of autism diagnoses and organic food sales. Another example where even really strong correlation does not imply causation.
  5. The Buffalo Bills are going to start an advanced analytics department (via Rafa and Chris V.), maybe they can take advantage of all this free play-by-play data from years of NFL games.
  6. A prescient interview with Isaac Asimov on learning, predicting the Khan Academy, MOOCs, and other developments in online learning (via Rafa and Marginal Revolution).
  7. The statistical software signal - what your choice of software says about you. Just another reason we need a deterministic statistical machine.



Does NIH fund innovative work? Does Nature care about publishing accurate articles?

Editor's Note: In a recent post we disagreed with a Nature article claiming that NIH doesn't support innovation. Our colleague Steven Salzberg actually looked at the data and wrote the guest post below. 

Nature published an article last month with the provocative title "Research grants: Conform and be funded."  The authors looked at papers with over 1000 citations to find out whether scientists "who do the most influential scientific work get funded by the NIH."  Their dramatic conclusion, widely reported, was that only 40% of such influential scientists get funding.

Dramatic, but wrong.  I re-analyzed the authors' data and wrote a letter to Nature, which was published today along with the authors' response, which more or less ignored my points.  Unfortunately, Nature cut my already-short letter in half, so what readers see in the journal omits half my argument.  My entire letter is published here, thanks to my colleagues at Simply Statistics.  I titled it "NIH funds the overwhelming majority of highly influential original science results," because that's what the original study should have concluded from its very own data.  Here goes:

To the Editors:

In their recent commentary, "Conform and be funded," Joshua Nicholson and John Ioannidis claim that "too many US authors of the most innovative and influential papers in the life sciences do not receive NIH funding."  They support their thesis with an analysis of 200 papers sampled from 700 life science papers with over 1,000 citations.  Their main finding was that only 40% of "primary authors" on these papers are PIs on NIH grants, from which they argue that the peer review system "encourage[s] conformity if not mediocrity."

While this makes for an appealing headline, the authors' own data does not support their conclusion.  I downloaded the full text for a random sample of 125 of the 700 highly cited papers [data available upon request].  A majority of these papers were either reviews (63), which do not report original findings, or not in the life sciences (17) despite being included in the authors' database.  For the remaining 45 papers, I looked at each paper to see if the work was supported by NIH.  In a few cases where the paper did not include this information, I used the NIH grants database to determine if the corresponding author has current NIH support.  34 out of 45 (75%) of these highly-cited papers were supported by NIH.  The 11 papers not supported included papers published by other branches of the U.S. government, including the CDC and the U.S. Army, for which NIH support would not be appropriate.  Thus, using the authors' own data, one would have to conclude that NIH has supported a large majority of highly influential life sciences discoveries in the past twelve years.

The authors – and the editors at Nature, who contributed to the article – suffer from the same biases that Ioannidis himself has often criticized.  Their inclusion of inappropriate articles and especially the choice to require that both the first and last author be PIs on an NIH grant, even when the first author was a student, produced an artificially low number that misrepresents the degree to which NIH supports innovative original research.

It seems pretty clear that Nature wanted a headline about how NIH doesn't support innovation, and Ioannidis was happy to give it to them.  Now, I'd love it if NIH had the funds to support more scientists, and I'd also be in favor of funding at least some work retrospectively - based on recent major achievements, for example, rather than proposed future work.  But the evidence doesn't support the "Conform and be funded" headline, however much Nature might want it to be true.


The scientific reasons it is not helpful to study the Newtown shooter's DNA

The Connecticut Medical Examiner has asked to sequence and study the DNA of the recent Newtown shooter. I've been seeing this pop up over the last few days on a lot of popular media sites, where they mention some objections scientists (or geneticists) may have to this "scientific" study. But I haven't seen the objections explicitly laid out anywhere. So here are mine.

Ignoring the fundamentals of the genetics of complex disease: If the violent behavior of the shooter has any genetic underpinning, it is complex. If you only look at one person's DNA, without a clear behavior definition (violent? mental disorder? etc.), it is impossible to assess important complications such as penetrance, epistasis, and gene-environment interactions, to name a few. These make statistical analysis incredibly complicated even in huge, well-designed studies.

Small Sample Size: One person hit on the issue that is maybe the biggest reason this is a waste of time and likely to lead to incorrect results: you can't draw a reasonable conclusion about any population by looking at only one individual. This is actually a fundamental component of statistical inference. The goal of statistical inference is to take a small, representative sample and use data from that sample to say something about the bigger population. In this case, there are two reasons that the usual practice of statistical inference can't be applied: (1) only one individual is being considered, so we can't measure anything about how variable (or accurate) the data are, and (2) we've picked one incredibly high-profile, and almost certainly not representative, individual to study.

Multiple testing/data dredging: The small sample size problem is compounded by the fact that we aren't looking at just one or two of the shooter's genes, but rather the whole genome. To see why making statements about violent individuals based on only one person's DNA is a bad idea, think about the roughly 20,000 genes in the human genome. Let's suppose that only one of these genes causes violent behavior (it is definitely more complicated than that) and that there is no environmental cause of the violent behavior (clearly false). Furthermore, suppose that if you have the bad version of the violent gene you will do something violent in your life (almost definitely not a sure thing).

Now, even with all these simplifying (and incorrect) assumptions, for each gene you effectively flip a coin with a different chance of coming up heads. The violent gene turned up tails, but so did a large number of other genes. If we compare the set of genes that came up tails to the corresponding set in another individual, the two will have a huge number in common in addition to the violent gene. So based on this information alone, you would have no idea which gene causes violence, even in this hugely simplified scenario.
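The coin-flip analogy is easy to simulate. A toy sketch (the 20,000 gene count comes from the text; the fair-coin probability is my simplifying assumption):

```python
import random

random.seed(1)
N_GENES = 20_000

def tails_genes(p=0.5):
    """Simulate one person's coin flips: the set of genes that came up tails."""
    return {g for g in range(N_GENES) if random.random() < p}

shooter = tails_genes()
other = tails_genes()
shared = shooter & other
print(len(shooter), len(shared))
```

With these settings each simulated person has about 10,000 "tails" genes and any two people share about 5,000 of them, so the set of shared genes tells you essentially nothing about which one matters.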

Heavy reliance on prior information/intuition: This is a supposedly scientific study, but the small sample size/multiple testing problems mean any conclusions from the data will be very very weak. The only thing you could do is take the set of genes you found and then rely on previous studies to try to determine which one is the "violence gene". But now you are being guided by intuition, guesswork, and a bunch of studies that may or may not be relevant. The result is that more than likely you'd end up on the wrong gene.

The result is that it is highly likely that no solid statistical information will be derived from this experiment. Sometimes, just because the technology exists to run an experiment, doesn't mean that experiment will teach us anything.


Fitbit, why can't I have my data?

I have a Fitbit. I got it because I wanted to collect some data about myself and I liked the simplicity of the set-up. I also asked around and Fitbit seemed like the most "open" platform for collecting one's own data. You have to pay $50 for a premium account, but after that, they allow you to download your data.

Or do they?

I looked into the details, asked a buddy or two, and found out that you actually can't get the really interesting minute-by-minute data even with a premium account. You only get the daily summarized totals for steps/calories/stairs climbed. While this data is of some value, the minute-by-minute data are oh so much more interesting. I'd like to use it for personal interest, for teaching, for research, and for sharing interesting new ideas back to other Fitbit developers.

Since I'm not easily dissuaded, I tried another route. I created an application that accessed the Fitbit API. After fiddling around a bit with a few R packages, I was able to download my daily totals. But again, no minute-by-minute data. I looked into it and found that only Fitbit Partners have access to the intraday data. So I emailed Fitbit to ask if I could be a partner app. So far, no word.

I guess it is true: if you aren't paying for it, you are the product. But honestly, I'm just not that interested in being a product for Fitbit. So I think I'm bailing until I can download intraday data - I'm even happy to pay for it. If anybody has a suggestion of a more open self-monitoring device, I'd love to hear about it.


Happy 2013: The International Year of Statistics

The ASA has declared 2013 to be the International Year of Statistics and I am ready to celebrate it in full force. It is a great time to be a statistician and I am hoping more people will join the fun. In fact, as we like to point out in this blog, Statistics has already been at the center of many exciting accomplishments of the 21st century. Sabermetrics has become a standard approach and inspired the Hollywood movie Moneyball. Friend of the blog Chris Volinsky, a PhD statistician, led the team that won the Netflix million-dollar prize. Nate Silver et al. proved the pundits wrong by, once again, using statistical models to predict election results almost perfectly. R has become one of the most widely used programming languages in the world. Meanwhile, in academia, the number of statisticians becoming leaders in fields like environmental sciences, human genetics, genomics, and social sciences continues to grow. It is no surprise that stats majors at Harvard have more than quadrupled since 2000 and that statistics MOOCs are among the most popular.

The unprecedented advances in digital technology during the second half of the 20th century has produced a measurement revolution that is transforming the world. Many areas of science are now being driven by new measurement technologies and many insights are being made by discovery-driven, as opposed to hypothesis-driven, experiments. Empiricism is back with a vengeance. The current scientific era is defined by its dependence on data and the statistical methods and concepts developed during the 20th century provide an incomparable toolbox to help tackle current challenges. The toolbox, along with computer science, will also serve as a base for the methods of tomorrow.  So I will gladly join the Year of Statistics' festivities during 2013 and beyond, during the era of data-driven science.