Simply Statistics

08 Dec

Plotting BeijingAir Data

Here’s a bit of R code for scraping the BeijingAir Twitter feed and plotting the hourly PM2.5 values for the past 24 hours. The script defaults to the past 24 hours, but you can modify that by changing the value of the variable ‘n’.

You can just grab the code from this R script. Note that you need to use the latest version of the ‘twitteR’ package because the data structure has changed from previous versions.
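In case the link breaks, here is a minimal sketch of the approach (not the original script). It assumes the ‘twitteR’ package and that each @BeijingAir tweet reports a timestamp and a PM2.5 concentration in semicolon-separated fields; you may need to adjust the parsing to match the actual feed format.

    library(twitteR)

    n <- 24  ## number of hourly tweets to pull; increase to look further back
    tweets <- userTimeline("BeijingAir", n = n)
    txt <- sapply(tweets, function(tw) tw$getText())

    ## Keep only the hourly PM2.5 reports and split out the fields
    ## (assumed format: "MM-DD-YYYY HH:MM; PM2.5; concentration; AQI; label")
    pm <- txt[grep("PM2\\.5", txt)]
    fields <- strsplit(pm, ";")
    time <- as.POSIXct(strptime(sapply(fields, "[", 1), "%m-%d-%Y %H:%M"))
    value <- as.numeric(sapply(fields, "[", 3))

    ## Plot the hourly concentrations
    plot(time, value, type = "l", xlab = "Time",
         ylab = "PM2.5 (mcg/m^3)", main = "Hourly PM2.5 in Beijing")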

Using a modified version of the code in the script, I made a plot of the 24-hour average PM2.5 levels in Beijing over the last 2 months or so. The dashed line shows the US national ambient air quality standard for 24-hour average PM2.5. Note that the plot below shows 24-hour averages, so it is comparable to the US standard and also looks (somewhat) less extreme than the hourly values.
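For reference, here is a sketch of how one might compute those 24-hour averages from the hourly ‘time’ and ‘value’ vectors in the snippet above and add the US standard as a dashed line (again, not the original code; the 35 mcg/m^3 cutoff is the US 24-hour standard mentioned above):

    ## 24-hour (daily) averages from the hourly values
    day <- as.Date(time)
    daily <- tapply(value, day, mean, na.rm = TRUE)

    plot(as.Date(names(daily)), daily, type = "b", xlab = "Date",
         ylab = "24-hour average PM2.5 (mcg/m^3)")
    abline(h = 35, lty = 2)  ## US 24-hour PM2.5 standard (dashed)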

06 Dec

Beijing Air (cont'd)

Following up a bit on my previous post on air pollution in Beijing, China, my brother forwarded me a link to some work conducted by Steven Q. Andrews on comparing particulate matter (PM) air pollution in China versus Europe and the US. China does not officially release fine PM measurements (PM2.5) and furthermore does not have an official standard for that metric. In the US, PM standards are generally focused on PM2.5 now as opposed to PM10 (which includes coarse thoracic particles). Apparently, China is proposing a standard for PM2.5 but it has not yet been implemented.

The main issue seems to be that China has a somewhat different opinion about what it means to be a “bad” pollution day. In the US, the daily average national ambient air quality standard for PM2.5 is 35 mcg/m^3, whereas the proposed standard in China is 75 mcg/m^3. The WHO recommends PM2.5 levels be below 25 mcg/m^3. In China, days under 35 would be considered “excellent” and days under 75 would be considered “good”.
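As a toy illustration of how the two labeling schemes diverge, here is how a few hypothetical daily averages would be classified under each (the values and the “above good” label are made up for illustration; the cutoffs come from the standards above):

    ## Hypothetical daily average PM2.5 values (mcg/m^3)
    daily.pm <- c(20, 40, 60, 80, 150)

    ## Proposed Chinese labels: <= 35 "excellent", <= 75 "good"
    china <- cut(daily.pm, breaks = c(0, 35, 75, Inf),
                 labels = c("excellent", "good", "above good"))

    ## US 24-hour standard: 35 mcg/m^3
    us <- ifelse(daily.pm <= 35, "meets US standard", "exceeds US standard")

    data.frame(daily.pm, china, us)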

It’s a bit difficult to understand what this means here because in the US we so rarely see days where the daily average is above 75 mcg/m^3. In fact, for the period 1999-2008, if you look across the entire PM2.5 monitoring network for the US, you see that 99% of days fell below the level of 75 mcg/m^3. So seeing a day like that would be quite a rare event indeed.

The Chinese government has consistently claimed that air pollution has improved over time. But Andrews notes

…these so-called improvements are due to irregularities in the monitoring and reporting of air quality – and not to less polluted air. Most importantly, the government changed monitoring station locations twice. In 2006, it shut down the two most polluted stations and then, in 2008, began monitoring outside the city, beyond the sixth ring road, which is 15 to 20 kilometres from Beijing’s centre.

Andrews has previously published on inconsistencies between Beijing’s claims of “blue sky days” and the actual monitoring of PM in a paper in Environmental Research Letters. That paper showed an unusually high number of measurements falling just below the cutoff for a “blue sky day”. The reason for this pattern is not clear but it raises questions about the quality of the official monitoring data.

China has a novel propagandistic approach to air pollution regulation, which is to separate the data from the interpretation. So a day that has PM2.5 levels at 75 mcg/m^3 is called “good” and as long as you have a lot of “good” or “excellent” days, then you are set. The problem is that you can call something a “blue sky day” or whatever you want, but people still have to suffer the real consequences of high PM days. It’s hard to “relabel” increased asthma attacks, irritated respiratory tracts, and myocardial infarctions.

Andrews notes

As the China Daily recently wrote: “All of the residents in the city are aware of the poor air quality, so it does not make sense to conceal it for fear of criticism.”

Maybe the best way to conceal the air pollution is to actually get rid of it?

05 Dec

Preventing Errors through Reproducibility

Checklist mania has hit clinical medicine thanks to people like Peter Pronovost and many others. The basic idea is that simple and short checklists along with changes to clinical culture can prevent major errors from occurring in medical practice. One particular success story is Pronovost’s central line checklist which dramatically reduced bloodstream infections in hospital intensive care units.  

There are three important points about the checklist. First, it neatly summarizes information, bringing the latest evidence directly to clinical practice. It is easy to follow because it is short. Second, it serves to slow you down from whatever you’re doing. Before you cut someone open for surgery, you stop for a second and run the checklist. Third, it is a kind of equalizer that subtly changes the culture: everyone has to follow the checklist, no exceptions. A number of studies have now shown that when clinical units follow checklists, infection rates go down and hospital stays are shorter compared to units using standard procedures. 

Here’s a question: What would it take to convince you that an article’s results were reproducible, short of going in and reproducing the results yourself? I recently raised this question in a talk I gave at the Applied Mathematics Perspectives conference. At the time I didn’t get any responses, but I’ve had some time to think about it since then.

I think most people are thinking of this issue along the lines of “The only way I can confirm that an analysis is reproducible is to reproduce it myself”. In order for that to work, everyone needs to have the data and code available to them so that they can do their own independent reproduction. Such a scenario would be sufficient (and perhaps ideal) to claim reproducibility, but is it strictly necessary? For example, if I reproduced a published analysis, would that satisfy you that the work was reproducible, or would you have to independently reproduce the results for yourself? If you had to choose someone to reproduce an analysis for you (not including yourself), who would it be?

This idea is embedded in the reproducible research policy at Biostatistics, but of course we make the data and code available too. There, a (hopefully) trusted third party (the Associate Editor for Reproducibility) reproduces the analysis and confirms that the code was runnable (at least at that moment in time). 

It’s important to point out that reproducible research is not only about correctness and prevention of errors. It’s also about making research results available to others so that they may more easily build on the work. However, preventing errors is an important part and the question is then what is the best way to do that? Can we generate a reproducibility checklist?

03 Dec

Citizen science makes statistical literacy critical

In today’s Wall Street Journal, Amy Marcus has a piece on the Citizen Science movement, focusing on citizen science in health in particular. I am fully in support of this enthusiasm and a big fan of citizen science - if done properly. There have already been some pretty big success stories. As more companies like Fitbit and 23andMe spring up, it is really easy to collect data about yourself (right Chris?). At the same time organizations like Patients Like Me make it possible for people with specific diseases or experiences to self-organize. 

But the thing that struck me the most in reading the article is the importance of statistical literacy for citizen scientists, reporters, and anyone reading these articles. For example, the article says:

The questions that most people have about their DNA—such as what health risks they face and how to prevent them—aren’t always in sync with the approach taken by pharmaceutical and academic researchers, who don’t usually share any potentially life-saving findings with the patients.

I think it’s pretty unlikely that any organization would hide life-saving findings from the public. My impression from reading the article is that this statement refers to keeping results blinded from patients and doctors during an experiment or clinical trial. Blinding is a critical component of clinical trials that reduces many potential sources of bias in the results of a study. Obviously, once the trial/study has ended (or been stopped early because a treatment is effective), the results are quickly disseminated.

Several key statistical issues are then raised in bullet-point form without discussion: 

Amateurs may not collect data rigorously, they say, and may draw conclusions from sample sizes that are too small to yield statistically reliable results. 

Having individuals collect their own data poses other issues. Patients may enter data only when they are motivated, or feeling well, rendering the data useless. In traditional studies, both doctors and patients are typically kept blind as to who is getting a drug and who is taking a placebo, so as not to skew how either group perceives the patients’ progress.

The article goes on to describe an anecdotal example of citizen science - which suffers from a key statistical problem (small sample size):

Last year, Ms. Swan helped to run a small trial to test what type of vitamin B people with a certain gene should take to lower their levels of homocysteine, an amino acid connected to heart-disease risk. (The gene affects the body’s ability to metabolize B vitamins.)

Seven people—one in Japan and six, including herself, in her local area—paid around $300 each to buy two forms of vitamin B and Centrum, which they took in two-week periods followed by two-week “wash-out” periods with no vitamins at all.

The article points out the issue:

The scientists clapped politely at the end of Ms. Swan’s presentation, but during the question-and-answer session, one stood up and said that the data was not statistically significant—and it could be harmful if patients built their own regimens based on the results.

But it doesn’t carefully explain the importance of sample size, suggesting instead that the only reason you need more people is to “insure better accuracy”.
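To make the sample size point concrete, here is a quick back-of-the-envelope power calculation in R. Suppose the vitamin regimen truly lowers homocysteine by half a standard deviation (an assumed effect size, purely for illustration), and suppose, generously, that we had 7 people in each arm of a two-group comparison:

    ## Power of a two-sample t-test with 7 people per group to detect a
    ## half standard deviation drop in homocysteine (assumed effect size)
    power.t.test(n = 7, delta = 0.5, sd = 1, sig.level = 0.05)$power
    ## about 0.13 -- a real effect would be missed almost 90% of the time

    ## Sample size per group needed for 80% power at the same effect size
    power.t.test(power = 0.8, delta = 0.5, sd = 1, sig.level = 0.05)$n
    ## about 64 per group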

It strikes me that statistical literacy is critical if the citizen science movement is going to go forward. Ideas like experimental design, randomization, blinding, placebos, and sample size need to be in the toolbox of any practicing citizen scientist. 

One major drawback is that there are very few places where the general public can learn about statistics. Mostly, statistics is taught in university courses. Resources like the Khan Academy and the Cartoon Guide to Statistics exist, but they are only really useful if you are self-motivated and have some background in math/statistics to begin with.

Since knowledge of basic statistical concepts is quickly becoming indispensable for citizen science or even basic life choices like deciding on healthcare options, do we need “adult statistical literacy courses”? These courses could focus on the basics of experimental design and how to understand results in stories about science in the popular press. It feels like it might be time to add a basic understanding of statistics and data to reading/writing/arithmetic as critical life skills. I’m not the only one who thinks so.


03 Dec

Reverse scooping

I would like to define a new term: reverse scooping is when someone publishes your idea after you, and doesn’t cite you. It has happened to me a few times. What does one do? I usually send a polite message to the authors with a link to my related paper(s). These emails are usually ignored, but not always. Most times I don’t think it is malicious though. In fact, I almost reverse scooped a colleague recently.  People arrive at the same idea a few months (or years) later and there is just too much literature to keep track-off. And remember the culprit authors were not the only ones that missed your paper, the referees and associate editor missed it as well. One thing I have learned is that if you want to claim an idea, try to include it in the title or abstract as very few papers get read cover-to-cover.