Simply Statistics


Sunday Data/Statistics Link Roundup (9/16/12)

  1. There has been a lot of talk about the Michael Lewis (of Moneyball fame) profile of Obama in Vanity Fair. One interesting quote I think deserves a lot more discussion is: “On top of all of this, after you have made your decision, you need to feign total certainty about it. People being led do not want to think probabilistically.” This is a key issue that is only going to get worse going forward. All of public policy is probabilistic - we are even moving to clinical trials to evaluate public policy.
  2. It’s sort of amazing to me that I hadn’t heard about this before, but a UC Davis professor was threatened for discussing the reasons PSA screening may be overused. This same issue keeps coming up over and over - screening healthy populations for rare diseases is often not effective (you need a ridiculously high specificity or a treatment with almost no side effects). What we need is John McGready to do a claymation public service video or something explaining the reasons screening might not be a good idea to the general public. 
  3. A bleg - I sometimes have a good week finding links myself and there are a few folks who regularly send links (Andrew J., Alex N., etc.). I’d love it if people would send me cool links when they see them with the email title “Sunday Links” - I’m sure there is more cool stuff out there.
  4. The ICSB has a competition to improve the coverage of computational biology on Wikipedia. Someone should write a surrogate variable analysis or robust multiarray average article. 
  5. I had not heard of the ASA’s Stattrak until this week; it looks like there are some useful resources there for early career statisticians. With the onset of fall, it is closing in on a new recruiting season. If you are a postdoc/student on the job market and you haven’t read Rafa’s post on soft vs. hard money, now is the time to start brushing up! Stay tuned for more job market posts this fall from Simply Statistics.
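
A quick aside on the screening point in item 2: the need for a very high specificity falls straight out of Bayes' rule. Here is a minimal sketch, with made-up numbers (a 1% prevalence and a 90% sensitivity are assumptions for illustration, not figures for PSA or any real test), of how the chance that a positive result is a true positive changes with specificity.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test), computed with Bayes' rule."""
    true_pos = prevalence * sensitivity                    # diseased and test positive
    false_pos = (1.0 - prevalence) * (1.0 - specificity)   # healthy but test positive
    return true_pos / (true_pos + false_pos)


# Hypothetical values for illustration only (not estimates for any real test).
prevalence = 0.01
sensitivity = 0.90

for specificity in (0.90, 0.99, 0.999):
    ppv = positive_predictive_value(prevalence, sensitivity, specificity)
    print(f"specificity={specificity:.3f}  PPV={ppv:.3f}")
```

Under these assumed numbers, a test that is 90% specific yields mostly false positives (a PPV of roughly 8%), and even 99% specificity leaves about half of the positives wrong. When the disease is rare, the healthy majority generates false positives faster than the sick minority generates true ones.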

The statistical method made me lie

There’s a hubbub brewing over a recent study published in the Annals of Internal Medicine that compares organic food (as in ‘USDA Organic’) to non-organic food. The study, titled “Are Organic Foods Safer or Healthier Than Conventional Alternatives? A Systematic Review”, is a meta-analysis of about 200 previous studies. Their conclusion, which I have cut-and-pasted below, is:

The published literature lacks strong evidence that organic foods are significantly more nutritious than conventional foods. Consumption of organic foods may reduce exposure to pesticide residues and antibiotic-resistant bacteria.

When I first heard about this study on the radio, I thought the conclusion seemed kind of obvious. It’s not clear to me why, for example, an organic carrot would have more calcium than a non-organic carrot. At least, I couldn’t explain the mechanism by which this would happen. However, I would expect that an organic carrot would have less pesticide residue than a non-organic carrot. If not, then the certification isn’t really achieving its goals. Lo and behold, that’s more or less what the study found. I don’t see the controversy.

But there’s a petition titled “Retract the Flawed ‘Organic Study’ Linked to Big Tobacco and Pro-GMO Corps”. It’s quite an interesting read. First, it’s worth noting that the study itself does not list any funding sources. Given that the authors are from Stanford, one could conclude that Stanford therefore funded the study. The petition claims that Stanford has “deep financial ties to Cargill”, a large agribusiness company, but does not get into specifics.

More interesting is that the petition highlights the involvement in the study of Ingram Olkin, a renowned statistician at Stanford. The petition says:

The study was authored by the very many [sic] who invented a method of ‘lying with statistics’. Olkin worked with Stanford University to develop a “multivariate” statistical algorithm, which is essentially a way to lie with statistics.

That’s right, the statistical method made them lie!

The petition is ridiculous. Interestingly, even as the petition claims conflict of interest on the part of the study authors, it seems one of the petition authors, Anthony Gucciardi, is “a natural health advocate, and creator of the health news website NaturalSociety” according to his Twitter page. Go figure. It worries me that people would claim the mere use of statistical methods is sufficient grounds for doubt. It also worries me that 3,386 people (as of this writing) would blindly agree.

By the way, can anyone propose an alternative to “multivariate statistics”? I need to stop all this lying…



An experimental foundation for statistics

I had a recent conversation with Brian (of abstraction fame) about the relationship between mathematics and statistics. Statistics, for historical reasons, has been treated as a mathematical sub-discipline (this is the NSF’s view).

One reason statistics is viewed as a sub-discipline of math is that the foundations of statistics are built on deductive reasoning, where you start with a few general propositions that you assume to be true and then systematically prove more specific results. A similar approach is taken in most mathematical disciplines.

In contrast, scientific disciplines like biology are largely built on the basis of inductive reasoning and the scientific method. Specific individual discoveries are described and used as a framework for building up more general theories and principles. 

So the question Brian and I had was: what if you started over and built statistics from the ground up on the basis of inductive reasoning and experimentation? Instead of making mathematical assumptions and then proving statistical results, you would use experiments to identify core principles. This actually isn’t without precedent in the statistics community. Bill Cleveland and Robert McGill studied how people perceive graphical information and produced some general recommendations about the use of area/linear contrasts, common axes, etc. There has also been a lot of work on experimental understanding of how humans understand uncertainty.

So what if we put statistics on an experimental, rather than a mathematical, foundation? What if we performed experiments to see what kinds of regression models people were able to interpret most clearly, what the best ways to evaluate confounding/outliers were, or what measure of statistical significance people understood best? Basically, what if the “quality” of a statistical method rested not on the mathematics behind the method, but on experimental results demonstrating how people used the method? So, instead of justifying lowess mathematically, we would justify it on the basis of its practical usefulness through specific, controlled experiments. Some of this is already happening when people do surveys of the most successful methods in Kaggle contests or with the MAQC.

I wonder what methods would survive the change in paradigm?


The pebbles of academia

I have just been awarded a certificate for successful completion of the Conflict of Interest Commitment training (I barely passed). Lately, I have been totally swamped by administrative duties and have had little time for actual research. The experience reminded me of something I read in this NYTimes article by Tyler Cowen:

Michael Mandel, an economist with the Progressive Policy Institute, compares government regulation of innovation to the accumulation of pebbles in a stream. At some point too many pebbles block off the water flow, yet no single pebble is to blame for the slowdown. Right now the pebbles are limiting investment in future innovation.

Here are some of the pebbles of my academic career (past and present): financial conflict of interest training, human subjects training, HIPAA training, safety training, ethics training, submitting papers online, filling out copyright forms, faculty meetings, center grant quarterly meetings, 2 hour oral exams, 2 hour thesis committee meetings, big project conference calls, retreats, JSM, anything with “strategic” in the title, admissions committee, affirmative action committee, faculty senate meetings, brown bag lunches, orientations, effort reporting, conflict of interest reporting, progress reports (can’t I just point to PubMed?), dbGaP progress reports, people who ramble at study section, rambling at study section, buying airplane tickets for invited talks, filling out travel expense sheets, and organizing and turning in travel receipts. I know that some of these are somewhat important or take minimal time, but read the quote again.

I also acknowledge that I actually have it real easy compared to others, so I am interested in hearing about other people’s pebbles.

Update: add changing my eRA Commons password to the list!


Sunday Data/Statistics Link Roundup (9/9/12)

  1. Not necessarily statistics related, but pretty appropriate now that the school year is starting. Here is a little introduction to “how to google” (via Andrew J.). Being able to “just google it” and find answers for oneself without having to resort to asking folks is maybe the #1 most useful skill as a statistician. 
  2. A really nice presentation on interactive graphics with the googleVis package. I think one of the most interesting things about the presentation is that it was built with markdown/knitr/slidy (see slide 53). I am seeing more and more of these web-based presentations. I like them for a lot of reasons (ability to incorporate interactive graphics, easy sharing, etc.), although it is still harder than building a PowerPoint. I also wonder, what happens when you are trying to present somewhere that doesn’t have a good internet connection?
  3. We talked a lot about the ENCODE project this week. We had an interview with Steven Salzberg, then Rafa followed it up with a discussion of top-down vs. bottom-up science. Tons of data from the ENCODE project is now available; there is even a virtual machine with all the software used in the main analysis of the data that was just published. But my favorite quote/tweet/comment this week came from Leonid K. about a flawed/over the top piece trying to make a little too much of the ENCODE discoveries: “that’s a clown post, bro”.
  4. Another breathless post from the Chronicle about how there are “dozens of plagiarism cases being reported on Coursera”. Given that tens of thousands of people are taking the course, it would be shocking if there wasn’t plagiarism, but my guess is it is about the same rate you see in in-person classes. I will be using peer grading in my course; hopefully plagiarism software will be in place by then.
  5. A New York Times article about a new book on visualizing data for scientists/engineers. I love all the attention data visualization is getting. I’ll take a look at the book for sure. I bet it says a lot of the same things Tufte said and a lot of the things Nathan Yau says in his book. This one may just be targeted at scientists/engineers. (link via Dan S.)
  6. Edo and co. are putting together a workshop on the analysis of social network data for NIPS in December. If you do this kind of stuff, it should be a pretty awesome crowd, so get your paper in by the Oct. 15th deadline!