Simply Statistics


Battling Bad Science

Here is a pretty awesome TED talk by epidemiologist Ben Goldacre where he highlights how science can be used to deceive/mislead. It’s sort of like epidemiology 101 in 15 minutes. 

This seems like a highly topical talk. Over on his blog, Steven Salzberg has pointed out that Dr. Oz has recently been engaging in some of these shady practices on his show. Too bad he didn’t check out the video first. 

In the comments section of the TED talk, one viewer points out that Dr. Goldacre doesn’t talk about the role of the FDA and other regulatory agencies. I think that regulatory agencies are under-appreciated and deserve credit for addressing many of these potential problems in the conduct of clinical trials. 

Maybe there should be an agency regulating how science is reported in the news? 


Why does Obama need statisticians?

It’s worth following up a little on why the Obama campaign is recruiting statisticians (note to Karen: I am not looking for a new job!). Here’s the blurb for the position of “Statistical Modeling Analyst”:

The Obama for America Analytics Department analyzes the campaign’s data to guide election strategy and develop quantitative, actionable insights that drive our decision-making. Our team’s products help direct work on the ground, online and on the air. We are a multi-disciplinary team of statisticians, mathematicians, software developers, general analysts and organizers - all striving for a single goal: re-electing President Obama. We are looking for staff at all levels to join our department from now through Election Day 2012 at our Chicago, IL headquarters.

Statistical Modeling Analysts are charged with predicting electoral outcomes using statistical models. These models will be instrumental in helping the campaign determine how to most effectively use its resources.

I wonder if there’s a bonus for predicting the correct outcome, win or lose?

The Obama campaign didn’t invent the idea of heavy data analysis in campaigns, but they seem to be heavy adopters. There are 3 openings in the “Analytics” category as of today.

Now, can someone tell me why they don’t just call it simply “Statistics”?


Kindle Fire and Machine Learning

Amazon released its new iPad competitor, the Kindle Fire, today. A quick read through the description shows it has some interesting features, including a custom-built web browser called Silk. One claimed innovation is that the browser works in conjunction with Amazon’s EC2 cloud computing platform to speed up the web-surfing experience by doing some computing on your end and some on their end. Seems cool, if it really does make things faster.

Also there’s this interesting bit:

Machine Learning

Finally, Silk leverages the collaborative filtering techniques and machine learning algorithms Amazon has built over the last 15 years to power features such as “customers who bought this also bought…” As Silk serves up millions of page views every day, it learns more about the individual sites it renders and where users go next. By observing the aggregate traffic patterns on various web sites, it refines its heuristics, allowing for accurate predictions of the next page request. For example, Silk might observe that 85 percent of visitors to a leading news site next click on that site’s top headline. With that knowledge, EC2 and Silk together make intelligent decisions about pre-pushing content to the Kindle Fire. As a result, the next page a Kindle Fire customer is likely to visit will already be available locally in the device cache, enabling instant rendering to the screen.

That seems like a logical thing for Amazon to do. While the idea of pre-fetching pages is not particularly new, I haven’t yet heard of doing data analysis on web traffic to predict which things to pre-fetch. One issue this raises in my mind is that, in order to do this, Amazon needs to combine information across browsers, which means your surfing habits will become part of one large mega-dataset. Is that what we want?
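For concreteness, here is a minimal sketch (with made-up page names) of the kind of aggregate next-page prediction the Silk description suggests: pool (page, next page) clicks across users, then pre-fetch the most common successor when its observed share is large enough. This is only an illustration of the idea, not Amazon’s actual algorithm.

```python
from collections import Counter, defaultdict

# Aggregate click log: (current_page, next_page) pairs pooled across users.
clicks = [
    ("news.com", "news.com/top-story"),
    ("news.com", "news.com/top-story"),
    ("news.com", "news.com/sports"),
    ("news.com", "news.com/top-story"),
]

# Tally where visitors of each page go next.
transitions = defaultdict(Counter)
for page, nxt in clicks:
    transitions[page][nxt] += 1

def prefetch_candidate(page, threshold=0.5):
    """Return the most likely next page if its observed share exceeds threshold."""
    counts = transitions[page]
    total = sum(counts.values())
    if total == 0:
        return None
    nxt, n = counts.most_common(1)[0]
    return nxt if n / total >= threshold else None

print(prefetch_candidate("news.com"))  # news.com/top-story (3 of 4 clicks)
```

In Amazon’s example, a top headline drawing 85 percent of next clicks would clear any reasonable threshold, so EC2 could push that page to the device cache ahead of time.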

On the one hand, Amazon already does some form of this by keeping track of what you buy. But keeping track of every web page you go to and every link you click seems like a much wider scope.


Once in a lifetime collapse

Baseball Prospectus uses Monte Carlo simulation to predict which teams will make the postseason. According to this page, on Sept 1st the probability of the Red Sox making the playoffs was 99.5%. They were ahead of the Tampa Bay Rays by 9 games. Before last night’s game, the Red Sox had lost 19 of 26 games in September and were tied with the Rays for the wild card (the last spot for the playoffs). To make this event even more improbable, the Red Sox were up by one in the ninth with two outs and no one on for the last place Orioles. In this situation the team that’s winning wins more than 95% of the time. The Rays were in exactly the same situation as the Orioles, losing to the first place Yankees (well, their subs). So guess what happened? The Red Sox lost, the Rays won. But perhaps the most amazing event is that these two games, both lasting much longer than usual (one due to rain, the other to extra innings), ended within seconds of each other.
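Using the round numbers above, a quick Monte Carlo sketch (in the spirit of the Baseball Prospectus simulations) shows how rare it is for both ninth-inning leads to collapse on the same night. The 95% hold rate comes from the post; independence of the two games is an assumption for illustration.

```python
import random

random.seed(1)
trials = 1_000_000
hold = 0.95  # rough chance a team up one in the ninth, two outs, none on, hangs on

# Count nights on which both leading teams blow their one-run leads.
both_flip = sum(
    1 for _ in range(trials)
    if random.random() > hold      # Red Sox blow their lead to the Orioles
    and random.random() > hold     # Yankees blow theirs to the Rays
) / trials
print(both_flip)  # should be close to (1 - 0.95) ** 2 = 0.0025
```

So even taking the games as independent coin flips, the double collapse was roughly a 1-in-400 event, before even accounting for the 99.5% playoff odds the Red Sox held on Sept 1st.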

Update: Nate Silver beat me to it. And has much more!


The Open Data Movement

I’m not sure which of the categories this infographic on open data falls into, but I find it pretty exciting anyway. It shows the rise of APIs and how data are increasingly open. It seems like APIs are all over the place in the web development community, but less so in health statistics. Although, in the comments, John M. lists places to find free government data, including some health data:

1) CDC’s National Center for Health Statistics
2) NHANES (National Health and Nutrition Examination Survey)
3) National Health Interview Survey
4) World Health Organization
5) US Census Bureau
6) Emory maintains a repository of links related to stats/biostat, including online databases


The future of graduate education

Stanford is offering a free online course and more than 100,000 students have registered. This got the blogosphere talking about the future of universities. Matt Yglesias thinks that “colleges are the next newspaper and are destined for some very uncomfortable adjustments”. Tyler Cowen reminded us that since 2003 he has been saying that professors are becoming obsolete. His main point is that thanks to the internet, the need for lecturers will greatly diminish. He goes on to predict that

the market was moving towards superstar teachers, who teach hundreds at a time or even thousands online. Today, we have the Khan Academy, a huge increase in online education, electronic textbooks and peer grading systems and highly successful superstar teachers with Michael Sandel and his popular course Justice, serving as example number one.

I think this is particularly true for stat and biostat graduate programs, especially in hard money environments.

A typical Statistics department will admit five to ten PhD students. In most departments we teach probability theory, statistical theory, and applied statistics. Highly paid professors teach these three courses for these five to ten students, which means that the university ends up spending hundreds of thousands of dollars on them. Where does this money come from? From those that teach hundreds at a time. The stat 101 courses are full of tuition-paying students. These students are subsidizing the teaching of our graduate courses. In hard money institutions, they are also subsidizing some of the research conducted by the professors that teach the small graduate courses. Note that 75% of their salaries are covered by the university, yet they are expected to spend much less than 75% of their time preparing and teaching these relatively tiny classes. The leftover time they spend on research for which they have no external funding. This isn’t a bad thing, as a lot of good theoretical and basic knowledge has been created this way. However, outside pressure to lower tuition costs has university administrators looking for ways to save, and graduate education might be a target. “If you want to teach a class, fill it up with 50 students. If you want to do research, get a grant,” the administrator might say.

Note that, for example, the stat theory class is pretty much the same every year and across universities. So we can pick a couple of superstar stat theory teachers and have them lead an online course for all the stat and biostat graduate students in the world. Then each department hires an energetic instructor, paying him/her 1/4 of what they pay a tenured professor, to sit in a room discussing the online lectures with the five to ten PhD students in the program. Currently there are no incentives for the tenured professor to teach well, but the instructor would be rewarded solely on teaching performance. Not only does this scheme cut costs, but it can also increase revenue, as faculty will have more time to write grant proposals.

So, with teaching out of the equation, why even have departments? Well, for now the internet can’t substitute for the one-on-one interactions needed during PhD thesis supervision. As long as NIH and NSF are around, research faculty will be around. The apprenticeship system that has worked for centuries will survive the uncomfortable adjustments that are coming. Special topic seminars will also survive, as faculty will use them as part of their research agenda. Rotations, similar to those implemented in Biology programs, can serve as matchmakers between professors and students. But classroom teaching is due for some “uncomfortable adjustments”.

I agree with Tyler Cowen and Matt Yglesias: the number of cushy professor jobs per department will drop dramatically in the future, especially in hard money institutions. So let’s get ready. Maybe Biostat departments should start planning for the future now. Harvard, Seattle, Michigan, Emory, etc., want to teach stat theory with us?

PS -  I suspect this all applies to liberal arts and hard science graduate programs.


The p>0.05 journal

I want to start a journal called “P>0.05”. This journal will publish all the negative results in science. These would also be stored in a database. Think of all the great things we could do with this. We could, for example, plot p-value histograms for different disciplines. I bet most would have a flat distribution. We could also do it by specific association. A paper comes out saying chocolate is linked to weaker bones? Check the histogram and keep eating chocolate. Any publishers interested? 
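As a quick illustration of why most of those histograms should be flat: when the null hypothesis is true, p-values are uniformly distributed on [0, 1]. A small simulation of a two-sided z-test on null data (the sample size and number of repetitions here are arbitrary) shows roughly 10% of p-values landing in each decile.

```python
import random
import math

random.seed(0)

def two_sided_p(n=30):
    """p-value of a z-test on the mean of n draws from N(0, 1): the null is true."""
    z = sum(random.gauss(0, 1) for _ in range(n)) / math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability

pvals = [two_sided_p() for _ in range(10_000)]

# Under the null, each decile of [0, 1] should hold about 10% of the p-values.
deciles = [sum(1 for p in pvals if i / 10 <= p < (i + 1) / 10) for i in range(10)]
print(deciles)
```

A histogram of p-values from a real literature that deviates from flat, say a spike of reported effects just below 0.05 with the rest missing, is exactly the kind of pattern a P>0.05 journal would make visible.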


Some cool papers

  1. A cool article on the regulator’s dilemma. It turns out that the risk profile that best prevents one bank from failing is not the one that best prevents all banks from failing. 
  2. Persistence of web resources for computational biology. I think this one is particularly relevant for academic statisticians since a lot of academic software/packages are developed by graduate students. Once they move on, a large chunk of “institutional knowledge” is lost. 
  3. Are private schools better than public schools? A quote from the paper: “Indeed when comparing the average score in the two types of schools after adjusting for the enrollment effects, we find quite surprisingly that public schools perform better on average.”

“Unoriginal genius”

“The world is full of texts, more or less interesting; I do not wish to add any more”

This quote is from an article in the Chronicle Review. I highly recommend reading the article, particularly check out the section on the author’s “Uncreative writing” class at UPenn. The article is about how there is a trend in literature toward combining/using other people’s words to create new content. 

The prominent literary critic Marjorie Perloff has recently begun using the term “unoriginal genius” to describe this tendency emerging in literature. Her idea is that, because of changes brought on by technology and the Internet, our notion of the genius—a romantic, isolated figure—is outdated. An updated notion of genius would have to center around one’s mastery of information and its dissemination. Perloff has coined another term, “moving information,” to signify both the act of pushing language around as well as the act of being emotionally moved by that process. She posits that today’s writer resembles more a programmer than a tortured genius, brilliantly conceptualizing, constructing, executing, and maintaining a writing machine.

It is fascinating to see this happening in the world of literature; a similar trend seems to be happening in statistics. A ton of exciting and interesting work is done by people combining known ideas and tools and applying them to new problems. I wonder if we need a new definition of “creative”?