Simply Statistics


Obamacare is not going to solve the health care crisis, but a new initiative, led by a statistician, may help

Obamacare may help protect a vulnerable section of our population, but it does nothing to solve the real problem with health care in the US: it is unsustainably expensive and getting worse. In the graph below (left), per capita medical expenditures for several countries are plotted against time. The US is the black curve; other countries are in grey. On the right we see life expectancy plotted against per capita medical expenditure. Note that the US spends $8,000 per person on health care, more than any other country and about 40% more than Norway, the runner-up. If the US spent the same as Norway per person, as a country we would save roughly $1 trillion per year. Despite the massive investment, life expectancy in the US is comparable to Chile's, a country that spends about $1,500 per person. To make matters worse, politicians and pundits greatly oversimplify this problem by trying to blame their favorite villains, while experts agree: no obvious solution exists.

This past Tuesday, Johns Hopkins announced the launch of the Individualized Health Initiative. This effort will be led by Scott Zeger, a statistician and former chair of our department. The graphs and analysis shown above are from a presentation Scott has shared on the web. The initiative’s goal is to “discover, test, and implement health information tools that allow the individual to understand, track, and guide his or her unique health state and its trajectory over time”. In other words, by tailoring treatments and prevention schemes to individuals, we can improve their health more effectively.



So how is this going to help solve the health care crisis? Scott explains that when it comes to health care, Hopkins is a self-contained microcosm: we are the patients (all employees), the providers (hospital and health system), and the insurer (Hopkins is self-insured, we are not insured by for-profit companies). And just like the rest of the country, we spend way too much per person on health care. Now, because we are self-contained, it is much easier for us to try out and evaluate alternative strategies than it is for, say, a state or the federal government. Because we are large, we can gather enough data to learn about relatively small strata. And with a statistician in charge, we will evaluate strategies empirically as opposed to ideologically.  

Furthermore, because we are a university, we also employ economists, public health specialists, ethicists, basic biologists, engineers, biomedical researchers, and other scientists with expertise that seems indispensable to solving this problem. Under Scott’s leadership, I expect Hopkins to collect data more systematically, run well-thought-out experiments to test novel ideas, leverage technology to improve diagnostics, and use existing data to create knowledge. Successful strategies may then be exported to the rest of the country. Part of the new institute’s mission is to incentivize our very creative community of academics to participate in this endeavor.


Motivating statistical projects

It seems like half of the battle in statistics is identifying an important/unsolved problem. In math this is easy: they have a list. So why is it harder for statistics? Since I have to think up projects to work on for my research group, for classes I teach, and for exams we give, I have spent some time thinking about the ways research problems in statistics arise.

I borrowed a page out of Roger’s book and made a little diagram to illustrate my ideas (actually I can’t even claim credit, it was Roger’s idea to make the diagram). The diagram shows the rough relationship of science, data, applied statistics, and theoretical statistics. Science produces data (although there are other sources), the data are analyzed using applied statistical methods, and theoretical statistics concerns the math behind statistical methods. The dotted line indicates that theoretical statistics ostensibly generalizes applied statistical methods so they can be applied in other disciplines. I do think that this type of generalization is becoming harder and harder as theoretical statistics becomes farther and farther removed from the underlying science.

Based on this diagram I see three major sources for statistical problems: 

  1. Theoretical statistical problems. One component of statistics is developing the mathematical and foundational theory that proves we are doing sensible things. This type of problem is often inspired by popular methods that exist or are being developed but lack mathematical detail. Not surprisingly, much of the work in this area is motivated by what is mathematically possible or convenient, rather than by concrete questions that concern the scientific community. This work is important, but the current distance between theoretical statistics and science suggests that its impact will be limited primarily to the theoretical statistics community. 
  2. Applied statistics problems motivated by convenient sources of data. The best examples of this type of problem are the analyses in Freakonomics. Since both big data and small big data are now abundant, anyone with a laptop and an internet connection can download the Google n-gram data, a microarray from GEO, data about your city, or really data about anything, and perform an applied analysis. These analyses may not be straightforward for computational/statistical reasons and may even require the development of new methods. These problems are often very interesting/clever, and so are often the types of analyses you hear about in newspaper articles about “Big Data”. But they may often be misleading or incorrect, since the underlying questions are not necessarily well grounded in science. 
  3. Applied statistics problems motivated by scientific problems. The final category of statistics problems comprises those motivated by concrete scientific questions. The new sources of big data don’t necessarily make these problems any easier. They still start with a specific question for which the data may not be convenient and the math is often intractable. But the potential impact of solving a concrete scientific problem is huge, especially if many people who are generating data have a similar problem. Some examples of problems like this are: can we tell if one batch of beer is better than another, how are quantitative characteristics inherited from parent to child, which treatment is better when some people are censored, how do we estimate variance when we don’t know the distribution of the data, or how do we know which variable is important when we have millions?
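The first example in that list is the classic problem that led W. S. Gosset (“Student”), working at Guinness, to the t-test. Here is a minimal sketch in Python; the quality scores below are entirely made up for illustration:

```python
# Hedged sketch: "can we tell if one batch of beer is better than another?"
# The data are simulated; in Gosset's day the challenge was drawing
# inferences from very small samples, which led to Student's t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
batch_a = rng.normal(5.0, 0.3, size=10)  # hypothetical quality scores
batch_b = rng.normal(5.4, 0.3, size=10)

t_stat, p_value = stats.ttest_ind(batch_a, batch_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

With only ten observations per batch, normal approximations are unreliable, which is exactly the small-sample regime the t distribution was invented for.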

So this leads back to the question: what are the biggest open problems in statistics? I would define these as the “high potential impact” problems from category 3. To answer this question, I think we need to ask ourselves: what are the most common problems people are trying to solve with data but can’t with what is available right now? Roger nailed this when he talked about the role of statisticians in the science club.

Here are a few ideas that could potentially turn into high-impact statistical problems, maybe our readers can think of better ones?

  1. How do we credential students taking online courses at a huge scale?
  2. How do we communicate risk about personalized medicine (or anything else) to a general population without statistical training? 
  3. Can you use social media as a preventative health tool?
  4. Can we perform randomized trials to improve public policy?

Image Credits: The Science Logo is the old logo for the USU College of Science, the R is the logo for the R statistical programming language, the data image is a screenshot of Gapminder, and the theoretical statistics image comes from the Wikipedia page on the law of large numbers.

Edit: I just noticed this paper, which seems to support some of the discussion above. On the other hand, I think just saying lots of equations = fewer citations falls into category 2 and doesn’t get at the heart of the problem. 

The price of skepticism

Thanks to John Cook for posting this:

“If you’re only skeptical, then no new ideas make it through to you. You never can learn anything. You become a crotchety misanthrope convinced that nonsense is ruling the world.” – Carl Sagan


Follow up on "Statistics and the Science Club"

I agree with Roger’s latest post: “we need to expand the tent of statistics and include people who are using their statistical training to lead the new science”. I am perhaps a bit more worried than Roger. Specifically, I worry that talented go-getters interested in leading science via data analysis will achieve this without engaging our research community. 

A quantitatively trained person (an engineer, computer scientist, physicist, etc.) with strong computing skills (knows Python, C, and shell scripting), who reads, for example, “Elements of Statistical Learning” and learns R, is well on their way. Eventually, many of these users of statistics will become developers, and if we don’t keep up, then what do they need from us? Our already-written books may be enough. In fact, in genomics, I know several people like this who are already developing novel statistical methods. I want these researchers to be part of our academic departments. Otherwise, I fear we will not be in touch with the problems and data that lead to, quoting Roger, “the most exciting developments of our lifetime.” 


The problem with small big data

There’s lots of talk about “big data” these days and I think that’s great. I think it’s bringing statistics out into the mainstream (even if they don’t call it statistics) and it’s creating lots of opportunities for people with statistics training. It’s one of the reasons we created this blog.

One thing that I think gets missed in much of the mainstream reporting is that, in my opinion, the biggest problems aren’t with the truly massive datasets out there that need to be mined for important information. Sure, those types of problems pose interesting challenges with respect to hardware infrastructure and algorithm design.

I think a bigger problem is what I call “small big data”. Small big data is the dataset that is collected by an individual whose data collection skills are far superior to his/her data analysis skills. You can think of the size of the problem as being measured by the ratio of the dataset size to the investigator’s statistical skill level. For someone with no statistical skills, any dataset represents “big data”.

These days, any individual can create a massive dataset with relatively few resources. In some of the work I do, we send people out with portable air pollution monitors that record pollution levels every 5 minutes over a 1-week period. People with fitbits can get highly time-resolved data about their daily movements. A single MRI can produce millions of voxels of data.

One challenge here is that these examples all represent datasets that are large “on paper”. That is, there are a lot of bits to store, but that doesn’t mean there’s a lot of useful information there. For example, I find people are often impressed by data that are collected with very high temporal or spatial resolution. But often you don’t need that level of detail and can get away with coarser resolution over a wider range of scenarios. For example, if you’re interested in changes in air pollution exposure across seasons but you only measure people in the summer, then it doesn’t matter if you measure levels down to the microsecond and produce terabytes of data. Another example might be the idea that sequencing technology doesn’t in fact remove biological variability, no matter how large a dataset it produces.
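To make the resolution point concrete, here is a small hypothetical sketch (the variable names and numbers are invented): a week of 5-minute pollution readings collapses to hourly means with a one-line aggregation, shedding over 90% of the rows while keeping the signal relevant to a seasonal comparison.

```python
# Hypothetical example: downsampling data that is "big on paper" to the
# resolution the scientific question actually needs.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# One week of 5-minute PM2.5 readings: 7 * 24 * 12 = 2016 observations
times = pd.date_range("2012-07-01", periods=7 * 24 * 12, freq="5min")
pm25 = pd.Series(10 + rng.normal(0, 2, size=len(times)), index=times)

hourly = pm25.resample("h").mean()  # 2016 rows -> 168 hourly means
```

The hourly series is a twelfth the size, and for a question about seasonal contrasts even daily means would likely do; the bits saved were never information about the question being asked.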

Another challenge is that the person who collected the data is often not qualified/prepared to analyze it. If the data collector didn’t arrange beforehand to have someone analyze the data, then they’re often stuck. Furthermore, usually the grant that paid for the data collection didn’t budget (enough) for the analysis of the data. The result is that there’s a lot of “small big data” that just sits around unanalyzed. This is an unfortunate circumstance, but in my experience quite common.

One conclusion we can draw is that we need to get more statisticians out into the field, both helping to analyze the data and, perhaps more importantly, designing good studies so that useful data are collected in the first place (as opposed to merely “big” data). But the sad truth is that there aren’t enough of us on the planet to fill the demand. So we need to come up with more creative ways to get the skills out there without requiring our physical presence.


A specific suggestion to help recruit/retain women faculty at Hopkins

A recent article by a former Obama administration official has stirred up debate over the obstacles women face in balancing work/life. This reminded me of this report written by  a committee here at Hopkins to help resolve the current gender-based career obstacles for women faculty. The report is great, but in practice we have a long way to go. For example, my department has not hired a woman at the tenure track level in 15 years. This drought has not been for lack of trying as we have made several offers, but none have been accepted. One issue that has come up multiple times is “spousal hires”. Anecdotal evidence strongly suggests that in academia the “two body” problem is more common with women than men. As hard as my department has tried to find jobs for spouses, efforts are ad-hoc and we get close to no institutional support. As far as I know, as an institution, Hopkins allocates no resources to spousal hires. So, a tangible improvement we could make is changing this. Another specific improvement that many agree will help women is subsidized day care. The waiting list here is very long (as a result few of my colleagues use it) and one still has to pay more than $1,600 a month for infants.

These two suggestions are of course easier said than done as they both require $. Quite a bit, actually, and Hopkins is not rich compared to other well-known universities. My suggestion is to get rid of the college tuition remission benefit for faculty. Hopkins covers half the college tuition for the children of all their employees. This perk helps male faculty in their 50s much more than it helps potential female recruits. So I say get rid of this benefit and use the $ for spousal hires and to further subsidize childcare.

It might be argued the tuition remission perk helps retain faculty, but the institution can invest in that retention on a case-by-case basis as opposed to giving the subsidy to everybody independent of merit. I suspect spousal hires and subsidized day care will be more attractive at the time of recruitment. 

Although this post is Hopkins-specific I am sure similar reallocation of funds is possible in other universities.


Sunday data/statistics link roundup (6/24)

  1. We’ve got a new domain! You can still follow us on tumblr or here:
  2. A cool article on MIT’s annual sports statistics conference (via @storeylab). I love how the guy they chose to highlight created what I would consider a pretty simple visualization with known tools - but it turns out it is potentially a really new way of evaluating the shooting range of basketball players. This is my favorite kind of creativity in statistics.
  3. This is an interesting article calling higher education a “credentials cartel”. I don’t know if I’d go quite that far; there are a lot of really good reasons for higher education institutions beyond credentialing like research, putting smart students together in classes and dorms, broadening experiences etc. But I still think there is room for a smart group of statisticians/computer scientists to solve the credentialing problem on a big scale and have a huge impact on the education industry. 
  4. Check out John Cook’s conjecture on statistical methods that get used: “The probability of a method being used drops by at least a factor of 2 for every parameter that has to be determined by trial-and-error.” I’m with you. I wonder if there is a corollary related to how easy the documentation is to read? 
  5. If you haven’t read Roger’s post on Statistics and the Science Club, I consider it a must-read for anyone who is affiliated with a statistics/biostatistics department. We’ve had feedback by email/on twitter from other folks who are moving toward a more science oriented statistical culture. We’d love to hear from more folks with this same attitude/inclination/approach. 

Statistics and the Science Club

One of my favorite movies is Woody Allen’s Annie Hall. If you’re my age and you haven’t seen it, I usually tell people it’s like When Harry Met Sally, except really good. The movie opens with Woody Allen’s character Alvy Singer explaining that he would “never want to belong to any club that would have someone like me for a member”, a quotation he attributes to Groucho Marx (or Freud).

Last week I posted a link to ASA President Robert Rodriguez’s column in Amstat News about big data. In the post I asked what was wrong with the column and there were a few good comments from readers. In particular, Alex wrote:

When discussing what statisticians need to learn, he focuses on technological changes (distributed computing, Hadoop, etc.) and the use of unstructured text data. However, Big Data requires a change in perspective for many statisticians. Models must expand to address the levels of complexity that massive datasets can reveal, and many standard techniques are limited in utility.

I agree with this, but I don’t think it goes nearly far enough. 

The key element missing from the column was the notion that statistics should take a  leadership role in this area. I was disappointed by the lack of a more expansive vision displayed by the ASA President and the ASA’s unwillingness to claim a leadership position for the field. Despite the name “big data”, big data is really about statistics and statisticians should really be out in front of the field. We should not be observing what is going on and adapting to it by learning some new technologies or computing techniques. If we do that, then as a field we are just leading from behind. Rather, we should be defining what is important and should be driving the field from both an educational and research standpoint.

However, the new era of big data poses a serious dilemma for the statistics community that needs to be addressed before real progress can be made, and that’s what brings me to Alvy Singer’s conundrum.

There’s a strong tradition in statistics of being the “outsiders” to whatever field we’re applying our methods to. In many cases, we are the outsiders to scientific investigation. Even if we are neck deep in collaborating with scientists and being involved in scientific work, we still maintain our ability to criticize and judge scientists because we are “outsiders” trained in a different set of (important) skills. In many ways, this is a Good Thing. The outsider status is important because it gives us the freedom to be “arbiters” and to ensure that scientists are doing the “right” things. It’s our job to keep people honest. However, being an arbiter by definition means that you are merely observing what is going on. You cannot be leading what is going on without losing your ability to arbitrate in an unbiased manner.

Big data poses a challenge to this long-standing tradition because all of a sudden statistics and science are more intertwined than ever before, and statistical methodology is absolutely critical to making inferences or gaining insight from data. Because there are now data in more places than ever before, the demand for statistics is in more places than ever before. We are discovering that we can either teach people to apply statistical methods to their data, or we can just do it ourselves!

This development presents an enormous opportunity for statisticians to play a new leadership role in scientific investigations because we have the skills to extract information from the data that no one else has (at least for the moment). But now we have to choose between being “in the club” by leading the science or remaining outside the club to be unbiased arbiters. I think as an individual it’s very difficult to be both simply because there are only 24 hours in the day. It takes an enormous amount of time to learn the scientific background required to lead scientific investigations and this is piled on top of whatever statistical training you receive.

However, I think as a field, we desperately need to promote both kinds of people, if only because we are the best people for the job. We need to expand the tent of statistics and include people who are using their statistical training to lead the new science. They may not be publishing papers in the Annals of Statistics or in JASA, but they are statisticians. If we do not move more in this direction, we risk missing out on one of the most exciting developments of our lifetime.