20 Jan

Gorging ourselves on "free" health care: Harvard's dilemma

Editor's note: This is a guest post by Laura Hatfield. Laura is an Assistant Professor of Health Care Policy at Harvard Medical School, with a specialty in biostatistics. Her work focuses on understanding trade-offs and relationships among health outcomes. Dr. Hatfield received her BS in genetics from Iowa State University and her PhD in biostatistics from the University of Minnesota. She tweets at @bioannie.

I didn’t imagine when I joined Harvard’s Department of Health Care Policy that the New York Times would be writing about my benefits package. Then a vocal and aggrieved group of faculty rebelled against health benefits changes for 2015, and commentators responded by gleefully skewering entitled-sounding Harvard professors. But I’m a statistician, so I want to talk data.

Health care spending is tremendously right-skewed. The figure below shows the annual spending distribution among people with any spending (~80% of the total population) in two data sources on people covered by employer-sponsored insurance, such as the Harvard faculty. Notice that the y-axis is on the log scale. More than half of people spend $1,000 or less, but a few very unfortunate folks top out near half a million.

[Figure: spending_distribution — annual health care spending among people with employer-sponsored insurance]

Source: Measuring health care costs of individuals with employer-sponsored health insurance in the US: A comparison of survey and claims data. A. Aizcorbe, E. Liebman, S. Pack, D.M. Cutler, M.E. Chernew, A.B. Rosen. BEA working paper. WP2010-06. June 2010.
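
To get a feel for the shape of a distribution like this, here is a toy simulation in R. The log-normal form and all parameters are my invention for illustration; they are not fit to the data above.

```r
# Toy illustration (not the actual claims data): health spending is often
# modeled as roughly log-normal, which reproduces the pattern in the figure:
# most people spend little, a few spend enormously.
set.seed(1)
spend <- rlnorm(1e5, meanlog = log(1000), sdlog = 1.6)
median(spend)           # ~$1,000: more than half spend $1,000 or less
quantile(spend, 0.99)   # the long right tail of very large spenders
hist(log10(spend), breaks = 50,
     main = "Simulated annual spending", xlab = "log10(dollars)")
```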

If, instead of contributing to my premiums, Harvard gave me that ~$1,000/month premium contribution in the form of wages, I would be on the hook for my own health care expenses. If I stay healthy, I pocket the money, minus income taxes. If I get sick, I have the extra money available to cover the expenses…provided I’m not one of the unlucky 10% of people spending more than $12,000/year. In that case, the additional wages would be insufficient to cover my health care expenses. This “every woman for herself” system lacks the key benefit of insurance: risk pooling. The sickest among us would be bankrupted by health costs. Another good reason for an employer to give me benefits is that I do not pay taxes on this part of my compensation (more on that later).

At the opposite end of the spectrum is the Harvard faculty health insurance plan. Last year, the university paid ~$1,030/month toward my premium and I put in ~$425 (tax-free). In exchange for this ~$17,000 in premiums, my family got first-dollar insurance coverage with very low co-pays. Faculty contributions to our collective health care expenses were distributed fairly evenly among all of us, with only minimal cost sharing to reflect how much care each person consumed. The sickest among us were in no financial peril. My family didn’t use much care and thus didn’t get our (or Harvard’s) money’s worth for all that coverage, but I’m OK with it. I still prefer risk pooling.

Here’s the problem: moral hazard. It’s a term I learned when I started hanging out with health economists. It describes the tendency of people to over-consume goods that feel free, such as health care paid for through premiums or desserts at an all-you-can-eat buffet. Just look at this array: how much cake do *you* want to eat for $9.99?

[Photo: all-you-can-eat dessert buffet]

Source: https://www.flickr.com/photos/jimmonk/5687939526/in/photostream/

One way to mitigate moral hazard is to expose people to more of the cost of their care at the point of service instead of through premiums. You might think twice about that fifth tiny cake if you were paying per morsel. This is what the new Harvard faculty plans do: our premiums actually go down, but now we face a modest deductible of $250 per person (up to $750 for a family). This is meant to encourage faculty to use their health care more efficiently, while still affording good protection against catastrophic costs. The out-of-pocket maximum remains low at $1,500 per individual or $4,500 per family, and Harvard recently announced a reimbursement program to further protect individuals who pay more than 3% of salary in out-of-pocket health costs.
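
For concreteness, here is a minimal sketch in R of how a plan like this maps total spending to out-of-pocket cost. The $250 deductible and $1,500 out-of-pocket maximum come from the post; the 10% coinsurance rate between those two points is my assumption, purely for illustration.

```r
# Minimal sketch of per-person cost sharing under the new plan (assumed
# 10% coinsurance between the deductible and the out-of-pocket max).
out_of_pocket <- function(spending, deductible = 250,
                          coinsurance = 0.10, oop_max = 1500) {
  below <- pmin(spending, deductible)                    # paid in full
  above <- coinsurance * pmax(spending - deductible, 0)  # partial share
  pmin(below + above, oop_max)                           # capped at the max
}
out_of_pocket(c(100, 1000, 50000))  # returns 100, 325, 1500
```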

The allocation of individuals’ contributions between premiums and point-of-service costs is partly a question of how we cross-subsidize each other. If Harvard’s total contribution remains the same and health care costs do not grow faster than wages (ha!), then increased cost sharing decreases the amount by which people who use less care subsidize those who use more. How you feel about the “right” level of cost sharing may depend on whether you’re paying or receiving a subsidy from your fellow employees. And maybe your political leanings.
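
A toy numeric version of that cross-subsidy point, reusing the hypothetical out_of_pocket() sketch above (both spending levels are invented): with everything paid through premiums, two employees split costs evenly; with point-of-service cost sharing, the low spender pays less and the high spender pays more.

```r
# Two employees pooling costs. Splitting everything through premiums
# costs each person sum(costs) / 2 = $10,250.
costs <- c(low = 500, high = 20000)
oop <- out_of_pocket(costs)       # paid by each at the point of service
premium <- sum(costs - oop) / 2   # the remainder, split evenly as premiums
premium + oop                     # low: $9,637.50; high: $10,862.50
```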

What about the argument that it is better for an employer to “pay” workers with health insurance premium contributions rather than wages because of the tax benefits? While we might prefer to get our compensation as tax-free health benefits rather than taxed wages, the university, like all employers, is looking ahead to the Cadillac tax provision of the ACA, so it has to do some re-balancing of our overall compensation. If Harvard reduces its health insurance contributions to avoid the tax, we might reasonably expect to make up the difference in higher wages. The empirical evidence is complicated, and suggests that employers may not immediately return savings on health benefits dollar-for-dollar in the form of wages.

As far as I can tell, Harvard is contributing roughly the same amount as last year toward my health benefits, but exact numbers are difficult to find. I switched plan types (into a high-deductible plan, but that’s a topic for another post!), so I can’t directly compare Harvard’s contributions to the same plan type this year and last. Peter Ubel argues that if the faculty *had* seen these figures, we might not have revolted. The actuarial value of our plans remains very high (91%, just a bit better than the expensive Platinum plans on the exchanges), and Harvard’s spending on health care has grown from 8% to 12% of the university’s budget over the past few years. Would these data have been sufficient to quell the insurrection? Good question.

16 Jan

If you were going to write a paper about the false discovery rate you should have done it in 2002

People often talk about academic superstars as people who have written highly cited papers. Some of that has to do with people's genius, or ability, or whatever. But one factor that I think sometimes gets lost is luck and timing. So I wrote a little script to get the first 30 papers that appear when you search Google Scholar for each of the following terms (a rough sketch of the scraping step appears just after the list):

  • empirical processes
  • proportional hazards model
  • generalized linear model
  • semiparametric
  • generalized estimating equation
  • false discovery rate
  • microarray statistics
  • lasso shrinkage
  • rna-seq statistics
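
As a rough illustration of the scraping step (the real, admittedly hacky script is linked at the end of this post), something along these lines could pull publication years for the top hits. The .gs_a selector is an assumption about Google Scholar's current markup, and Scholar rate-limits automated requests, so treat this as a sketch.

```r
# Hypothetical sketch: pull the top ~30 Google Scholar hits for a term
# and extract publication years from the author/venue/year byline.
library(rvest)

scholar_years <- function(term, pages = 3) {
  years <- integer(0)
  for (p in seq_len(pages)) {
    url <- paste0("https://scholar.google.com/scholar?start=",
                  (p - 1) * 10, "&q=", URLencode(term))
    meta <- html_text(html_elements(read_html(url), ".gs_a"))
    # grab the first plausible 4-digit year in each result's byline
    hits <- regmatches(meta, regexpr("(19|20)[0-9]{2}", meta))
    years <- c(years, as.integer(hits))
    Sys.sleep(2)  # be polite; Scholar blocks aggressive scrapers
  }
  years
}

median(scholar_years("false discovery rate"))  # median year of top hits
```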

Google Scholar sorts by relevance, but that relevance is driven to a large degree by citations. For example, here are the first 10 papers you get when you search for "false discovery rate":

  • Controlling the false discovery rate: a practical and powerful approach to multiple testing
  • Thresholding of statistical maps in functional neuroimaging using the false discovery rate
  • The control of the false discovery rate in multiple testing under dependency
  • Controlling the false discovery rate in behavior genetics research
  • Identifying differentially expressed genes using false discovery rate controlling procedures
  • The positive false discovery rate: A Bayesian interpretation and the q-value
  • On the adaptive control of the false discovery rate in multiple testing with independent statistics
  • Implementing false discovery rate control: increasing your power
  • Operating characteristics and extensions of the false discovery rate procedure
  • Adaptive linear step-up procedures that control the false discovery rate

People who work in this area will recognize that many of these papers are the most important/most cited in the field.

Now we can make a plot that shows, for each term, when these 30 highest-ranked papers were published. There are some missing values because of the way the data are scraped, but this plot gives you some idea of when the most cited papers on these topics appeared:

[Figure: citations-boxplot — publication years of the top 30 Google Scholar hits, by search term]
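
With the scraped years in hand, a plot along these lines would produce a figure like the one above (a hypothetical plotting step, reusing scholar_years() from the sketch earlier in the post):

```r
library(ggplot2)

terms <- c("empirical processes", "proportional hazards model",
           "generalized linear model", "semiparametric",
           "generalized estimating equation", "false discovery rate",
           "microarray statistics", "lasso shrinkage", "rna-seq statistics")

# one row per scraped paper: the search term and its publication year
years_df <- do.call(rbind, lapply(terms, function(t) {
  data.frame(term = t, year = scholar_years(t))
}))

ggplot(years_df, aes(x = reorder(term, year, FUN = median), y = year)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = NULL, y = "Publication year of top-ranked papers")
```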

You can see from the plot that the median publication year of the top 30 hits for "empirical processes" was 1990 and for "RNA-seq statistics" was 2010. The medians for all of the topics were:

  • Emp. Proc. 1990.241
  • Prop. Haz. 1990.929
  • GLM 1994.433
  • Semi-param. 1994.433
  • GEE 2000.379
  • FDR 2002.760
  • microarray 2003.600
  • lasso 2004.900
  • rna-seq 2010.765

I think this pretty much matches the intuition most people have about the relative timing of these fields, with a few exceptions (GEE in particular seems a bit late). There are a bunch of reasons this analysis isn't perfect, but it does suggest that luck and timing in choosing a problem can play a major role in the "success" of academic work as measured by citations. It also suggests that there are reasons for success in science other than individual brilliance. Given the potentially negative consequences that the expectation of brilliance has on certain subgroups, it is important to recognize the role of timing and luck. The median publication year of the most cited "false discovery rate" papers was 2002, but almost none of the 30 top hits were published after about 2008.

The code for my analysis is here. It is super hacky so have mercy.

15 Jan

How to find the science paper behind a headline when the link is missing

I just saw a pretty wild statistic on Twitter that less than 60% of university news releases link to the papers they are describing.

Before you believe anything you read about science in the news, you need to go find the original article. When the article isn't linked in the press release, sometimes you need to do a bit of sleuthing. Here is an example of how I do it for a news article. The press-release approach is very similar, but you skip the first step described below.

Here is the news article (link):

[Screenshot: the news article]

Step 1: Look for a link to the article

Usually it will be linked near the top or the bottom of the article. In this case, the news article links to the press release about the paper. This is not the original research article; if you haven't reached a scientific journal, you aren't finished. In this case, the press release actually gives the full title of the paper, but according to the statistic above, that will happen less than 60% of the time.

Step 2: Look for the names of the authors, scientific keywords, and the journal name if available

You are going to use these terms to search in a minute. In this case the only two things we have are the journal name:
[Screenshot: the journal name in the press release]

And some keywords:

[Screenshot: keywords in the press release]

Step 3: Use Google Scholar

You could just google those words, and sometimes you'll get the real paper, but often you just end up back at the press release or news article. So instead, the best way to find the article is to go to Google Scholar, then click on the little triangle next to the search box.

[Screenshot: the advanced search menu in Google Scholar]

Fill in what information you have: the same year as the press release, information about the journal and university, and keywords.

[Screenshot: the advanced search form filled in]

Step 4: Victory

Often this will come up with the article you are looking for:

[Screenshot: search results showing the original article]

Unfortunately, the article may be paywalled, so if you don't work at a university or institute with a subscription, you can always tweet the article name with the hashtag #icanhazpdf and your contact info. Then you just have to hope that someone will send it to you (they often do).

12 Jan

Statistics and R for the Life Sciences: New HarvardX course starts January 19

The first course of our Biomedical Data Science online curriculum starts next week. You can sign up here. Instead of relying on mathematical formulas to teach statistical concepts, students can program along as we show computer code for simulations that illustrate the main ideas of exploratory data analysis and statistical inference (p-values, confidence intervals, and power calculations, for example). By doing this, students will learn statistics and R simultaneously and will not be bogged down by having to memorize formulas. We have three types of learning modules: lectures (see picture below), screencasts, and assessments. After each video, students will have the opportunity to assess their understanding through homework involving coding in R. A big improvement over the first version is that we have added dozens of assessments.

Note that this course is the first in an eight-part series on Data Analysis for Genomics. Updates will be provided via Twitter: @rafalab.

[Screenshot: the course page on edX]

07 Jan

Beast mode parenting as shown by my Fitbit data

This weekend was one of those hardcore parenting weekends that any parent of little kids will understand. We were up and actively taking care of kids for a huge fraction of the weekend. (Un)fortunately I was wearing my Fitbit, so I can quantify exactly how little we were sleeping over the weekend.

Here is Saturday:

[Figure: Fitbit sleep record for Saturday]

There you can see that I rocked about midnight-4am without running around chasing a kid or bouncing one to sleep. But Sunday was the real winner:

[Figure: Fitbit sleep record for Sunday]

Check that out. I was totally asleep from like 4am-6am there. Nice.

Stay tuned for much more from my Fitbit data over the next few weeks.


04 Jan

Sunday data/statistics link roundup (1/4/15)

  1. I am digging this visualization of your life in weeks. I might have to go so far as to actually make one for myself.
  2. I'm very excited about the new podcast TalkingMachines, and what an awesome name! I wish someone would do the same thing for applied statistics (Roger?)
  3. I love that they call Ben Goldacre the anti-Dr. Oz in this piece, especially given how often Dr. Oz is telling the truth.
  4. If you haven't read it yet, this piece in the Economist on statisticians during the war is really good.
  5. The arXiv celebrated its 1 millionth paper upload. It costs less to run than the top 2 executives at PLoS make. It is too darn expensive to publish open access right now.

31 Dec

Ugh ... so close to one million page views for 2014

In my last Sunday Links roundup I mentioned we were going to be really close to 1 million page views this year. Chris V. tried to rally the troops:

[Embedded tweet from Chris V.]

but alas we are probably not going to make it (unless by some miracle one of our posts goes viral in the next 12 hours):

[Figure: soclose — 2014 page view count, just shy of one million]

Stay tuned for a bunch of cool new stuff from Simply Stats in 2015, including a new podcasting idea, more interviews, another unconference, and a new plotting theme!

22 Dec

On how meetings and conference calls are disruptive to a data scientist

Editor's note: The week of Xmas Eve is usually my most productive of the year, because email slows down and there are zero meetings (I do take a break, but only after this great week for work). Here is a repost of one of our first entries, explaining how meetings and conference calls are particularly disruptive in data science.

In this TED talk, Jason Fried explains why work doesn't happen at work, describing the evils of meetings. Meetings are particularly disruptive for applied statisticians, especially for those of us who hack data files, explore data for systematic errors, get inspiration from visual inspection, and thoroughly test our code. Why? Before I become productive, I go through a ramp-up/boot-up stage. Scripts need to be found, data loaded into memory, and most importantly, my brain needs to re-familiarize itself with the data and the essence of the problem at hand. I need a similar ramp-up for writing as well. It usually takes me between 15 and 60 minutes before I am in full-productivity mode. But once I am in “the zone”, I become very focused and can stay in this mode for hours. There is nothing worse than interrupting this state of mind to go to a meeting. I lose much more than the hour I spend at the meeting. A short way to explain this is that having 10 separate hours to work adds up to basically nothing, while 10 uninterrupted hours in the zone is when I get stuff done.

Of course, not all meetings are a waste of time. Academic leaders and administrators need to consult and get advice before making important decisions. I find lab meetings very stimulating and, generally, productive: we unstick the stuck and realign the derailed. But before you go and set up a standing meeting, consider this calculation: a weekly one-hour meeting with 20 people translates into 1 hour x 20 people x 52 weeks/year = 1,040 person-hours of potentially lost production per year. Assuming 40-hour weeks, that translates into six months. How many grants, papers, and lectures could we produce in six months? And this does not take into account the non-linear effect described above. Jason Fried suggests you cancel your next meeting, notice that nothing bad happens, and enjoy the extra hour of work.
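
The same back-of-the-envelope calculation, in a few lines of R:

```r
# The standing-meeting arithmetic from the paragraph above
people <- 20
hours  <- 1    # meeting length in hours, once a week
weeks  <- 52
lost <- people * hours * weeks  # 1040 person-hours per year
lost / 40                       # 26 forty-hour weeks: about six months
```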

I know many others who are like me in this regard, and for you I have these recommendations:

  1. Avoid unnecessary meetings, especially if you are already in full-productivity mode. Don't be afraid to use this as an excuse to cancel. If you are in a soft-money institution, remember who pays your salary.
  2. Try to bunch all the necessary meetings together into one day.
  3. Set aside at least one day a week to stay home and work for 10 hours straight.

Jason Fried also recommends that every workplace declare a day on which no one talks. No meetings, no chit-chat, no friendly banter, etc. No-talk Thursdays, anyone?

21 Dec

Sunday data/statistics link roundup (12/21/14)

James Stewart, author of the most popular Calculus textbook in the world, passed away. In case you wonder if there is any money in textbooks, he had a $32 million house in Toronto. Maybe I should get out of MOOCs and into textbooks.

  1. This post on Medium about a new test for causality is making the rounds. The authors of the original paper make clear that their assumptions render the results basically unrealistic for any real analysis, for example: "We simplify the causal discovery problem by assuming no confounding, selection bias and feedback." The Medium article is too bold, and as I replied to an economist who tweeted that there was a new test that could distinguish causality: "Nope".
  2. I'm excited that Rafa + the ASA have started a section on Genomics and Genetics. It is nice to have a place to belong within our community. I hope it can be a place where folks who aren't into the hype (a lot of those in genomics), but really care about applications, can meet each other and work together.
  3. Great essay by Hanna W. about data, machine learning and fairness. I love this quote: "in order to responsibly articulate and address issues relating to bias, fairness, and inclusion, we need to stop thinking of big data sets as being homogeneous, and instead shift our focus to the many diverse data sets nested within these larger collections." (via Hilary M.)
  4. Over at Flowing Data they ran down the best data visualizations of the year.
  5. This rant from Dirk E. perfectly encapsulates every annoying thing about the Julia versus R comparisons I see regularly.
  6. We are tantalizingly close to 1 million page views for the year for Simply Stats. Help get us over the edge, share your favorite simply stats article with all your friends using the hashtag #simplystats1e6

19 Dec

Interview with Emily Oster

[Photo: Emily Oster]

Emily Oster is an Associate Professor of Economics at Brown University. She is a frequent and highly respected contributor to 538, where she brings clarity to areas of interest to parents, pregnant women, and the general public where empirical research is conflicting or difficult to interpret. She is also the author of the popular new book about pregnancy, Expecting Better: Why the Conventional Pregnancy Wisdom Is Wrong--and What You Really Need to Know. We interviewed Emily as part of our ongoing interview series with exciting empirical data scientists.
SS: Do you consider yourself an economist, econometrician, statistician, data scientist or something else?
EO: I consider myself an empirical economist. I think my econometrics colleagues would have a hearty laugh at the idea that I'm an econometrician! The questions I'm most interested in tend to have a very heavy empirical component - I really want to understand what we can learn from data. In this sense, there is a lot of overlap with statistics. But at the end of the day, the motivating questions and the theories of behavior I want to test come straight out of economics.
SS: You are a frequent contributor to 538. Many of your pieces are attempts to demystify often-conflicting sets of empirical research (about concussions and suicide, or the dangers of water fluoridation). What would you say are the issues that make empirical research about these topics most difficult?
EO: In nearly all the cases, I'd summarize the problem as: "The data isn't good enough." Sometimes this is because we only see observational data, not anything randomized. A large share of studies using observational data that I discuss have serious problems with either omitted variables or reverse causality (or both). This means that the results are suggestive, but really not conclusive. A second issue is that even when we do have some randomized data, it's usually on a particular population, or a small group, or in the wrong time period. In the fluoride case, the studies which come closest to being "randomized" are from 50 years ago. How do we know they still apply now? This makes even these studies challenging to interpret.
SS: Your recent book "Expecting Better: Why the Conventional Pregnancy Wisdom Is Wrong--and What You Really Need to Know" takes a similar approach to pregnancy. Why do you think there are so many conflicting studies about pregnancy? Is it because it is so hard to perform randomized studies?
EO: I think the inability to run randomized studies is a big part of this, yes. One area of pregnancy where the data is actually quite good is labor and delivery. If you want to know the benefits and consequences of pain medication in labor, for example, it is possible to point you to some reasonably sized randomized trials. For various reasons, there has been more willingness to run randomized studies in this area. When pregnant women want answers to less medical questions (like, "Can I have a cup of coffee?") there is typically no randomized data to rely on. Because the possible benefits of drinking coffee while pregnant are pretty much nil, it is difficult to conceptualize a randomized study of this type of thing.
Another big issue I found in writing the book was that even in cases where the data was quite good, data often diverges from practice. This was eye-opening for me and convinced me that in pregnancy (and probably in other areas of health) people really do need to be their own advocates and know the data for themselves.
SS: Have you been surprised by the backlash to your book over your discussion of the zero-alcohol policy during pregnancy?
EO: A little bit, yes. This backlash has died down a lot as pregnant women actually read the book and use it. As it turns out, the discussion of alcohol makes up a tiny fraction of the book and most pregnant women are more interested in the rest of it!  But certainly when the book came out this got a lot of focus. I suspected it would be somewhat controversial, although the truth is that every OB I actually talked to told me they thought it was fine. So I was surprised that the reaction was as sharp as it was.  I think in the end a number of people felt that even if the data were supportive of this view, it was important not to say it because of the concern that some women would over-react. I am not convinced by this argument.
SS: What are the three most important statistical concepts for new mothers to know? 
EO: I really only have two!
I think the biggest thing is to understand the difference between randomized and non-randomized data and to have some sense of the pitfalls of non-randomized data. I reviewed studies of alcohol where the drinkers were twice as likely as non-drinkers to use cocaine. I think people (pregnant or not) should be able to understand why one is going to struggle to draw conclusions about alcohol from these data.
A second issue is the concept of probability. It is easy to say, "There is a 10% chance of the following" but do we really understand that? If someone quotes you a 1 in 100 risk from a procedure, it is important to understand the difference between 1 in 100 and 1 in 400.  For most of us, those seem basically the same - they are both small. But they are not, and people need to think of ways to structure decision-making that acknowledge these differences.
SS: What computer programming language is most commonly taught for data analysis in economics? 
EO: So, I think the majority of empirical economists use Stata. I have been seeing more R, as well as a variety of other things, but more commonly among people in more computationally heavy fields.
SS: Do you have any advice for young economists/statisticians who are interested in empirical research? 
EO:
1. Work on topics that interest you. As an academic you will ultimately have to motivate yourself to work. If you aren't interested in your topic (at least initially!), you'll never succeed.
2. One project which is 100% done is way better than five projects at 80%. You need to actually finish things, something which many of us struggle with.
3. Presentation matters. Yes, the substance is the most important thing, but don't discount the importance of conveying your ideas well.