Simply Statistics


Introduction to Linear Models and Matrix Algebra MOOC starts this Monday Feb 16


Matrix algebra is the language of modern data analysis. We use it to develop and describe statistical and machine learning methods, and to code efficiently in languages such as R, MATLAB, and Python. Concepts such as principal component analysis (PCA) are best described with matrix algebra. It is particularly useful for describing linear models.

Linear models are everywhere in data analysis. ANOVA, linear regression, limma, edgeR, DESeq, most smoothing techniques, and batch correction methods such as SVA and ComBat are based on linear models. In this two-week MOOC we will describe the basics of matrix algebra, demonstrate how linear models are used in the life sciences, and show how to implement these efficiently in R.
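To give a flavor of how matrix algebra describes a linear model, here is a minimal sketch that fits a straight line by solving the normal equations (X^T X) b = X^T y. It is written in plain Python rather than R, the data are made up for illustration, and the helper names (transpose, matmul, solve_2x2) are hypothetical; it is not taken from the course materials.

```python
# Fit y = b0 + b1 * x by solving the normal equations (X^T X) b = X^T y.
# All data and helper names here are illustrative assumptions.

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def solve_2x2(a, rhs):
    """Solve a 2x2 linear system using the closed-form inverse."""
    (p, q), (r, s) = a
    det = p * s - q * r
    return [(s * rhs[0] - q * rhs[1]) / det, (p * rhs[1] - r * rhs[0]) / det]

x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]  # made-up data lying exactly on y = 1 + 2x

X = [[1.0, xi] for xi in x]  # design matrix: intercept column plus x
Xt = transpose(X)
XtX = matmul(Xt, X)          # 2x2 matrix X^T X
Xty = [row[0] for row in matmul(Xt, [[yi] for yi in y])]  # vector X^T y

beta = solve_2x2(XtX, Xty)   # [intercept, slope]
print(beta)
```

The same design-matrix idea extends to ANOVA and the other methods listed above: only the columns of X change, not the algebra.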

Update: Here is the link to the class


Is Reproducibility as Effective as Disclosure? Let's Hope Not.


Jeff and I just this week published a commentary in the Proceedings of the National Academy of Sciences on our latest thinking on reproducible research and its ability to solve the reproducibility/replication "crisis" in science (there's a version on arXiv too). In a nutshell, we believe reproducibility (making data and code available so that others can recompute your results) is an essential part of science, but it is not going to end the crisis of confidence in science. In fact, I don't think it'll even make a dent. The problem is that reproducibility, as a tool for preventing poor research, comes in at the wrong stage of the research process (the end). While requiring reproducibility may deter people from committing outright fraud (a small group), it won't stop people who just don't know what they're doing with respect to data analysis (a much larger group).

In an eerie coincidence, Jesse Eisinger, of the investigative journalism non-profit ProPublica, has just published a piece on the New York Times DealBook site discussing how disclosure rules in the financial industry have produced meager results. He writes:

Over the last century, disclosure and transparency have become our regulatory crutch, the answer to every vexing problem. We require corporations and government to release reams of information on food, medicine, household products, consumer financial tools, campaign finance and crime statistics. We have a booming “report card” industry for a range of services, including hospitals, public schools and restaurants.

The rationale for all this disclosure is that

someone, somewhere reads the fine print in these contracts and keeps corporations honest. It turns out what we laymen intuit is true: No one reads them, according to research by a New York University law professor, Florencia Marotta-Wurgler.

But disclosure is nevertheless popular because how could you be against it?

The disclosure bonanza is easy to explain. Nobody is against it. It’s politically expedient. Companies prefer such rules, especially in lieu of actual regulations that would curtail bad products or behavior. The opacity lobby — the remora fish class of lawyers, lobbyists and consultants in New York and Washington — knows that disclosure requirements are no bar to dodgy practices. You just have to explain what you’re doing in sufficiently incomprehensible language, a task that earns those lawyers a hefty fee.

In the now infamous Duke Saga, Keith Baggerly was able to reproduce the work of Potti et al. after roughly 2,000 hours of work because the data were publicly available (although the code was not). It's not clear how much time would have been saved if the code had been available, but it seems reasonable to assume that it would have taken some amount of time to understand the analysis, if not reproduce it. Once the errors in Potti's work were discovered, it took 5 years for the original Nature Medicine paper to be retracted.

Although you could argue that the process worked in some sense, it came at tremendous cost of time and money. Wouldn't it have been better if the analysis had been done right in the first place?


The trouble with evaluating anything


It is very hard to evaluate people's productivity or work in any meaningful way. This problem is the source of:

  1. Consternation about peer review
  2. The failure of post-publication peer review
  3. Consternation about faculty evaluation
  4. Major problems at companies like Yahoo and Microsoft

Roger and I were just talking about this problem in the context of evaluating the impact of software as a faculty member and Roger suggested the problem is that:

Evaluating people requires real work and so people are always looking for shortcuts

To evaluate a person's work or their productivity requires three things:

  1. To be an expert in what they do
  2. To have absolutely no reason to care whether they succeed or not
  3. To have time available to evaluate them

These three fundamental things are at the heart of why it is so hard to get good evaluations of people and why peer review and other systems are under such fire. The main source of the problem is the conflict between 1 and 2. The group of people in any organization or on any scale that is truly world class at any given topic, from software engineering to history, is small. It has to be by definition. This group of people inevitably has some reason to care about the success of the other people in that same group. Either they work with the other world-class people and want them to succeed, or they are competing with them, intentionally or unintentionally.

The conflict between being an expert and having no stake wouldn't be such a problem if it weren't for issue number 3: the time to evaluate people. To truly get good evaluations, what you need is for someone who isn't an expert in a field, and so has no stake, to take the time to become an expert and then evaluate the person/software. But this requires a huge amount of effort on the part of a reviewer, who has to become an expert in a new field. Given that reviewing is often considered the least important task in people's workflow, as evidenced by the low value we put on people acting as peer reviewers for journals, or the little credit people get for doing a good job evaluating others for promotion in companies, it is no wonder people don't take the time to become experts.

I actually think that tenure review committees at forward-thinking places may be the best at this (Rafa said the same thing about NIH study section). They at least attempt to get outside reviews from people who are unbiased about the work that a faculty member is doing before that faculty member is promoted. This system, of course, has large and well-documented problems, but I think it is better than having a person's direct supervisor - who clearly has a stake - being the only person evaluating them. It is also better than only using quantifiable metrics like the number of papers and the impact factor of the corresponding journals. I also think that most senior faculty who evaluate people take the job very seriously despite the only incentive being good citizenship.

Since real evaluation requires hard work and expertise, most of the time people are looking for a shortcut. These shortcuts typically take the form of quantifiable metrics. In the academic world these shortcuts are things like:

  1. Number of papers
  2. Citations to academic papers
  3. The impact factor of a journal
  4. Downloads of a person's software

I think all of these things are associated with quality but none define quality. You could try to model the relationship, but it is very hard to come up with a universal definition for the outcome you are trying to model. In academics, some people have suggested that open review or post-publication review solves the problem. But this is only true for a very small subset of cases that violate rule number 2. The only papers that get serious post-publication review are those where people have an incentive for the paper to go one way or the other. This means that papers in Science will be post-pub reviewed much more often than equally important papers in discipline-specific journals - just because people care more about Science. This will leave the vast majority of papers unreviewed - as evidenced by the relatively modest number of papers reviewed by PubPeer or PubMed Commons.

I'm beginning to think that the only way to do evaluation well is to hire people whose only job is to evaluate something well. In other words, peer reviewers who are paid to review papers full time and are only measured by how often those papers are retracted or proved false. Or tenure reviewers who are paid exclusively to evaluate tenure cases and are measured by how well the post-tenure process goes for the people they evaluate and whether there is any measurable bias in their reviews.

The trouble with evaluating anything is that it is hard work and right now we aren't paying anyone to do it.



Johns Hopkins Data Science Specialization Top Performers


Editor's note: The Johns Hopkins Data Science Specialization is the largest data science program in the world. Brian, Roger, and I conceived the program at the beginning of January 2014, then built, recorded, and launched the classes starting in April 2014 with the help of Ira. Since April 2014 we have enrolled 1.76 million students and awarded 71,589 Signature Track verified certificates. The first capstone class ran in October - just 7 months after the first classes launched and 4 months after all classes were running. Despite this incredibly short time frame, 917 students finished all 9 classes and enrolled in the Capstone Course. 478 successfully completed the course.

When we first announced the Data Science Specialization, we said that the top performers would be profiled here on Simply Statistics. Well, that time has come, and we've got a very impressive group of participants that we want to highlight. These folks have successfully completed all nine MOOCs in the specialization and earned top marks in our first capstone session with SwiftKey. We had the pleasure of meeting some of them last week in a video conference, and we were struck by their insights and expertise. Check them out below.

Sasa Bogdanovic






Sasa Bogdanovic is passionate about everything data. For the last 6 years, he's been working in the iGaming industry, providing data products (integrations, data warehouse architectures and models, business intelligence tools, analyst reports and visualizations) for clients, helping them make better, data-driven, business decisions.

Why did you take the JHU Data Science Specialization?

Although I've been working with data for many years, I wanted to take a different perspective and learn more about data science concepts and get insights into the whole pipeline from acquiring data to developing final data products. I also wanted to learn more about statistical models and machine learning.

What are you most proud of doing as part of the JHU Data Science Specialization?

I am very happy to have discovered the data science field. It is a whole new world that I find fascinating and inspiring to explore. I am looking forward to my new career in data science. This will allow me to combine all my previous knowledge and experience with my new insights and methods. I am very proud of every single quiz, assignment and project. For sure, the capstone project was a culmination, and I am very proud and happy to have succeeded in making a solid data product and to be one of the top performers in the group. For this I am very grateful to the instructors, community TAs, all other peers for their contributions in the forums, and Coursera for putting it all together and making it possible.

How are you planning on using your Data Science Specialization Certificate?

I have already put the certificate in motion. My company is preparing new projects, and I expect the certificate to add weight to our proposals.

Alejandro Morales Gallardo







I’m a trained physicist with strong coding skills. I have a passion for dissecting datasets to find the hidden stories in data and produce insights through creative visualizations. A hackathon and open-data aficionado, I have an interest in using data (and science) to improve our lives.

Why did you take the JHU Data Science Specialization?

I wanted to close a gap in my skills and transition into becoming a full-blown Data Scientist by learning key concepts and practices in the field. Learning R, an industry-relevant language, while creating a portfolio to showcase my abilities in the entire data science pipeline seemed very attractive.

What are you most proud of doing as part of the JHU Data Science Specialization?

I'm most proud of the Predictive Text App I developed. With the Capstone Project, it was extremely rewarding to be able to tackle a brand new data type and learn about text mining and natural language processing while building a fun and attractive data product. I was particularly proud that the accuracy of my app was not that far off from the SwiftKey smartphone app. I'm also proud of being a top performer!

How are you planning on using your Data Science Specialization Certificate?

I want to apply my new set of skills to develop other products, analyze new datasets and keep growing my portfolio. It is also helpful to have Verified Certificates to show prospective employers.

Nitin Gupta







Nitin is an independent trader and quant strategist with over 13 years of multi-faceted experience in the investment management industry. In the past he worked for a leading investment management firm where he built automated trading and risk management systems and gained complete life-cycle expertise in creating systematic investment products. He has a background in computer science with a strong interest in machine learning and its applications in quantitative modeling.

Why did you take the JHU Data Science Specialization?

I was fortunate to have done the first Machine Learning course taught by Prof. Andrew Ng at the launch of Coursera in 2012, which really piqued my interest in the topic. The next course I did on Coursera was Prof. Roger Peng's Computing For Data Analysis which introduced me to R. I realized that R was ideally suited for the quantitative modeling work I was doing. When I learned about the range of topics that the JHU DSS would cover - from the best practices in tidying and transforming data to modeling, analysis and visualization - I did not hesitate to sign up. Learning how to do all of this in an ecosystem built around R has been a huge plus.

What are you most proud of doing as part of the JHU Data Science Specialization?

I am quite pleased with the web apps I built which utilize the concepts learned during the track. One of my apps visualizes and compares historical stock performance with other stocks and market benchmarks after querying the data directly from web resources. Another one showcases a predictive typing engine that dynamically predicts the next few words to use and append, as the user types a sentence. The process of building these apps provided a fantastic learning experience. Also, for the first time I built something that even my near and dear ones could use and appreciate, which is terrific.

How are you planning on using your Data Science Specialization Certificate?

The broad skill set developed through this specialization could be applied across multiple domains. My current focus is on building robust quantitative models for systematic trading strategies that could learn and adapt to changing market environments. This would involve the application of machine learning techniques among other skills learned during the specialization. Using R and Shiny to interactively analyze the results would be tremendously useful.

Marc Kreyer







Marc Kreyer is an expert business analyst and software engineer with extensive experience in financial services in Austria and Liechtenstein. He successfully finishes complex projects by not only using broad IT knowledge but also outstanding comprehension of business needs. Marc loves combining his programming and database skills with his affinity for mathematics to transform data into insight.

Why did you take the JHU Data Science Specialization?

There are many data science MOOCs, but usually they are independent 4-6 week courses. The JHU Data Science Specialization was the first offering of a series of courses that build upon each other.

What are you most proud of doing as part of the JHU Data Science Specialization?

Creating a working text prediction app without any prior NLP knowledge and only minimal assistance from instructors.

How are you planning on using your Data Science Specialization Certificate?

Knowledge and experience are the most valuable things gained from the Data Science Specialization. As they can't be easily shown to future employers, the certificate can be a good indicator for them. Unfortunately there is neither an issue date nor a verification link on the certificate, therefore it will be interesting to see how valuable it really will be.

Hsing Liu



I studied in the U.S. for a number of years, and received my M.S. in mathematics from NYU before returning to my home country, Taiwan. I'm most interested in how people think and learn, and education in general. This year I'm starting a new career as an iOS app engineer.

Why did you take the JHU Data Science Specialization?

In my brief past job as an instructional designer, I read a lot about the new wave of online education, and was especially intrigued by how Khan Academy's data science division is using data to help students learn. It occurred to me that to leverage my math background and make a bigger impact in education (or otherwise), data science could be an exciting direction to take.

What are you most proud of doing as part of the JHU Data Science Specialization?

It may sound boring, but I'm proud of having done my best for each course in the track, going beyond the bare requirements when I'm able. The parts of the Specialization fit into a coherent picture of the discipline, and I'm glad to have put in the effort to connect the dots and gained a new perspective.

How are you planning on using your Data Science Specialization Certificate?

I'm listing the certificate on my resume and LinkedIn, and I expect to be applying what I've learned once my company's e-commerce app launches.

Yichen Liu


Yichen Liu is a business analyst at Toyota Western Australia where he is responsible for business intelligence development, data analytics and business improvement. His prior experience includes working as a sessional lecturer and tutor at Curtin University in finance and econometrics units.

Why did you take the JHU Data Science Specialization?

Recognising the trend that the world is more data-driven than before, I felt it was necessary to gain further understanding in data analysis to tackle both current and future challenges at work.

What are you most proud of doing as part of the JHU Data Science Specialization?

What I am most proud of in the program is that I have gained some basic knowledge in a totally new area, natural language processing. Though its connection with my current working area is limited, I see the future of data analysis as being more unstructured-data-driven and am willing to develop more knowledge in this area.

How are you planning on using your Data Science Specialization Certificate?

I see the certificate as a stepping stone into the data science world, and would like to conduct more advanced studies in data science especially for unstructured data analysis.

Johann Posch


After graduating from Vienna University of Technology with a specialization in Artificial Intelligence I joined Microsoft. There I worked as a developer on various products, but the majority of the time as a Windows OS developer. After venturing into start-ups for a few years I joined GE Research to work on the Predix Big Data Platform, and recently I joined the Industrial Data Science team.

Why did you take the JHU Data Science Specialization?

Ever since I wrote my master's thesis on neural networks I have been intrigued by machine learning. I see data science as a field where great advances will happen over the next decade and as an opportunity to positively impact millions of lives. I like how JHU structured the course series.

What are you most proud of doing as part of the JHU Data Science Specialization?

Being able to complete the JHU Data Science Specialization in 6 months and to get a distinction on every one of the courses was a great success. However, the best moment was probably the way my capstone project (next word prediction) turned out. The model could be trained in incremental steps, and it was able to provide meaningful options in real time.

How are you planning on using your Data Science Specialization Certificate?

The course covered the concepts and tools needed to successfully address data science problems. It gave me the confidence and knowledge to apply for a data science position. I am now working in the field at GE Research. I am grateful to all who made this Specialization happen!

Jason Wilkinson







Jason Wilkinson is a trader of commodity futures and other financial securities at a small proprietary trading firm in New York City. He and his wife, Katie, and dog, Charlie, can frequently be seen at the Jersey shore. And no, it's nothing like the tv show, aside from the fist pumping.

Why did you take the JHU Data Science Specialization?

The JHU Data Science Specialization helped me to prepare as I begin working on a Masters of Computer Science specializing in Machine Learning at Georgia Tech and also in researching algorithmic trading ideas. I also hope to find ways of using what I've learned in philanthropic endeavors, applying data science for social good.

What are you most proud of doing as part of the JHU Data Science Specialization?

I'm most proud of going from knowing zero R code to being able to apply it in the capstone and other projects in such a short amount of time.

How are you planning on using your Data Science Specialization Certificate?

The knowledge gained in pursuing the specialization certificate alone was worth the time put into it. A certificate is just a piece of paper. It's what you can do with the knowledge gained that counts.

Uli Zellbeck




I studied economics in Berlin with focus on econometrics and business informatics. I am currently working as a Business Intelligence / Data Warehouse Developer in an e-commerce company. I am interested in recommender systems and machine learning.

Why did you take the JHU Data Science Specialization?

I wanted to learn about Data Science because it provides a different approach to solving business problems with data. I chose the JHU Data Science Specialization on Coursera because it promised a wide range of topics and I like the idea of online courses. Also, I had experience with R and I wanted to deepen my knowledge of this tool.

What are you most proud of doing as part of the JHU Data Science Specialization?

There are two things. I successfully took all nine courses in 4 months and the capstone project was really hard work.

How are you planning on using your Data Science Specialization Certificate?

I might get the chance to develop a Data Science department at my company. I would like to use the certificate as a basis for getting a deeper knowledge of the many parts of Data Science.

Fred Zheng Zhenhao







When I enrolled in the JHU Data Science Specialization, I was an undergraduate student at The Hong Kong Polytechnic University. Before that, I had read some data mining books and felt excited about the content, but I never got to implement any of the algorithms because I had barely any programming skill. After taking this series of courses, I am now able to use R to analyze the web content that is related to my research.

Why did you take the JHU Data Science Specialization?

I took this series of courses as a challenge to myself. I wanted to see whether my interest could carry me through 9 courses and 1 capstone project. And I do want to learn more in this field. This specialization is different from other data mining or machine learning classes in that it covers the entire process, including Git, R, R Markdown, Shiny, etc., and I think these are necessary skills too.

What are you most proud of doing as part of the JHU Data Science Specialization?

Getting my word prediction app to respond in 0.05 seconds is already exciting, and one of the reviewers said, "congratulations your engine came up with the most correct prediction among those I reviewed: 3 out of 5, including one that stumped every one else: 'child might stick her finger or a foreign object into an electrical (outlet)'." I guess that's the part I am most proud of.

How are you planning on using your Data Science Specialization Certificate?

It definitely goes in my CV for future job hunting.




Early data on knowledge units - atoms of statistical education


Yesterday I posted about atomizing statistical education into knowledge units. You can try out the first knowledge unit here: The early data is in and it is consistent with many of our hypotheses about the future of online education.


  1. Completion rates are high when segments are shorter
  2. You can learn something about statistics in a short amount of time (2 minutes to complete, many people got all questions right)
  3. People will consume educational material on tablets/smartphones more and more.




Knowledge units - the atoms of statistical education


Editor's note: This idea is Brian's and is based on conversations with him and Roger; I just executed it.

The length of academic courses has traditionally ranged between a few days for a short course to a few months for a semester-long course. Lectures are typically either 30 minutes or one hour. Term and lecture lengths have been dictated by tradition and the relative inconvenience of coordinating schedules of the instructors and students for shorter periods of time. As classes have moved online the barrier of inconvenience to varying the length of an academic course has been removed. Despite this flexibility, most academic online courses adhere to the traditional semester-long format. For example, the first massive open online courses were simply semester-long courses directly recorded and offered online.

Data collected from massive open online courses suggest that shrinking both the length of recorded lectures and the length of courses leads to higher student retention. These results line up with data on other online activities such as YouTube video watching or form completion, which also show that shorter activities lead to higher completion rates.

We have some of the earliest and most highly subscribed massive open online courses through the Coursera platform: Data Analysis, Computing for Data Analysis, and Mathematical Biostatistics Bootcamp. Our original courses were translated from courses we offered locally and were therefore closer to semester-long, with longer lectures ranging from 15-30 minutes. Based on feedback from our students and the data we observed about completion rates, we made the decision to break our courses down into smaller, one-month courses with no more than two hours of lecture material per week. Since then, we have enrolled more than a million students in our MOOCs.

The data suggest that the shorter you can make an academic unit online, the higher the completion percentage. The question then becomes “How short can you make an online course?” To answer this question requires a definition of a course. For our purposes we will define a course as an educational unit consisting of the following three components:


  • Knowledge delivery - the distribution of educational material through lectures, audiovisual materials, and course notes.
  • Knowledge evaluation - the evaluation of how much of the knowledge delivered to a student is retained.
  • Knowledge certification - an independent claim or representation that a student has learned some set of knowledge.


A typical university class delivers 36 hours = 12 weeks x 3 hours/week of content knowledge, evaluates that knowledge with something on the order of 10 homework assignments and 2 tests, and results in a certification equivalent to 3 university credits. With this definition, what is the smallest possible unit that satisfies all three components of a course? We will call this smallest possible unit one knowledge unit. The smallest knowledge unit that satisfies all three components is a course that:

  • Delivers a single unit of content - We will define a single unit of content as a text, image, or video describing a single concept.
  • Evaluates that single unit of content -  The smallest unit of evaluation possible is a single question to evaluate a student’s knowledge.
  • Certifies knowledge - Provides the student with a statement of successful evaluation of the knowledge in the knowledge unit.

An example of a knowledge unit appears here: The knowledge unit consists of a short (less than 2 minute) video and 3 quiz questions. When completed, the unit sends the completer an email verifying that the quiz has been completed. Just as an atom is the smallest unit of mass that defines a chemical element, the knowledge unit is the smallest unit of education that defines a course.
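One way to make the three components concrete is to sketch a knowledge unit as a small data structure. The following is only an illustration under stated assumptions: the class names (Question, KnowledgeUnit), the fields, and the wording of the completion statement are all hypothetical, and the real unit sends its verification by email rather than returning a string.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Question:
    prompt: str
    choices: List[str]
    answer: int  # index of the correct choice


@dataclass
class KnowledgeUnit:
    concept: str               # knowledge delivery: a single concept
    content: str               # reference to the text, image, or video (placeholder)
    questions: List[Question]  # knowledge evaluation

    def score(self, responses: List[int]) -> int:
        """Count how many responses match the correct answers."""
        return sum(r == q.answer for q, r in zip(self.questions, responses))

    def certify(self, student: str, responses: List[int]) -> Optional[str]:
        """Knowledge certification: a statement issued once every question is answered."""
        if len(responses) != len(self.questions):
            return None  # quiz not completed, so nothing to certify
        correct = self.score(responses)
        return f"{student} completed '{self.concept}': {correct}/{len(self.questions)} correct"


# Placeholder unit with three quiz questions, mirroring the example above.
unit = KnowledgeUnit(
    concept="placeholder concept",
    content="placeholder-video-reference",
    questions=[
        Question("Placeholder question 1?", ["a", "b"], 0),
        Question("Placeholder question 2?", ["a", "b"], 1),
        Question("Placeholder question 3?", ["a", "b"], 0),
    ],
)
print(unit.certify("student@example.com", [0, 1, 1]))
```

Keeping delivery, evaluation, and certification in one small object is what makes the unit self-contained enough to be combined with others later.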

Shrinking the units down to this scale opens up some ideas about how you can connect them together into courses and credentials. I'll leave that for a future post.


Precision medicine may never be very precise - but it may be good for public health


Editor's note: This post was originally titled: Personalized medicine is primarily a population health intervention. It has been updated with the graph of odds ratios/betas from GWAS studies.

There has been a lot of discussion of personalized medicine, individualized health, and precision medicine in the news and in the medical research community, and President Obama just announced a brand new initiative in precision medicine. Despite this recent attention, it is clear that healthcare has always been personalized to some extent. For example, men are rarely pregnant and heart attacks occur more often among older patients. In these cases, easily collected variables such as sex and age can be used to predict health outcomes and therefore to "personalize" healthcare for those individuals.

So why the recent excitement around personalized medicine? The reason is that it is increasingly cheap and easy to collect more precise measurements about patients that might be able to predict their health outcomes. An example that has recently been in the news is the measurement of mutations in the BRCA genes. Angelina Jolie made the decision to undergo a prophylactic double mastectomy based on her family history of breast cancer and measurements of mutations in her BRCA genes. Based on these measurements, previous studies had suggested she might have a lifetime risk as high as 80% of developing breast cancer.

This kind of scenario will become increasingly common as newer and more accurate genomic screening and predictive tests are used in medical practice. When I read these stories there are two points I think of that sometimes get obscured by the obviously fraught emotional, physical, and economic considerations involved with making decisions on the basis of new measurement technologies:

  1. In individualized health/personalized medicine the "treatment" is information about risk. In some cases treatment will be personalized based on assays. But in many other cases, we still do not (and likely will not) have perfect predictors of therapeutic response. In those cases, the healthcare will be "personalized" in the sense that the patient will get more precise estimates of their likelihood of survival, recurrence etc. This means that patients and physicians will increasingly need to think about/make decisions with/act on information about risks. But communicating and acting on risk is a notoriously challenging problem; personalized medicine will dramatically raise the importance of understanding uncertainty.
  2. Individualized health/personalized medicine is a population-level treatment. Assuming that the 80% lifetime risk estimate was correct for Angelina Jolie, it still means there is a 1 in 5 chance she was never going to develop breast cancer. If that had been her case, then the surgery was unnecessary. So while her decision was based on personal information, there is still uncertainty in that decision for her. The "personal" decision may therefore not always be the "best" decision for any specific individual. It may, however, be the best thing to do for everyone in a population with the same characteristics.

The first point bears serious consideration in light of President Obama's new proposal. We have already collected a massive amount of genetic data about a large number of common diseases. In almost all cases, the amount of predictive information that we can glean from genetic studies is modest. One paper pointed this issue out in a rather snarky way by comparing two approaches to predicting people's heights: (1) averaging their parents' heights - an approach from the Victorian era and (2) combining the latest information on the best genetic markers at the time. It turns out that all the genetic information we gathered isn't as good as averaging parents' heights. Another way to see this is to download data on all genetic variants associated with disease from the GWAS catalog that have a P-value less than 1 x 10^-8. If you do that and look at the distribution of effect sizes, you see that 95% have an odds ratio or beta coefficient less than about 4. Here is a histogram of the effect sizes:





This means that nearly all identified genetic effects are small. The ones that are really large (effect size greater than 100) are not for common disease outcomes; they are for Birdshot chorioretinopathy and hippocampal volume. You can really see this by zooming in on the x-axis to look at the bulk of the distribution of effect sizes, which are mostly less than 2:





These effect sizes translate into very limited predictive capacity for most identified genetic biomarkers.  The implication is that personalized medicine, at least for common diseases, is highly likely to be inaccurate for any individual person. But if we can take advantage of the population-level improvements in health from precision medicine by increasing risk literacy, improving our use of uncertain markers, and understanding that precision medicine isn't precise for any one person, it could be a really big deal.
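To get a feel for why odds ratios of this size carry limited information for an individual, here is a minimal Python sketch that converts an odds ratio into an absolute risk for carriers of a variant. The 10% baseline risk is a made-up number chosen only for illustration, and the function name is hypothetical; the conversion itself is the standard relationship between odds and probability for a binary outcome. It does not apply to beta coefficients for continuous traits.

```python
def risk_with_variant(baseline_risk: float, odds_ratio: float) -> float:
    """Risk for carriers, given the non-carrier risk and an odds ratio."""
    # Convert the baseline probability to odds.
    baseline_odds = baseline_risk / (1.0 - baseline_risk)
    # The odds ratio scales the odds, not the probability.
    carrier_odds = baseline_odds * odds_ratio
    # Convert the carrier odds back to a probability.
    return carrier_odds / (1.0 + carrier_odds)


if __name__ == "__main__":
    baseline = 0.10  # hypothetical baseline risk, not taken from any study
    for odds_ratio in (1.2, 2.0, 4.0):
        risk = risk_with_variant(baseline, odds_ratio)
        print(f"odds ratio {odds_ratio}: carrier risk {risk:.1%}")
```

Under this assumed baseline, even an odds ratio of 2 leaves most carriers unaffected, which is one way to read the claim that such markers are weak predictors for any one person.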


Reproducible Research Course Companion


I'm happy to announce that you can now get a copy of the Reproducible Research Course Companion from the Apple iBookstore. The purpose of this e-book is pretty simple. The book provides all of the key video lectures from my Reproducible Research course offered on Coursera, in a simple offline e-book format. The book can be viewed on a Mac, iPad, or iPad mini.

If you're interested in taking my Reproducible Research course on Coursera and would like a flavor of what the course will be like, then you can view the lectures through the book (the free sample contains three lectures). On the other hand, if you already took the course and would like access to the lecture material afterwards, then this might be a useful add-on. If you are currently enrolled in the course, then this could be a handy way for you to take the lectures on the road with you.

Please note that all of the lectures are still available for free on YouTube via my YouTube channel. Also, the book provides content only. If you wish to actually complete the course, you must take it through the Coursera web site.


Data as an antidote to aggressive overconfidence


A recent NY Times op-ed reminded us of the many biases faced by women at work. A followup op-ed gave specific recommendations for how to conduct ourselves in meetings. In general, I found these very insightful, but I don't necessarily agree with the recommendation that women should "Practice Assertive Body Language". Instead, we should make an effort to judge ideas by their content and not be impressed by body language. More generally, it is a problem that many of the characteristics that help advance careers contribute nothing to intellectual output. One of these is what I call aggressive overconfidence.

Here is an example (based on a true story). A data scientist finds a major flaw with the data analysis performed by a prominent data-producing scientist's lab. Both are part of a large collaborative project. A meeting is held among the project leaders to discuss the disagreement. The data producer is very self-confident in defending his approach. The data scientist, who is not nearly as aggressive, is interrupted so much that she barely gets her point across. The project leaders decide that this seems to be simply a difference of opinion and, for all practical purposes, ignore the data scientist. I imagine this story sounds familiar to many. While in many situations this story ends here, when the results are data driven we can actually fact check opinions that are pronounced as fact. In this example, the data is public and anybody with the right expertise can download the data and corroborate the flaw in the analysis. This is typically quite tedious, but it can be done. Because the key flaws are rather complex, the project leaders, lacking expertise in data analysis, can't make this determination. But eventually, a chorus of fellow data analysts will be too loud to ignore.

That aggressive overconfidence is generally rewarded in academia is a problem. And if this trait is highly correlated with being male, then a manifestation of this is a worsened gender gap. My experience (including reading internet discussions among scientists on controversial topics) has convinced me that this trait is in fact correlated with gender. But the solution is not to help women become more aggressively overconfident. Instead we should continue to strive to judge work based on content rather than style. I am optimistic that more and more, data, rather than who sounds more sure of themselves, will help us decide who wins a debate.



Gorging ourselves on "free" health care: Harvard's dilemma


Editor's note: This is a guest post by Laura Hatfield. Laura is an Assistant Professor of Health Care Policy at Harvard Medical School, with a specialty in Biostatistics. Her work focuses on understanding trade-offs and relationships among health outcomes. Dr. Hatfield received her BS in genetics from Iowa State University and her PhD in biostatistics from the University of Minnesota. She tweets @bioannie

I didn’t imagine when I joined Harvard’s Department of Health Care Policy that the New York Times would be writing about my benefits package. Then a vocal and aggrieved group of faculty rebelled against health benefits changes for 2015, and commentators responded by gleefully skewering entitled-sounding Harvard professors. But I’m a statistician, so I want to talk data.

Health care spending is tremendously right-skewed. The figure below shows the annual spending distribution among people with any spending (~80% of the total population) in two data sources on people covered by employer-sponsored insurance, such as the Harvard faculty. Notice that the y axis is on the log scale. More than half of people spend $1000 or less, but a few very unfortunate folks top out near half a million.


Source: Measuring health care costs of individuals with employer-sponsored health insurance in the US: A comparison of survey and claims data. A. Aizcorbe, E. Liebman, S. Pack, D.M. Cutler, M.E. Chernew, A.B. Rosen. BEA working paper. WP2010-06. June 2010.

If, instead of contributing to my premiums, Harvard gave me the $1000/month premium contribution in the form of wages, I would be on the hook for my own health care expenses. If I stay healthy, I pocket the money, minus income taxes. If I get sick, I have the extra money available to cover the expenses…provided I’m not one of the unlucky 10% of people spending more than $12,000/year. In that case, the additional wages would be insufficient to cover my health care expenses. This “every woman for herself” system lacks the key benefit of insurance: risk pooling. The sickest among us would be bankrupted by health costs. Another good reason for an employer to give me benefits is that I do not pay taxes on this part of my compensation (more on that later).
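The arithmetic behind this wages-instead-of-insurance scenario can be sketched in a few lines of Python. The $1000/month figure and the three spending levels ($1000, $12,000, and roughly half a million) come from the post; the 30% tax rate and the function name are hypothetical, used only to show how taxes move the break-even point.

```python
MONTHLY_CONTRIBUTION = 1000  # dollars per month, the figure used in the post


def net_after_health_costs(annual_spending: float, tax_rate: float = 0.0) -> float:
    """Extra wages left after income tax and a year of health care bills.

    A negative result means the extra wages did not cover the bills.
    """
    extra_wages = 12 * MONTHLY_CONTRIBUTION
    after_tax_wages = extra_wages * (1.0 - tax_rate)
    return after_tax_wages - annual_spending


if __name__ == "__main__":
    # tax_rate=0.30 is an assumed value for illustration only.
    for spending in (1_000, 12_000, 500_000):
        untaxed = net_after_health_costs(spending)
        taxed = net_after_health_costs(spending, tax_rate=0.30)
        print(f"spending ${spending:,}: net ${untaxed:,.0f} untaxed, ${taxed:,.0f} taxed")
```

With no pooling, the person at the far right tail of the spending distribution is left with a shortfall many times the size of the extra wages, which is the point of the paragraph above.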

At the opposite end of the spectrum is the Harvard faculty health insurance plan. Last year, the university paid ~$1030/month toward my premium and I put in ~$425 (tax-free). In exchange for this ~$17,000 of premiums, my family got first-dollar insurance coverage with very low co-pays. Faculty contributions to our collective health care expenses were distributed fairly evenly among all of us, with only minimal cost sharing to reflect how much care each person consumed. The sickest among us were in no financial peril. My family didn’t use much care and thus didn’t get our (or Harvard’s) money’s worth for all that coverage, but I’m ok with it. I still prefer risk pooling.

Here’s the problem: moral hazard. It’s a word I learned when I started hanging out with health economists. It describes the tendency of people to over-consume goods that feel free, such as health care paid through premiums or desserts at an all-you-can-eat buffet. Just look at this array—how much cake do *you* want to eat for $9.99?




One way to mitigate moral hazard is to expose people to more of their cost of care at the point of service instead of through premiums. You might think twice about that fifth tiny cake if you were paying per morsel. This is what the new Harvard faculty plans do: our premiums actually go down, but now we face a modest deductible, $250 per person or $750 max for a family. This is meant to encourage faculty to use their health care more efficiently, but it still affords good protection against catastrophic costs. The out-of-pocket max remains low at $1500 per individual or $4500 per family, with recent announcements to further protect individuals who pay more than 3% of salary in out-of-pocket health costs through a reimbursement program.
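As a rough illustration of how a deductible and an out-of-pocket maximum interact, here is a minimal Python sketch for a single person. The $250 deductible and $1500 maximum are the individual figures quoted above; the 10% coinsurance rate after the deductible and the function name are assumptions, since the post does not describe how costs are shared between the deductible and the maximum. Real plans also involve co-pays and family-level limits that this sketch ignores.

```python
def out_of_pocket(
    annual_bills: float,
    deductible: float = 250.0,   # individual deductible from the post
    oop_max: float = 1500.0,     # individual out-of-pocket maximum from the post
    coinsurance: float = 0.10,   # assumed share paid after the deductible
) -> float:
    """Amount one person pays at the point of service in a year."""
    # The patient pays every dollar up to the deductible.
    paid_under_deductible = min(annual_bills, deductible)
    # Above the deductible, the patient pays only the coinsurance share.
    paid_over_deductible = max(annual_bills - deductible, 0.0) * coinsurance
    # Total patient cost never exceeds the out-of-pocket maximum.
    return min(paid_under_deductible + paid_over_deductible, oop_max)


if __name__ == "__main__":
    for bills in (100, 1_000, 12_000, 500_000):
        print(f"bills ${bills:,}: patient pays ${out_of_pocket(bills):,.2f}")
```

Under these assumptions, small bills are paid entirely by the patient, while catastrophic bills are capped at the out-of-pocket maximum.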

The allocation of individuals’ contributions between premiums and point-of-service costs is partly a question of how we cross-subsidize each other. If Harvard’s total contribution remains the same and health care costs do not grow faster than wages (ha!), then increased cost sharing decreases the amount by which people who use less care subsidize those who use more. How you feel about the “right” level of cost sharing may depend on whether you’re paying or receiving a subsidy from your fellow employees. And maybe your political leanings.

What about the argument that it is better for an employer to “pay” workers by health insurance premium contributions rather than wages because of the tax benefits? While we might prefer to get our compensation in the form of tax-free health benefits vs taxed wages, the university, like all employers, is looking ahead to the Cadillac tax provision of the ACA. So they have to do some re-balancing of our overall compensation. If Harvard reduces its health insurance contributions to avoid the tax, we might reasonably expect to make up that difference in higher wages. The empirical evidence is complicated and suggests that employers may not immediately return savings on health benefits dollar-for-dollar in the form of wages.

As far as I can tell, Harvard is contributing roughly the same amount as last year toward my health benefits, but exact numbers are difficult to find. I switched plan types (into a high-deductible plan, but that’s a topic for another post!), so I can’t find and directly compare Harvard’s contributions in the same plan type this year and last. Peter Ubel argues that if the faculty *had* seen these figures, we might not have revolted. The actuarial value of our plans remains very high (91%, just a bit better than the expensive Platinum plans on the exchanges) and Harvard’s spending on health care has grown from 8% to 12% of the university’s budget over the past few years. Would these data have been sufficient to quell the insurrection? Good question.