Simply Statistics: Graduate student data analysis inspired by a high-school teacher

I love watching TED talks. One of my absolute favorites is the talk by Dan Meyer on how math class needs a makeover. Dan also has one of the more fascinating blogs I have read. He talks about math education, primarily K-12 education. His posts on curriculum design, assessment, work ethic, and homework are really, really good. In fact, just go read all his author choices. You won’t regret it.

The best quote from the talk is:

Ask yourselves, what problem have you solved, ever, that was worth solving, where you knew all of the given information in advance? Where you didn’t have a surplus of information and have to filter it out, or you didn’t have insufficient information and have to go find some?

Many of the data analyses I have done or assigned in class have focused on a problem with exactly the right information and relatively little extraneous or missing data. But I have been slowly evolving these problems; as an example, here is a data analysis project that we developed last year for the qualifying exam at JHU. This project is what I consider a first step toward a “less helpful” project model.

The project was inspired by this blog post at Marginal Revolution, which Rafa suggested. As with the homework problem Dan dissects in his talk, there are layers to this problem:

1. Understanding the question
2. Downloading and filtering the data
3. Exploratory analysis
4. Fitting models/interpreting results
5. Synthesis and writing the results up
6. Reproducibility of the R code

For this analysis, I was pretty specific with 1. Understanding the question:

(1) The association between enrollment and the percent of students scoring “Advanced” on the MSA in Reading and Math in the 5th grade.

(2) The change in the number of students scoring “Advanced” in Reading and Math from one year to the next (at minimum consider the change from 2009-2010) versus enrollment.

(3) Potential reasons for results like those in Table 1.

I did not, however, mention the key idea from the Marginal Revolution post. For a qualifying exam, I think this level of specificity is necessary, but for an in-class project I would have removed this information so students had to “discover the question” themselves.

I was also pretty specific with the data source, suggesting the Maryland Education department’s website. However, several students went above and beyond and found other data sources, or downloaded more data than I suggested. In the future, I think I will leave this off too. My Google/data-finding skills don’t hold a candle to those of my students.

Steps 3-5 were summed up with the statement:

Your project is to analyze data from the MSA and write a short letter either in favor of or against spending money to decrease school sizes.

This is one part of the exam I’m happy with. It is sufficiently vague to let the students come to their own conclusions. It also suggests that the students should draw conclusions and support them with statistical analyses. One of the major difficulties I have struggled with in teaching this class is getting students to state a conclusion as a result of their analysis and to quantify how uncertain they are about that decision. In my mind, this is different from just the uncertainty associated with a single parameter estimate.

It was surprising how much requiring reproducibility helped students focus their analyses. I think this is because they had to organize and collect their code, which helped them organize their analysis. Also, there was a strong correlation between reproducibility and the quality of the written reports.

Going forward, I have a few ideas for how I would change my data analysis projects:

Be less helpful - be less clear about the problem statement, data sources, etc. I definitely want students to get more practice formulating problems.
Focus on writing/synthesis - my students are typically very good at fitting models, but sometimes struggle with putting together the “story” of an analysis.
Stress much less about whether specific methods will work well on the data analyses I suggest. One of the more helpful things I think these messy problems produce is a chance to figure out what works and what doesn’t on real world problems.

Related Posts: Rafa on the future of graduate education, Roger on applied statistics journals.