Tag: project ideas


A statistician loves the #insurancepoll...now how do we analyze it?

Amanda Palmer broke Twitter yesterday with her insurance poll. She started off just talking about how hard it is for musicians who rarely have health insurance, but then wandered into polling territory. She sent out a request for people to respond with the following information:

quick twitter poll. 1) COUNTRY?! 2) profession? 3) insured? 4) if not, why not, if so, at what cost per month (or covered by job)?

This quick little poll struck a nerve with people and her Twitter feed blew up. Long story short, tons of interesting information was gathered from folks. This kind of information, particularly what health insurance costs for folks in different places, is frequently kept semi-obscured. It isn’t the sort of info that insurance companies necessarily publicize widely, and it isn’t the sort of thing people talk about.

The results were really fascinating and it’s worth reading the above blog post or checking out the hashtag: #insurancepoll. But the most fascinating thing for me as a statistician was thinking about how to analyze these data. @aubreyjaubrey is apparently collecting the data someplace; hopefully she’ll make it public.

At least two key issues spring to mind:

  1. This is a massive convenience sample.
  2. It is being collected through a social network.

I’m sure there are more. If a student is looking for an amazingly interesting/rich data set and some seriously hard stats problems, they should get in touch with Aubrey and see if they can make something of it!
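Before any of the statistical issues can be tackled, the free-text replies have to be turned into rows of data. Here is a minimal sketch of that step in Python. It assumes replies follow the numbered 1) to 4) layout of the original poll question; the function name `parse_reply` and the field names are hypothetical, and real replies will be much messier than this.

```python
import re

# Hypothetical field names, one per question in the poll.
FIELDS = ("country", "profession", "insured", "cost_or_reason")


def parse_reply(text):
    """Split a reply such as '1) USA 2) nurse 3) yes 4) $250/mo' into fields.

    Returns None when the reply does not answer all four numbered questions.
    """
    # Splitting on a capturing group keeps the question numbers in the result:
    # ['', '1', ' USA ', '2', ' nurse ', ...]
    parts = re.split(r"\b([1-4])\)", text)
    answers = {}
    for number, answer in zip(parts[1::2], parts[2::2]):
        answers[FIELDS[int(number) - 1]] = answer.strip(" ,;.")
    if len(answers) != len(FIELDS):
        return None
    return answers


if __name__ == "__main__":
    print(parse_reply("1) USA 2) nurse 3) yes 4) $250/mo"))
```

Clean parsing does nothing about the two issues above: the respondents are still self-selected followers reached through a social network, so any summary of the parsed data describes that group and not the general population.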


Statistics project ideas for students (part 2)

A little while ago I wrote a post on statistics project ideas for students. In honor of the first Simply Statistics Coursera offering, Computing for Data Analysis, here is a new list of student projects for folks excited about trying out those new R programming skills. Again I have rated each project with my best guess at the difficulty and effort required. Happy computing!

Data Analysis

  1. Use city data to predict areas with the highest risk for parking tickets. Here is the data for Baltimore. (Difficulty: Moderate, Effort: Low/Moderate)
  2. If you have a Fitbit with a premium account, download the data into a spreadsheet (or get Chris’s data). Then build various predictors using the data: (a) are you running or walking, (b) are you having a good day or not, (c) did you eat well that day or not, (d) etc. For special bonus points, create a blog with your new discoveries and share your data with the world. (Difficulty: Depends on what you are trying to predict, Effort: Moderate with Fitbit/Jawbone/etc.)

Data Collection/Synthesis

  1. Make a list of skills associated with each component of the Data Scientist Venn Diagram. Then update the data scientist R function described in this post to ask a set of questions, then plot people on the diagram. Hint: check out the readline() function. (Difficulty: Moderately low, Effort: Moderate)
  2. HealthData.gov has a ton of data from various sources about public health, medicines, etc. Some of this data is super useful for projects/analysis and some of it is just data dumps. Create an R package that downloads data from healthdata.gov and gives some measures of how useful/interesting it is for projects (e.g. number of samples in the study, number of variables measured, is it summary data or raw data, etc.) (Difficulty: Moderately hard, Effort: High)
  3. Build an up-to-date aggregator of R tutorials/how-to videos, and summarize/rate each one so that people know which ones to look at for learning which tasks. (Difficulty: Low, Effort: Medium)

Tool building

  1. Build software that creates a 2-d author list and averages people’s 2-d author lists. (Difficulty: Medium, Effort: Low)
  2. Create an R package that interacts with and downloads data from government websites and processes it in a way that is easy to analyze. (Difficulty: Medium, Effort: High)


The Killer App for Peer Review

A little while ago, over at Genomes Unzipped, Joe Pickrell asked, “Why publish science in peer reviewed journals?” He points out the flaws with the current peer review system and suggests how we can do better. What he suggests is missing is the killer app for peer review. 

Well, PLoS has now developed an API where you can easily access tons of data on the papers published in those journals, including downloads, citations, number of social bookmarks, and mentions in major science blogs. Along with Mendeley, a free reference manager, they have launched a competition to build cool apps with their free data.

Seems like, with the right statistical analysis/cool features, a recommender system for, say, PLoS One could have most of the features suggested by Joe in his article. One idea would be an RSS feed based on an idea like the Pandora music sharing service. You input a couple of papers you like from the journal, then it creates an RSS feed with papers similar to those papers.
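To make the "papers similar to the ones you like" idea concrete, here is a minimal sketch in Python of one way it could work. It uses plain word counts and cosine similarity on abstracts, which is a stand-in technique chosen for illustration and not how Pandora or the PLoS API works. The function names and the toy abstracts are all hypothetical; a real app would pull text and usage data from the API and would still need to write the results out as an RSS feed.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Count lowercase alphabetic words in a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(liked, candidates, top_n=3):
    """Rank candidate papers by similarity to the papers the reader liked.

    liked: list of abstract strings.
    candidates: dict mapping paper title to abstract string.
    """
    # Pool the liked abstracts into a single profile of word counts.
    profile = Counter()
    for text in liked:
        profile.update(term_counts(text))
    scored = [(cosine(profile, term_counts(text)), title)
              for title, text in candidates.items()]
    scored.sort(reverse=True)
    # Drop papers that share no words with the profile.
    return [title for score, title in scored[:top_n] if score > 0]


if __name__ == "__main__":
    liked = ["gene expression in yeast"]
    candidates = {
        "Paper A": "yeast gene regulation",
        "Paper B": "galaxy formation models",
    }
    print(recommend(liked, candidates))
```

Raw word counts are about the crudest similarity measure available. The download, citation, and bookmark data mentioned above would be natural extra signals for ranking, and that is where the interesting statistics would come in.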