Tag: diy


A statistician loves the #insurancepoll...now how do we analyze it?

Amanda Palmer broke Twitter yesterday with her insurance poll. She started off just talking about how hard it is for musicians who rarely have health insurance, but then wandered into polling territory. She sent out a request for people to respond with the following information:

quick twitter poll. 1) COUNTRY?! 2) profession? 3) insured? 4) if not, why not, if so, at what cost per month (or covered by job)?

This quick little poll struck a nerve and her Twitter feed blew up. Long story short, tons of interesting information was gathered from folks. This kind of information is usually semi-obscured, particularly the cost of health insurance for folks in different places. It isn't the sort of info that insurance companies publicize widely, and it isn't the sort of thing people talk about.

The results were really fascinating, and it's worth reading the above blog post or checking out the hashtag #insurancepoll. But the most fascinating thing for me as a statistician was thinking about how to analyze these data. @aubreyjaubrey is apparently collecting the data someplace; hopefully she'll make it public.

At least two key issues spring to mind:

  1. This is a massive convenience sample.
  2. It is being collected through a social network.

I'm sure there are more. If a student is looking for an amazingly interesting/rich data set and some seriously hard stats problems, they should get in touch with Aubrey and see if they can make something of it!


A deterministic statistical machine

As Roger pointed out, the most recent batch of Y Combinator startups included a bunch of data-focused companies. One of these companies, StatWing, is a web-based tool for data analysis that looks like an improvement on SPSS, with more plain text, more visualization, and a lot of the technical statistical details "under the hood". I first read about StatWing on TechCrunch, under the headline "How Statwing Makes It Easier To Ask Questions About Data So You Don't Have To Hire a Statistical Wizard".

StatWing looks super user-friendly, and the idea of democratizing statistical analysis so more people can access these ideas appeals to me. But, as one of the aforementioned statistical wizards, this had me freaked out for a minute. Once I looked at the software, though, I realized it suffers from the same problem as most "user-friendly" statistical software: it makes it really easy to screw up a data analysis. It will tell you when something is significant, and if you don't like that something isn't, you can keep slicing and dicing the data until it is. The key issue behind getting insight from data is knowing when you are fooling yourself with confounders, small effect sizes, or overfitting. StatWing looks like an improvement on the UI experience of data analysis, but it won't prevent the false positives that plague science and cost business big $$.

So I started thinking about what kind of software would prevent these sorts of problems while still being accessible to a big audience. My idea is a "deterministic statistical machine" (DSM). Here is how it works: you input a data set and specify the question you are asking (is variable Y related to variable X? Can I predict Z from W?). Then, depending on your question, it uses a fixed, deterministic set of methods to analyze the data: say, regression for inference, linear discriminant analysis for prediction, and so on. The method is fixed and deterministic for each question. It also performs a pre-specified set of checks for outliers, confounders, missing data, maybe even data fudging. It generates a report with a markdown tool and then immediately publishes the result to figshare.

The advantage is that people can get their data-related questions answered using a standard tool. It does a lot of the “heavy lifting” in checking for potential problems and produces nice reports. But it is a deterministic algorithm for analysis so overfitting, fudging the analysis, etc. are harder. By publishing all reports to figshare, it makes it even harder to fudge the data. If you fiddle with the data to try to get a result you want, there will be a “multiple testing paper trail” following you around. 
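To make the idea concrete, here is a minimal sketch in R of what the core of such a machine might look like. Everything here (the function name, the particular checks, the choice of methods) is hypothetical; the point is just that the question, not the analyst, fixes the method.

```r
# Sketch of a deterministic statistical machine: the question type
# selects the method, and the same checks run on every data set.

dsm <- function(data, question = c("inference", "prediction"),
                response, predictor) {
  question <- match.arg(question)
  y <- data[[response]]
  x <- data[[predictor]]

  # Pre-specified checks, run the same way every time
  checks <- list(
    n_missing  = sum(is.na(y)) + sum(is.na(x)),
    n_outliers = sum(abs(scale(x)) > 3, na.rm = TRUE)
  )

  # One fixed method per question type: regression for inference,
  # linear discriminant analysis for prediction
  fit <- switch(question,
                inference  = lm(y ~ x),
                prediction = MASS::lda(as.factor(y) ~ x))

  # A real DSM would render a report and push it to figshare here
  list(question = question, checks = checks, fit = fit)
}

# Example: is mpg related to wt in the built-in mtcars data?
out <- dsm(mtcars, "inference", response = "mpg", predictor = "wt")
out$checks
```

A real DSM would obviously need a much richer menu of questions and diagnostics, but even this skeleton shows where the determinism lives: in the `switch` and the fixed list of checks.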

The DSM should be a web service that is easy to use. Anybody want to build it? Any suggestions for how to do it better? 


Statistics project ideas for students

Here are a few ideas that might make for interesting student projects at all levels (from high-school to graduate school). I’d welcome ideas/suggestions/additions to the list as well. All of these ideas depend on free or scraped data, which means that anyone can work on them. I’ve given a ballpark difficulty for each project to give people some idea.

Happy data crunching!

Data Collection/Synthesis

  1. Creating a webpage that explains conceptual statistical issues like randomization, margin of error, overfitting, cross-validation, concepts in data visualization, and sampling. The webpage should not use any math at all and should explain the concepts so a general audience can understand. Bonus points if you make short 30-second animated YouTube clips that explain the concepts. (Difficulty: Lowish; Effort: Highish)
  2. Building an aggregator for statistics papers across disciplines that can be the central resource for statisticians. Journals ranging from PLoS Genetics to NeuroImage now routinely publish statistical papers, but there is no one central resource that aggregates all the statistics papers published across disciplines. Such a resource would be hugely useful to statisticians. You could build it using blogging software like WordPress so articles can be tagged and pulled into your RSS reader. (Difficulty: Lowish; Effort: Mediumish)

Data Analyses

  1. Scrape the LivingSocial/Groupon sites for the daily deals and develop a prediction of how successful the deal will be based on location/price/type of deal. You could use either the RCurl R package or the XML R package to scrape the data. (Difficulty: Mediumish; Effort: Mediumish)
  2. You could use the data from your city (here are a few cities with open data) to: (a) identify the best and worst neighborhoods to live in based on different metrics like how many parks are within walking distance, crime statistics, etc. (b) identify concrete measures your city could take to improve different quality of life metrics like those described above - say where should the city put a park, or (c) see if you can predict when/where crimes will occur (like these guys did). (Difficulty: Mediumish; Effort: Highish)
  3. Download data on State of the Union speeches from here and use the tm package in R to analyze the patterns of word use over time. (Difficulty: Lowish; Effort: Lowish)
  4. Use this data set from Donors Choose to determine the characteristics that make the funding of projects more likely. You could send your results to the Donors Choose folks to help them improve the funding rate for their projects. (Difficulty: Mediumish; Effort: Mediumish)
  5. Which basketball player would you want on your team? Here is a really simple analysis done by Rafa. But it doesn’t take into account things like defense. If you want to take on this project, you should take a look at this Denis Rodman analysis which is the gold standard. (Difficulty: Mediumish; Effort: Highish).
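The word-use-over-time idea in project 3 can be sketched in a few lines of base R (the tm package's Corpus/TermDocumentMatrix machinery does the same thing at scale). The speech texts and years below are made up for illustration:

```r
# Count a word's appearances in each speech and track it over time.
word_trend <- function(texts, years, word) {
  counts <- vapply(texts, function(txt) {
    tokens <- unlist(strsplit(tolower(txt), "[^a-z']+"))
    sum(tokens == word)
  }, numeric(1), USE.NAMES = FALSE)
  data.frame(year = years, count = counts)
}

# Invented mini-corpus standing in for the real speech transcripts
speeches <- c("The economy is growing and the economy is strong",
              "We face war and the economy still struggles")
word_trend(speeches, c(1998, 2003), "economy")
```

From there it is a short step to plotting trends for several words at once, which is where the interesting patterns would show up.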

Data visualization

  1. Creating an R package that wraps the SVGAnnotation package. This package can be used to create dynamic graphics in R, but it is still a bit too flexible for most people to use. Writing some wrapper functions that simplify the interface could be high impact. Maybe something like svgPlot() to create simple, dynamic graphics with only a few options. (Difficulty: Mediumish; Effort: Mediumish)
  2. The same as project 1 but for D3.js. The impact could potentially be a bit higher, since the graphics are a bit more professional, but the level of difficulty and effort would also both be higher. (Difficulty: Highish; Effort: Highish)

Why in-person education isn't dead yet...but a statistician could finish it off

A growing trend in education is to put lectures online, for free. The Khan Academy, Stanford's recent AI course, and Gary King's new quantitative government course at Harvard are three of the more prominent examples. This new pedagogical format is more democratic, is free, and helps people learn at their own pace. It has led some, including us here at Simply Statistics, to suggest that the future of graduate education lies in online courses, or to forecast the end of in-class lectures.

All this excitement led John Cook to ask, “What do colleges sell?”. The answers he suggested were: (1) real credentials, like a degree, (2) motivation to ensure you did the work, and (3) feedback to tell you how you are doing. As John suggests, online lectures really only target motivated and self-starting learners. For graduate students, this may work (maybe), but for the vast majority of undergrads or high-school students, self-guided learning won’t work due to a lack of motivation. 

I would suggest that until the feedback, assessment, and credentialing problems have been solved, online lectures are still more edu-tainment than education.

Of these problems, I think we are closest to solving the feedback problem, with online quizzes and tests to go with online lectures. What we haven't solved are assessment and credentialing. The reason is that there is no good system for verifying that a person taking a quiz/test online is who they say they are. This issue has two consequences: (1) it is difficult to require that a person do online quizzes/tests like we do with in-class quizzes/tests, and (2) it is difficult to believe credentials given to people who take courses online.

What does this have to do with statistics? Well, what we need is a Completely Automated Online Test for Student Identity (COATSI). People will notice a similarity between my acronym and the acronym for CAPTCHAs, the simple online Turing tests used to prove that you are a human and not a computer.

A COATSI needs to:

  1. Be completely automated
  2. Provide tests that verify the identity of the student being assessed
  3. Work throughout an online quiz/test/assessment
  4. Be simple and easy to solve

I can’t think of a deterministic system that can be used for this purpose. My suspicion is that a COATSI will need to be statistical. For example, one idea is to have people sign in with Facebook, then at random intervals while they are solving problems, they have to identify their friends by name. If they do this quickly/consistently enough, they are verified as the person taking the test. 
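A back-of-the-envelope calculation suggests why a statistical test like this could work. Suppose each challenge shows one photo and k candidate names, and the test asks m challenges; these parameters are invented for illustration:

```r
# Probability that a random guesser passes all m challenges,
# each with k equally likely name choices
coatsi_far <- function(k, m) (1 / k)^m

coatsi_far(k = 5, m = 3)   # 0.008, i.e. 1 in 125
coatsi_far(k = 5, m = 5)   # 1 in 3125
```

Even a handful of cheap challenges drives the guessing probability down fast. The harder statistical problem is the one in the text: using response speed and consistency to separate the real student from an impostor who can see their Facebook friends.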

I don’t have a good solution to this problem yet; I’d love to hear more suggestions. I also think this seems like a potentially hugely important and very challenging problem for a motivated grad student or postdoc….


An R function to map your Twitter Followers

I wrote a little function to make a personalized map of who follows you or who you follow on Twitter. The idea for this function was inspired by some plots I discussed in a previous post. I also found a lot of really useful code over at FlowingData here.

The function uses the packages twitteR, maps, geosphere, and RColorBrewer. If you don’t have the packages installed, when you source the twitterMap code, it will try to install them for you. The code also requires you to have a working internet connection. 

One word of warning is that if you have a large number of followers or people you follow, you may be rate limited by Twitter and unable to make the plot.

To make your personalized twitter map, first source the function:

> source("http://biostat.jhsph.edu/~jleek/code/twitterMap.R")

The function has the following form: 

twitterMap <- function(userName, userLocation=NULL, fileName="twitterMap.pdf", nMax=1000, plotType=c("followers","both","following"))

with arguments:

  • userName - the twitter username you want to plot
  • userLocation - an optional argument giving the location of the user, necessary when the location information you have provided Twitter isn’t sufficient for us to find latitude/longitude data
  • fileName - the file where you want the plot to appear
  • nMax - the maximum number of followers/following to get from Twitter; this is implemented to avoid rate limiting for people with large numbers of followers.
  • plotType - if “both” both followers/following are plotted, etc. 

Then you can create a plot with both followers/following like so: 

> twitterMap("simplystats")

Here is what the resulting plot looks like for our Twitter Account:

If your location can't be found or latitude/longitude can't be calculated, you may have to choose a bigger city near you. The list of cities used by twitterMap can be found like so:



> library(maps)
> data(world.cities)
> grep("Baltimore", world.cities[,1])

If your city is in the database, this will return the row number of the world.cities data frame corresponding to your city. 

If you like this function you may also like our function to determine if you are a data scientist or to analyze your Google Scholar citations page.

Update: The bulk of the heavy lifting done by these functions is performed by Jeff Gentry’s very nice twitteR package and code put together by Nathan Yau over at FlowingData. This is really an example of standing on the shoulders of giants.

Citizen science makes statistical literacy critical

In today's Wall Street Journal, Amy Marcus has a piece on the citizen science movement, focusing on citizen science in health in particular. I am fully in support of this enthusiasm and a big fan of citizen science, if done properly. There have already been some pretty big success stories. As more companies like Fitbit and 23andMe spring up, it is really easy to collect data about yourself (right, Chris?). At the same time, organizations like PatientsLikeMe make it possible for people with specific diseases or experiences to self-organize.

But the thing that struck me most in reading the article is the importance of statistical literacy for citizen scientists, reporters, and anyone reading these articles. For example, the article says:

The questions that most people have about their DNA—such as what health risks they face and how to prevent them—aren’t always in sync with the approach taken by pharmaceutical and academic researchers, who don’t usually share any potentially life-saving findings with the patients.

I think it's pretty unlikely that any organization would hide life-saving findings from the public. My impression from reading the article is that this statement refers to keeping results blinded from patients/doctors during an experiment or clinical trial. Blinding is a critical component of clinical trials that reduces many potential sources of bias in the results of a study. Obviously, once the trial/study has ended (or been stopped early because a treatment is effective), the results are quickly disseminated.

Several key statistical issues are then raised in bullet-point form without discussion: 

Amateurs may not collect data rigorously, they say, and may draw conclusions from sample sizes that are too small to yield statistically reliable results. 

Having individuals collect their own data poses other issues. Patients may enter data only when they are motivated, or feeling well, rendering the data useless. In traditional studies, both doctors and patients are typically kept blind as to who is getting a drug and who is taking a placebo, so as not to skew how either group perceives the patients’ progress.

The article goes on to describe an anecdotal example of citizen science - which suffers from a key statistical problem (small sample size):

Last year, Ms. Swan helped to run a small trial to test what type of vitamin B people with a certain gene should take to lower their levels of homocysteine, an amino acid connected to heart-disease risk. (The gene affects the body’s ability to metabolize B vitamins.)

Seven people—one in Japan and six, including herself, in her local area—paid around $300 each to buy two forms of vitamin B and Centrum, which they took in two-week periods followed by two-week “wash-out” periods with no vitamins at all.

The article points out the issue:

The scientists clapped politely at the end of Ms. Swan’s presentation, but during the question-and-answer session, one stood up and said that the data was not statistically significant—and it could be harmful if patients built their own regimens based on the results.

But the article doesn't carefully explain the importance of sample size, suggesting instead that the only reason you need more people is to "insure better accuracy".

It strikes me that statistical literacy is critical if the citizen science movement is going to go forward. Ideas like experimental design, randomization, blinding, placebos, and sample size need to be in the toolbox of any practicing citizen scientist. 

One major drawback is that there are very few places where the general public can learn about statistics. Mostly, statistics is taught in university courses. Resources like the Khan Academy and the Cartoon Guide to Statistics exist, but they are only really useful if you are self-motivated and have some idea of math/statistics to begin with.

Since knowledge of basic statistical concepts is quickly becoming indispensable for citizen science or even basic life choices like deciding on healthcare options, do we need “adult statistical literacy courses”? These courses could focus on the basics of experimental design and how to understand results in stories about science in the popular press. It feels like it might be time to add a basic understanding of statistics and data to reading/writing/arithmetic as critical life skills. I’m not the only one who thinks so.


An R function to analyze your Google Scholar Citations page

Google Scholar has now made Google Scholar Citations profiles available to anyone. You can read about these profiles and set one up for yourself here.

I asked John Muschelli and Andrew Jaffe to write me a function that would download my Google Scholar Citations data so I could play with it. Then they got all crazy on it and wrote a couple of really neat functions. All cool/interesting components of these functions are their ideas and any bugs were introduced by me when I was trying to fiddle with the code at the end.  

So how does it work? Here is the code. You can source the functions like so:


This will install the following packages if you don’t have them: wordcloud, tm, sendmailR, RColorBrewer. Then you need to find the URL of a Google Scholar Citations page. Here is Rafa Irizarry’s:


You can then call the googleCite function like this:

out = googleCite("http://scholar.google.com/citations?user=nFW-2Q8AAAAJ&hl=en", pdfname="rafa_wordcloud.pdf")

or search by name like this:

out = searchCite("Rafa Irizarry", pdfname="rafa_wordcloud.pdf")

The function will download all of Rafa’s citation data and put it in the matrix out. It will also make wordclouds of (a) the co-authors on his papers and (b) the titles of his papers and save them in the pdf file specified (There is an option to turn off plotting if you want). Here is what Rafa’s clouds look like: 

We have also written a little function to calculate many of the popular citation indices. You can call it on the output like so:


When you download citation data, an email with the data table will also be sent to Simply Statistics so we can collect information on who is using the function and perform population-level analyses. 

If you liked this function, you might also be interested in our R function to determine if you are a data scientist, or in some of the other stuff going on over at Simply Statistics.



An R function to determine if you are a data scientist

“Data scientist” is one of the buzzwords in the running for rebranding applied statistics mixed with some computing. David Champagne, over at Revolution Analytics, described the skills of a data scientist with a Venn diagram. Just for fun, I wrote a little R function for determining where you land on the data science Venn diagram. Here is an example of a plot the function makes, using the Simply Statistics bloggers as examples.

The code can be found here. You will need the png and klaR R packages to run the script. You also need to either download the file datascience.png or be connected to the internet. 

Here is the function definition:

dataScientist(names=c("D. Scientist"), skills=matrix(rep(1/3,3), nrow=1), addSS=TRUE, just=NULL)

  • names = a character vector of the names of the people to plot
  • skills = a matrix with one row for each person you are plotting; the first column corresponds to "hacking", the second column is "substantive expertise", and the third column is "math and statistics knowledge"
  • addSS = if TRUE, will add the blog authors to the plot
  • just = whether to write the name on the right or the left of the point; just="left" prints on the left and just="right" prints on the right. If just=NULL, then all names print to the right.

So how do you define your skills? Here is how it works:

If you are an academic

You calculate your skills by adding papers in journals. The classification scheme is the following:

  • Hacking = sum of papers in journals that are primarily dedicated to software/computation/methods for very specific problems. Examples are: Bioinformatics, Journal of Statistical Software, IEEE Computing in Science and Engineering, or a software article in Genome Biology.
  • Substantive  = sum of papers in journals that primarily publish scientific results such as JAMA, New England Journal of Medicine, Cell, Sleep, Circulation
  • Math and Statistics = sum of papers in primarily statistical journals including Biostatistics, Biometrics, JASA, JRSSB, Annals of Statistics

Some journals are general, like Nature, Science, the Nature sub-journals, PNAS, and PLoS One. For papers in those journals, assess which of the areas the paper falls in by determining the main contribution of the paper in terms of the non-academic classification below. 

If you are a non-academic

Since papers aren’t involved, determine the percent of your time you spend on the following things:

  • Hacking = downloading/transferring data, cleaning data, writing software, combining previously used software
  • Substantive = time you spend learning about the scientific problem, discussing with scientists, working in the lab/field.
  • Math and Statistics = time you spend formalizing a problem in mathematical notation, developing new mathematical/statistical theory, or developing general methods.
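Putting the two together, here is a quick sketch of how you might turn a non-academic time split into the skills matrix the function expects. The time split is invented, and calling dataScientist() assumes you have sourced the script above:

```r
# Hypothetical non-academic time split (hours per week on each activity)
hours <- c(hacking = 20, substantive = 10, math_stats = 10)

# dataScientist() expects a matrix with one row per person and columns in
# the order: hacking, substantive expertise, math and statistics knowledge
skills <- matrix(hours / sum(hours), nrow = 1)
round(skills, 2)

# After sourcing the script:
# dataScientist(names = c("J. Doe"), skills = skills)
```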



Getting email responses from busy people

I've had the good fortune of working with some really smart and successful people during my career. One problem a young person faces in working with really successful people is that they get a ton of email. Some only see the subject lines on their phone before deleting them.

I’ve picked up a few tricks for getting email responses from important/successful people:  

The SI Rules

  1. Try to send no more than one email a day. 
  2. Emails should be 3 sentences or less. Better if you can get the whole email in the subject line. 
  3. If you need information, ask yes or no questions whenever possible. Never ask a question that requires a full sentence response.
  4. When something is time sensitive, state the action you will take if you don’t get a response by a time you specify. 
  5. Be as specific as you can while conforming to the length requirements. 
  6. Bonus: include obvious keywords people can use to search for your email. 

Anecdotally, SI emails have a 10-fold higher response probability. The rules are designed around the fact that busy people who get lots of email love checking things off their list. SI emails are easy to check off! That will make them happy and get you a response. 

It takes more work on your end when writing an SI email. You often need to think more carefully about what to ask, how to phrase it succinctly, and how to minimize the number of emails you write. A surprising side effect of applying SI principles is that I often figure out answers to my questions on my own. I have to decide which questions to include in my SI emails and they have to be yes/no answers, so I end up taking care of simple questions on my own. 

Here are examples of SI emails just to get you started: 

Example 1

Subject: Is my response to reviewer 2 ok with you?

Body: I’ve attached the paper/responses to referees.

Example 2

Subject: Can you send my letter of recommendation to john.doe@someplace.com?


Keywords = recommendation, Jeff, John Doe.

Example 3

Subject: I revised the draft to include your suggestions about simulations and language

Revisions attached. Let me know if you have any problems, otherwise I’ll submit Monday at 2pm. 


The Killer App for Peer Review

A little while ago, over at Genomes Unzipped, Joe Pickrell asked, "Why publish science in peer reviewed journals?" He points out the flaws with the current peer review system and suggests how we can do better. What's missing from his suggestions, though, is the killer app for peer review.

Well, PLoS has now developed an API where you can easily access tons of data on the papers published in its journals, including downloads, citations, number of social bookmarks, and mentions in major science blogs. Along with Mendeley, a free reference manager, they have launched a competition to build cool apps with their free data.

It seems like, with the right statistical analysis and cool features, a recommender system for, say, PLoS ONE could have most of the features suggested by Joe in his article. One idea would be an RSS feed modeled on the Pandora music service: you input a couple of papers you like from the journal, and it creates an RSS feed with papers similar to those papers.
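A crude sketch of the core similarity step behind such a "Pandora for papers": score candidate papers by cosine similarity of word counts to a seed paper the reader liked. The titles below are invented, and a real system would use abstracts, citations, and the PLoS API metadata rather than titles alone:

```r
# Invented paper titles; "seed" is the paper the reader says they like
titles <- c(seed = "bayesian methods for gene expression analysis",
            a    = "gene expression clustering with mixture models",
            b    = "survey of deep sea coral habitats")

# Word counts for one document
term_counts <- function(x) table(unlist(strsplit(tolower(x), "[^a-z]+")))

# Cosine similarity between two named count vectors
cosine <- function(p, q) {
  terms <- union(names(p), names(q))
  u <- as.numeric(p[terms]); u[is.na(u)] <- 0
  v <- as.numeric(q[terms]); v[is.na(v)] <- 0
  sum(u * v) / sqrt(sum(u^2) * sum(v^2))
}

counts <- lapply(titles, term_counts)
sims <- vapply(counts[-1], cosine, numeric(1), q = counts[["seed"]])
sort(sims, decreasing = TRUE)  # papers sharing words with the seed rank first
```

The feed would then be the top of this ranking, recomputed as new papers are published.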