Amanda Palmer broke Twitter yesterday with her insurance poll. She started off just talking about how hard it is for musicians who rarely have health insurance, but then wandered into polling territory. She sent out a request for people to respond with the following information: quick twitter poll. 1) COUNTRY?! 2) profession? 3) insured? 4) if not, why not, if so, at what cost per month (or covered by job)?
As Roger pointed out, the most recent batch of Y Combinator startups included a bunch of data-focused companies. One of these companies, StatWing, is a web-based tool for data analysis that looks like an improvement on SPSS, with more plain text, more visualization, and a lot of the technical statistical details kept “under the hood”. I first read about StatWing on TechCrunch, under the title “How Statwing Makes It Easier To Ask Questions About Data So You Don’t Have To Hire a Statistical Wizard”.
Here are a few ideas that might make for interesting student projects at all levels (from high-school to graduate school). I’d welcome ideas/suggestions/additions to the list as well. All of these ideas depend on free or scraped data, which means that anyone can work on them. I’ve given a ballpark difficulty for each project to give people some idea. Happy data crunching! Data Collection/Synthesis Creating a webpage that explains conceptual statistical issues like randomization, margin of error, overfitting, cross-validation, concepts in data visualization, sampling.
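To make one of those concepts concrete, here is a minimal sketch of the margin of error for a poll proportion. It is not taken from any of the projects. The function name `margin_of_error` and the default `z = 1.96` (roughly a 95% interval) are my own choices, and it assumes a simple random sample large enough for the normal approximation.

```python
import math


def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion.

    Uses the normal approximation z * sqrt(p_hat * (1 - p_hat) / n).
    Assumes a simple random sample; z = 1.96 is roughly a 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must be a proportion between 0 and 1")
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)


if __name__ == "__main__":
    # A hypothetical poll: 1,000 respondents, half answering "yes".
    moe = margin_of_error(0.5, 1000)
    print(f"margin of error: +/- {100 * moe:.1f} percentage points")
```

The worst case is at `p_hat = 0.5`, which is why polls of about 1,000 people are usually quoted with a margin of roughly 3 percentage points.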
A growing trend in education is to put lectures online, for free. The Khan Academy, Stanford’s recent AI course, and Gary King’s new quantitative government course at Harvard are three of the more prominent examples. This new pedagogical format is more democratic, is free, and helps people learn at their own pace. It has led some, including us here at Simply Statistics, to suggest that the future of graduate education lies in online courses.
I wrote a little function to make a personalized map of who follows you or who you follow on Twitter. The idea for this function was inspired by some plots I discussed in a previous post. I also found a lot of really useful code over at FlowingData here. The function uses the packages twitteR, maps, geosphere, and RColorBrewer. If you don’t have the packages installed, when you source the twitterMap code, it will try to install them for you.
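Maps like this are usually drawn by connecting pairs of locations with great-circle arcs, which is the sort of geometry the geosphere package handles in R. The sketch below is not the twitterMap code. It is a standalone Python illustration of the underlying math, and the names `haversine_km` and `great_circle_points`, along with the spherical-Earth radius, are my own assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; assumes a spherical Earth


def _central_angle(phi1, lam1, phi2, lam2):
    """Central angle in radians between two points given in radians."""
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin((lam2 - lam1) / 2) ** 2)
    return 2 * math.asin(math.sqrt(min(1.0, a)))


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, lam1, phi2, lam2 = map(math.radians, (lat1, lon1, lat2, lon2))
    return EARTH_RADIUS_KM * _central_angle(phi1, lam1, phi2, lam2)


def great_circle_points(lat1, lon1, lat2, lon2, n=50):
    """Return n (lat, lon) points in degrees along the great-circle arc.

    Endpoints are included. Antipodal pairs are not handled, because the
    arc between them is not unique.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    phi1, lam1, phi2, lam2 = map(math.radians, (lat1, lon1, lat2, lon2))
    d = _central_angle(phi1, lam1, phi2, lam2)
    if d == 0.0:
        return [(lat1, lon1)] * n
    if abs(math.sin(d)) < 1e-12:
        raise ValueError("antipodal points: the great-circle arc is not unique")
    points = []
    for i in range(n):
        f = i / (n - 1)  # fraction of the way from the start to the end
        a = math.sin((1 - f) * d) / math.sin(d)
        b = math.sin(f * d) / math.sin(d)
        # Interpolate in 3D Cartesian coordinates on the unit sphere.
        x = a * math.cos(phi1) * math.cos(lam1) + b * math.cos(phi2) * math.cos(lam2)
        y = a * math.cos(phi1) * math.sin(lam1) + b * math.cos(phi2) * math.sin(lam2)
        z = a * math.sin(phi1) + b * math.sin(phi2)
        lat = math.atan2(z, math.hypot(x, y))
        lon = math.atan2(y, x)
        points.append((math.degrees(lat), math.degrees(lon)))
    return points


if __name__ == "__main__":
    # Illustrative coordinates only: roughly Baltimore and roughly London.
    arc = great_circle_points(39.29, -76.61, 51.51, -0.13, n=5)
    for lat, lon in arc:
        print(f"{lat:8.3f} {lon:9.3f}")
```

Plotting each list of points as a line over a world map gives the curved connections you see in these follower maps.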
In today’s Wall Street Journal, Amy Marcus has a piece on the Citizen Science movement, focusing on citizen science in health in particular. I am fully in support of this enthusiasm and a big fan of citizen science - if done properly. There have already been some pretty big success stories. As more companies like Fitbit and 23andMe spring up, it is really easy to collect data about yourself (right Chris?
Google Scholar has now made Google Scholar Citations profiles available to anyone. You can read about these profiles and set one up for yourself here. I asked John Muschelli and Andrew Jaffe to write me a function that would download my Google Scholar Citations data so I could play with it. Then they got all crazy on it and wrote a couple of really neat functions. All cool/interesting components of these functions are their ideas and any bugs were introduced by me when I was trying to fiddle with the code at the end.
“Data scientist” is one of the buzzwords in the running for rebranding applied statistics mixed with some computing. David Champagne, over at Revolution Analytics, described the skills for being a data scientist with a Venn Diagram. Just for fun, I wrote a little R function for determining where you land on the data science Venn Diagram. Here is an example of a plot the function makes using the Simply Statistics bloggers as examples.
I’ve had the good fortune of working with some really smart and successful people during my career. As a young person, one problem with working with really successful people is that they get a _ton_ of email. Some only see the subject lines on their phone before deleting them. I’ve picked up a few tricks for getting email responses from important/successful people: The SI Rules Try to send no more than one email a day.
A little while ago, over at Genomes Unzipped, Joe Pickrell asked, “Why publish science in peer reviewed journals?” He points out the flaws with the current peer review system and suggests how we can do better. What he suggests is missing is the killer app for peer review. Well, PLoS has now developed an API, where you can easily access tons of data on the papers published in those journals including downloads, citations, number of social bookmarks, and mentions in major science blogs.
It looks like four major private health insurance companies will be releasing data for use by academic researchers. They will create a non-profit institute called the Health Care Cost Institute and deposit the data there. Researchers can request the data from the institute by (I’m guessing) writing a short proposal. Health insurance billing claims data might not sound all that exciting, but they are a gold mine of very interesting information about population health.
Thanks to Hilary Parker for pointing out Google Fusion Tables. The coolest thing here, from my self-centered spatial statistics point of view, is that it automatically geocodes locations for you. So you can upload a spreadsheet of addresses and it will map them for you on Google Maps. Unfortunately, there doesn’t seem to be an easy way to extract the latitude/longitude values, but I’m hoping that’s just a quick hack away….
Okay, this is not really about pre-cog, but just a pointer to some data that might be of interest to people. A number of cities post their crime data online, ready for scraping and data analysis. For example, the Baltimore Sun has a Google map of homicides in the city of Baltimore. There’s also some data for Oakland. Looking at the map is fun, but not particularly useful from a data analysis standpoint.