Simply Statistics


Joe Blitzstein's free online stat course helps put a critical satellite in orbit

As loyal readers know, we are very enthusiastic about MOOCs. One of the main reasons for this is the potential of teaching Statistics to students from all over the world, in particular those who can't afford or don't have access to college. However, it turns out that rocket scientists can also benefit. Check out the feedback Joe Blitzstein, professor of one of the most popular online stat courses, received from one of his students:

As an “old bubba” aerospace engineer I watched your Stat 110 class and enjoyed it very much. It sure blew out a lot of cobwebs that had collected over the past 35 years working as an aerospace engineer. As you might guess, we deal with a lot of probability. Just recently I was involved in a study to see what a blocked Reaction Control System (RCS) might do to a satellite… I am a Spacecraft Attitude Control systems engineer and it was my job to simulate what would happen if a certain RCS engine was plugged. It was a weird problem and it inspired me to watch your class… Fortunately, the statistics showed that the RCS nozzles that could get plugged would have a low probability and would not affect our ability to adjust the vehicle’s orbit. And we launched it this past summer and everything went perfect! So I just wanted to tell you that when you teach your “kiddos” tell them that Stat 110 has real life implications. This satellite is a critical national defense asset that saves the lives of our soldiers on the ground.

I doubt "Old Bubba" has time to go back to school to refresh his stats knowledge... but thanks to Joe's online class, he no longer needs to. This is yet another advantage MOOCs offer: giving busy professionals a practical way to learn new skills or brush up on specific topics.


Sunday data/statistics link roundup (12/9/12)

  1. Some interesting data/data visualizations about working conditions in the apparel industry. Here is the full report. Whenever I see reports like this, I wish the raw data were more clearly linked. I want to be able to get in, play with the data, and see if I notice something that doesn't appear in the infographics. 
  2. This is an awesome plain-language discussion of how a bunch of methods (CS and Stats) with fancy names relate to each other. It shows that CS/Machine Learning/Stats are converging in many ways and there isn't much new under the sun. On the other hand, I think the really exciting thing here is to use these methods on new questions, once people drop the stick.
  3. If you are a reader of this blog and somehow do not read anything else on the internet, you will have missed Hadley Wickham's Rcpp tutorial. In my mind, this pretty much seals it: Julia isn't going to overtake R anytime soon. In other news, Hadley is coming to visit JHSPH Biostats this week! I'm psyched to meet him.
  4. For those of us who live in Baltimore, this interesting set of data visualizations lets you in on the crime hotspots. This is a much fancier/more thorough analysis than Rafa and I did way back when.
  5. Check out the new easy stats tool from the Census (via Hilary M.) and read our interview with Tom Louis, who is heading over to the Census to do cool things.
  6. Watch out, some TEDx talks may be pseudoscience! More later this week on the politicization/glamourization of science, so stay tuned.

"Dropping the Stick" in Data Analysis

When I was a kid growing up in rough-and-tumble suburban New York, one of the major summer activities was roller hockey, the kind with roller blades (remember roller blades?). My friends and I would be playing in some random parking lot and undoubtedly one of us would be just blowing it the whole game. This would usually lead to an impromptu intervention where the person screwing up (often me) would be told by everyone else on the team to "drop the stick". The idea was you should stop playing, clear your head, skate around for a bit, and not try to do 20 things at once.

I don't play much hockey now, but I do a bit more data analysis. Strangely, little has changed.

People come to me at various stages of data analysis. Close collaborators usually come to me with no data because they are planning a study and need some help. In those cases, I'm involved in the beginning and know how the data are generated. Usually, in those cases I analyze the data in the end so there's less confusion.

Others usually come to me with data in hand wanting to know what they should do now that they've got all this data. Often there's confusion about where to start, what method to use, what program, what procedure, what function, what test, Bayesian or frequentist, mean or median, R or Stata, random effects or fixed effects, cat or dog, mice or men, etc. That's usually the point where I tell them to "drop the stick", or the data analysis version of that, which is "What question are you trying to answer?"

Usually, people know what question they're trying to answer--they just forgot to tell me. But I'm always amazed at how this question can often be the subject of the entire discussion. We might end up answering a question the investigator hadn't thought of yet, maybe a question that's better suited to the data.

So, job #1 if you're a statistician: Get more people to drop the stick. You'll make everyone play better in the end.


Email is a to-do list made by other people - can someone make it more efficient?!

This is a follow-up to one of our most popular posts: getting email responses from busy people. This post had been in the drafts for a few weeks, then this morning I saw this quote in our Twitter feed:

Your email inbox is a to-do list created by other people (via)

This is 100% true of my work email and I have to say, because of the way those emails are organized - as conversations rather than a prioritized, organized to-do list - I end up missing really important things or getting to them too late. This is happening to me with enough frequency that I feel like I'm starting to cause serious problems for people.

So I am begging someone with way better skills than me to produce software that replaces gmail in the following ways. It is a to-do list that I can allow people to add tasks to. The software shows me the following types of messages.

  1. We have an appointment at x time on y date to discuss z. Next to this message is a checkbox. If I click “ok” it gets added to my calendar; if I click “no” then a message gets sent to the person who scheduled the meeting saying I’m unavailable.
  2. A multiple choice question where they input the categories of answer I can give and I just pick one, and it sends them the response.
  3. A request to be added as a person who can assign me tasks with a yes/no answer.
  4. A longer request email - this has three entry fields: (1) what do you want? (2) when do you want it by? and (3) a yes/no checkbox asking if I’m willing to perform the task. If I say yes, it gets added to my calendar with automated reminders.
  5. It should interface with all the systems that send me reminder emails to organize the reminders.
  6. You can assign quotas to people, where they can only submit a certain number of tasks per month.
  7. It allows you to re-assign tasks to other people so when I am not the right person to ask, I can quickly move the task on to the right person.
  8. It would collect data and generate automated reports for me about what kind of tasks I'm usually forgetting/being late on and what times of day I'm bad about responding so that I could improve my response times.

The software would automatically reorganize events/to-dos to reflect changing deadlines/priorities, etc. This piece of software would revolutionize my life. Any takers?
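For anyone tempted to take this on, here is a minimal Python sketch of what the core of such a tool might look like. Everything in it is hypothetical: the names `Task` and `TodoInbox`, the per-sender monthly quota rule, and the "sort by due date" notion of priority are my assumptions for illustration, not an existing product or API. It only covers approved senders (item 3), structured requests (item 4), quotas (item 6), and re-assignment (item 7).

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Task:
    sender: str
    what: str                        # (1) what do you want?
    due: date                        # (2) when do you want it by?
    submitted: date                  # day the request arrived
    accepted: Optional[bool] = None  # (3) yes/no; None until I answer
    assignee: str = "me"


@dataclass
class TodoInbox:
    # Hypothetical: approved senders mapped to their tasks-per-month quota.
    quotas: dict = field(default_factory=dict)
    tasks: list = field(default_factory=list)

    def submit(self, task: Task) -> bool:
        """Add a task unless the sender is unapproved or over quota."""
        if task.sender not in self.quotas:
            return False  # sender has not been approved to assign tasks
        month = (task.submitted.year, task.submitted.month)
        used = sum(
            1 for t in self.tasks
            if t.sender == task.sender
            and (t.submitted.year, t.submitted.month) == month
        )
        if used >= self.quotas[task.sender]:
            return False  # monthly quota already used up
        self.tasks.append(task)
        return True

    def reassign(self, task: Task, new_assignee: str) -> None:
        """Move a task on to the right person."""
        task.assignee = new_assignee

    def prioritized(self, owner: str = "me") -> list:
        """Tasks still on the owner's plate (not declined), earliest due first."""
        mine = [t for t in self.tasks
                if t.assignee == owner and t.accepted is not False]
        return sorted(mine, key=lambda t: t.due)
```

Calling `prioritized()` after every change gives the "automatically reorganize" behavior in its crudest form: the list is re-sorted whenever a deadline moves or a task is handed off. A real version would need the calendar hooks, reminders, and reporting described above.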


Advice for students on the academic job market (2013 edition)

Job hunting season is upon us. Those on the job market should be sending in applications already. Here I provide links to some of the related posts we published last year.


Data analysis acquisition "worst deal ever"?

A little over a year ago I mentioned that data analysis companies were getting gobbled up by larger technology companies. In particular, HP bought Autonomy, a British data analysis company, for about $11 billion. (By the way, can anyone tell me if it's still called Hewlett-Packard, or is it just "HP", like "AT&T"?) From an article a year ago:

Autonomy, with headquarters in Cambridge, England, helps companies and governments store, process, search and analyze large electronic data sets. Its specialty lies in its sophisticated algorithms, which can make sense of unstructured information.

At the time, the thinking was HP had overpaid (especially given HP's recent high price for 3Par) but the deal went through anyway. Now, HP has discovered accounting problems at Autonomy and is writing down $8.8 billion.


James Stewart of the New York Times claims this is worse than the failed AOL-Time Warner merger (although the absolute numbers involved here are smaller). With 3 CEOs in 2 years, it seems HP just can't get anything right these days. But what intrigues me most is the question of what companies like Autonomy are worth and the possibility that HP made a huge mistake in the valuation of this company. Of course, if there was fraud at Autonomy (as it seems to be alleged), then all bets are off. But if not, then perhaps this is the first bubble popping in the realm of data analysis companies more generally?


Sunday data/statistics link roundup (12/2/12)

  1. An interview with Anthony Goldbloom, CEO of Kaggle. I'm not sure I'd agree with the characterization that all data scientists are creative, curious, and competitive, and certainly those characteristics aren't unique to data scientists. And I didn't know this: "We have 65,000 data scientists signed up to Kaggle, and just like with golf tournaments, we have them all ranked from 1 to 65,000."
  2. Check it out, art with R! It's actually pretty interesting to see how they use statistical algorithms to generate different artistic styles. Here are some more. 
  3. Now that Ethan Perlstein's crowdfunding experiment was successful, other people are getting on the bandwagon. If you want to find out what kind of bacteria you have in your gut, for example, you could check out this.
  4. I thought I had it rough, but apparently some data analysts spend all their time developing algorithms to detect penis drawings!
  5. Roger was on Anderson Cooper 360 as part of the Building America segment. We can't find the video, but here is the transcript. 
  6. An interesting article on the half-life of facts. I think the analogy is an interesting one and certainly there is research to be done there. But I think it jumps the shark a bit when they start talking about how the moon landing was predictable, etc. I completely believe in the retrospective analysis of knowledge, but predicting things is pretty hard, especially when it is the future.