Simply Statistics


The 5 Most Critical Statistical Concepts

It seems like everywhere we look, data is being generated - from politics, to biology, to publishing, to social networks. There are also diverse new computational tools, like GPGPU and cloud computing, that expand the statistical toolbox. Statistical theory is more advanced than it’s ever been, with exciting work in a range of areas.

With all the excitement going on around statistics, there is also increasing diversity. It is increasingly hard to define “statistician” since the definition ranges from very mathematical to very applied. An obvious question is: what are the most critical skills needed by statisticians? 

So just for fun, I made up my list of the top 5 most critical skills for a statistician by my own definition. They are by necessity very general (I only gave myself 5). 

  1. The ability to manipulate/organize/work with data on computers - whether it is with Excel, R, SAS, or Stata, to be a statistician you have to be able to work with data. 
  2. A knowledge of exploratory data analysis - how to make plots, how to discover patterns with visualizations, how to explore assumptions
  3. Scientific/contextual knowledge - at least enough to be able to abstract and formulate problems. This is what separates statisticians from mathematicians. 
  4. Skills to distinguish true from false patterns - whether with p-values, posterior probabilities, meaningful summary statistics, cross-validation or any other means. 
  5. The ability to communicate results to people without math skills - a key component of being a statistician is knowing how to explain math/plots/analyses.
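As a tiny illustration of point 2, here is the kind of first-pass exploration a statistician might do in base R, using the built-in airquality dataset (my example, not part of the original list):

```r
## Explore before you model: check missingness, distributions, and
## possible relationships visually.
data(airquality)                      # built-in daily air quality data
summary(airquality$Ozone)             # note the range and the NAs
hist(airquality$Ozone)                # the distribution is right-skewed
with(airquality, plot(Temp, Ozone))   # look for a pattern by eye first
```

A few minutes of plots like these often reveal skewness, outliers, or missing-data patterns that would silently bias a model fit straight away.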

What are your top 5? What order would you rank them in? Even though these are so general, I almost threw regression in there because of how often it pops up in various forms. 

Related Posts: Rafa on graduate education and What is a Statistician? Roger on “Do we really need applied statistics journals?”


Computing on the Language

And now for something a bit more esoteric….

I recently wrote a function to deal with a strange problem. Writing the function ended up being a fun challenge related to computing on the R language itself.

Here’s the problem: Write a function that takes any number of R objects as arguments and returns a list whose names are derived from the names of the R objects.

Perhaps an example provides a better description. Suppose the function is called ‘makeList’. Then the call

x <- 1
y <- 2
z <- "hello"
makeList(x, y, z)

should return

list(x = 1, y = 2, z = "hello")

It originally seemed straightforward to me, but it turned out to be very much not straightforward. 

Note that a function like this is probably most useful during interactive sessions, as opposed to programming.

I challenge you to take a whirl at writing the function, you know, in all that spare time you have. I’ll provide my solution in a future post.
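For the impatient, here is one possible approach (not necessarily the solution the future post will give): use match.call() to recover the unevaluated argument expressions and deparse them into names.

```r
makeList <- function(...) {
  vals <- list(...)
  ## match.call() captures the unevaluated call, e.g. makeList(x, y, z);
  ## dropping the first element leaves the argument expressions.
  exprs <- as.list(match.call())[-1]
  names(vals) <- sapply(exprs, deparse)
  vals
}

x <- 1
y <- 2
z <- "hello"
makeList(x, y, z)  # list(x = 1, y = 2, z = "hello")
```

Note that arbitrary expressions get deparsed too, so makeList(x + 1) would produce an element named "x + 1" - one of the quirks of computing on the language.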


Visualizing Yahoo Email

Here is a cool page where Yahoo shows you the email it is processing in real time. It includes a visualization of the most popular words in emails at a given time. It’s a pretty neat tool and definitely good for procrastination, but I’m not sure what else it is good for…



Web-scraping

The internet is the greatest source of publicly available data. One of the key skills for obtaining data from the web is “web-scraping”, where you use a piece of software to run through a website and collect information. 

This technique can be used for collecting data from databases or to collect data that is scattered across a website. Here is a very cool little exercise in web-scraping that can be used as an example of the things that are possible. 
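As a minimal sketch of the idea, the few lines below strip HTML tags and tabulate word frequencies. For clarity they run on an inline string; in practice you would feed in a page fetched with readLines(url) or, better, an HTML parser.

```r
## A toy scrape: drop the markup, split on non-letters, count words.
html  <- "<html><body><p>Data is fun. Data is everywhere.</p></body></html>"
text  <- gsub("<[^>]+>", " ", html)                      # remove tags
words <- tolower(unlist(strsplit(text, "[^[:alpha:]]+")))
words <- words[nchar(words) > 0]                          # drop empties
sort(table(words), decreasing = TRUE)                     # word counts
```

Real scraping adds the messy parts - pagination, malformed HTML, rate limits - but the core loop is exactly this: fetch, extract, tabulate.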

Related Posts: Jeff on APIs, Data Sources, Regex, and The Open Data Movement.


Archetypal Athletes

Here is a cool paper on the ArXiv about archetypal athletes. The basic idea is to look at a large number of variables for each player and identify multivariate outliers or extremes. These outliers are the archetypes talked about in the title. 

According to the author’s analysis, the best players in the NBA in 2009/2010 (for different reasons, i.e. different archetypes) were Taj Gibson, Anthony Morrow, and Kevin Durant. The best soccer players were Wayne Rooney, Lionel Messi, and Cristiano Ronaldo.
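The paper’s method is archetypal analysis, but the underlying idea of flagging multivariate extremes can be illustrated with something much simpler - base R’s Mahalanobis distance on simulated player statistics (my toy example, not the paper’s approach):

```r
## Simulate per-game stats for 50 players, plant one standout,
## and flag multivariate extremes with the Mahalanobis distance.
set.seed(1)
stats <- cbind(points   = rnorm(50, 15, 4),
               rebounds = rnorm(50, 6, 2))
stats[1, ] <- c(35, 14)                       # one "archetype"-like outlier
d2 <- mahalanobis(stats, colMeans(stats), cov(stats))
which(d2 > qchisq(0.99, df = 2))              # players flagged as extreme
```

Archetypal analysis goes further by representing every player as a convex combination of a few extreme profiles, rather than just flagging the outliers.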

Thanks to Andrew Jaffe for pointing out the article. 

Related Posts: Jeff on “Innovation and Overconfidence”, Rafa on “Once in a lifetime collapse


Graduate student data analysis inspired by a high-school teacher

I love watching TED talks. One of my absolute favorites is the talk by Dan Meyer on how math class needs a makeover. Dan also has one of the more fascinating blogs I have read. He writes about math education, primarily K-12. His posts on curriculum design, assessment, work ethic, and homework are really, really good. In fact, just go read all his author choices. You won’t regret it. 

The best quote from the talk is:

Ask yourselves, what problem have you solved, ever, that was worth solving, where you knew all of the given information in advance? Where you didn’t have a surplus of information and have to filter it out, or you didn’t have insufficient information and have to go find some?

Many of the data analyses I have done or assigned in classes have focused on problems with exactly the right information - relatively little extraneous data and nothing missing. But I have been slowly evolving these problems; as an example, here is a data analysis project we developed last year for the qualifying exam at JHU. This project is what I consider a first step toward a “less helpful” project model. 

The project was inspired by this blog post at marginal revolution which Rafa suggested. As with the homework problem Dan dissects in his talk, there are layers to this problem:

  1. Understanding the question
  2. Downloading and filtering the data
  3. Exploratory analysis
  4. Fitting models/interpreting results
  5. Synthesis and writing the results up
  6. Reproducibility of the R code

For this analysis, I was pretty specific with 1. Understanding the question:

(1) The association between enrollment and the percent of students scoring “Advanced” on the MSA in Reading and Math in the 5th grade.

(2) The change in the number of students scoring “Advanced” in Reading and Math from one year to the next (at minimum consider the change from 2009-2010) versus enrollment.

(3) Potential reasons for results like those in Table 1.  
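A sketch of what step (1) might look like in R, using simulated data in place of the actual Maryland MSA files (the variable names and effect size here are hypothetical):

```r
## Simulated version of question (1): regress the percent of 5th
## graders scoring "Advanced" on school enrollment.
set.seed(2)
enrollment   <- round(runif(100, 200, 800))               # school sizes
pct_advanced <- 20 + 0.01 * enrollment + rnorm(100, sd = 5)
plot(enrollment, pct_advanced)                            # explore first
fit <- lm(pct_advanced ~ enrollment)
summary(fit)$coefficients["enrollment", ]                 # slope, SE, p-value
```

With the real data, of course, the interesting work is in steps 2 and 3 - deciding which schools, years, and subjects to include before any model is fit.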

I didn’t, however, mention the key idea from the Marginal Revolution post. I think for a qualifying exam this level of specificity is necessary, but for an in-class project I would have removed this information so students had to “discover the question” themselves. 

I was also pretty specific with the data source, suggesting the Maryland Education Department’s website. However, several students went above and beyond and found other data sources, or downloaded more data than I suggested. In the future, I think I will leave this off too. My Google/data-finding skills don’t hold a candle to those of my students. 

Steps 3-5 were summed up with the statement: 

Your project is to analyze data from the MSA and write a short letter either in favor of or against spending money to decrease school sizes.

This is one part of the exam I’m happy with. It is sufficiently vague to let the students come to their own conclusions. It also suggests that the students should draw conclusions and support them with statistical analyses. One of the major difficulties I have struggled with in teaching this class is getting students to state a conclusion as a result of their analysis and to quantify how uncertain they are about that decision. In my mind, this is different from just the uncertainty associated with a single parameter estimate. 

It was surprising how much requiring reproducibility helped students focus their analyses. I think it was because they had to organize and collect their code, which helped them organize their analysis. Also, there was a strong correlation between reproducibility and the quality of the written reports.

Going forward I have a couple of ideas of how I would change my data analysis projects:

  1. Be less helpful - be less clear about the problem statement, data sources, etc. I definitely want students to get more practice formulating problems. 
  2. Focus on writing/synthesis - my students are typically very good at fitting models, but sometimes struggle with putting together the “story” of an analysis. 
  3. Stress much less about whether specific methods will work well on the data analyses I suggest. One of the more helpful things I think these messy problems produce is a chance to figure out what works and what doesn’t on real world problems. 

Related Posts: Rafa on the future of graduate education, Roger on applied statistics journals.


The self-assessment trap

Several months ago I was sitting next to my colleague Ben Langmead at the Genome Informatics meeting. Various talks were presented on short read alignments, and every single performance table showed the speaker’s method as #1 and Ben’s Bowtie as #2 among a crowded field of lesser methods. It was fun to make fun of Ben for getting beat every time, but the reality was that all I could conclude was that Bowtie was best and the speakers were falling into the self-assessment trap: each speaker had tweaked the assessment to make their method look best. This practice is pervasive in Statistics, where easy-to-tweak Monte Carlo simulations are commonly used to assess performance. In a recent paper, a team at IBM described how the problem is pervasive in the systems biology literature as well. Co-author Gustavo Stolovitzky is a co-developer of the DREAM challenge, in which the assessments are fixed and developers are asked to submit. About 7 years ago we developed affycomp, a comparison webtool for microarray preprocessing methods. I encourage others involved in fields where methods are constantly being compared to develop such tools. It’s a lot of work, but journals are usually friendly to papers describing the results of such competitions.
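The simulation version of the trap is easy to demonstrate. In this toy example (mine, not from the IBM paper), the same two estimators trade places depending on which noise distribution the assessor happens to simulate:

```r
## The simulation you choose can decide the "winner": the sample mean
## beats the median under normal noise, and the ranking flips under
## heavy-tailed noise.
set.seed(3)
mse <- function(est, truth) mean((est - truth)^2)
sim <- function(rdist) {
  ests <- replicate(2000, { x <- rdist(20); c(mean(x), median(x)) })
  c(mean = mse(ests[1, ], 0), median = mse(ests[2, ], 0))
}
sim(rnorm)                      # normal errors: mean has lower MSE
sim(function(n) rt(n, df = 2))  # heavy tails: median has lower MSE
```

A method developer choosing the benchmark gets to pick which of these two tables appears in the paper - which is exactly why fixed, third-party assessments like DREAM and affycomp matter.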

Related Posts:  Roger on colors in R, Jeff on battling bad science


Interview With Chris Barr

Chris Barr

Chris Barr is an assistant professor of biostatistics at the Harvard School of Public Health in Boston. He moved to Boston after getting his Ph.D. at UCLA and then doing a postdoc at Johns Hopkins Bloomberg School of Public Health. Chris has done important work in environmental biostatistics and is also the co-founder of OpenIntro, a very cool open-source (and free!) educational resource for statistics.  

 Which term applies to you: data scientist/statistician/analyst?

I’m a “statistician” by training. One day, I hope to graduate to “scientist”. The distinction, in my mind, is that a scientist can bring real insight to a tough problem, even when the circumstances take them far beyond their training.

 Statisticians get a head start on becoming scientists. Like chemists and economists and all the rest, we were trained to think hard as independent researchers. Unlike other specialists, however, we are given the opportunity, from a young age, to see all types of different problems posed from a wide range of perspectives.

How did you get into statistics/data science (e.g. your history)?

I studied economics in college, and I had planned to pursue a doctorate in the same field. One day a senior professor of statistics asked me about my future, and in response to my stated ambition, said: “Whatever an economist can do, a statistician can do better.” I started looking at graduate programs in statistics and noticed UCLA’s curriculum. It was equal parts theory, application, and computing, and that sounded like how I wanted to spend my next few years. I couldn’t have been luckier. The program and the people were fantastic.

What is the problem currently driving you?

I’m working on so many projects, it’s difficult to single out just one. Our work on smoking bans (joint with Diez, Wang, Samet, and Dominici) has been super exciting. It is a great example of how careful modeling can really make a big difference. I’m also soloing a methods paper on residual analysis for point process models that is bolstered by a simple idea from physics. When I’m not working on research, I spend as much time as I can on OpenIntro.

What is your favorite paper/idea you have had? Why?

 I get excited about a lot of the problems and ideas. I like the small teams (one, two, or three authors) that generally take on theory and methods problems; I also like the long stretches of thinking time that go along with those papers. That said, big science papers, where I get to team up with smart folks from disciplines and destinations far and wide, really get me fired up. Last, but not least, I really value the work we do on open source education and reproducible research. That work probably has the greatest potential for introducing me to people, internationally and in small local communities, that I’d never know otherwise.

Who were really good mentors to you? What were the qualities that really helped you?

Identifying key mentors is such a tough challenge, so I’ll adhere to a self-imposed constraint by picking just one: Rick Schoenberg. Rick was my doctoral advisor, and has probably had the single greatest impact on my understanding of what it means to be a scientist and colleague. I could tell you a dozen stories about the simple kindness and encouragement that Rick offered. Most importantly, Rick was positive and professional in every interaction we ever had. He was diligent, but relaxed. He offered structure and autonomy. He was all the things a student needs, and none of the things that make students want to read those xkcd comics. Now that I’m starting to make my own way, I’m grateful to Rick for his continuing friendship and collaboration.

I know you asked about mentors, but if I could mention somebody who, even though not my mentor, has taught me a ton, it would be David Diez. David was my classmate at UCLA and colleague at Harvard. We are also cofounders of OpenIntro. David is probably the hardest working person I know. He is also the most patient and clear thinking. These qualities, like Rick’s, are often hard to find in oneself and can never be too abundant.

 What is OpenIntro?

OpenIntro is part of the growing movement in open source education. Our goal, with the help of community involvement, is to improve the quality and reduce the cost of educational materials at the introductory level. Founded by two statisticians (Diez, Barr), our early activities have generated a full length textbook (OpenIntro Statistics: Diez, Barr, Cetinkaya-Rundel) that is available for free in PDF and at cost ($9.02) in paperback. People can also use the OpenIntro website to manage their course materials for free, whether they are using our book or not. The software, developed almost entirely by David Diez, makes it easy for people to post lecture notes, assignments, and other resources. Additionally, it gives people access to our online question bank and quiz utility. Last but not least, we are sponsoring a student project competition. The first round will be this semester, and interested people can visit the website for additional information. We are little fish, but with the help of our friends and involvement from the community, we hope to do a good thing.

How did you get the idea for OpenIntro?


 Regarding the book and webpage - David and I had both started writing a book on our own; David was keen on an introductory text, and I was working on one about statistical computing. We each realized that trying to solo a textbook while finishing a PhD was nearly impossible, so we teamed up. As the project began to grow, we were very lucky to be joined by Mine Cetinkaya-Rundel, who became our co-author on the text and has since played a big role in developing the kinds of teaching supplements that instructors find so useful (labs and lecture notes to name a few). Working with the people at OpenIntro has been a blast, and a bucket full of nights and weekends later, here we are!

 Regarding making everything free - David and I started the OpenIntro project during the peak of the global financial crisis. With kids going to college while their parents’ house was being foreclosed, it seemed timely to help out the best way we knew how. Three years later, as I write this, the daily news is running headline stories about the Occupy Wall Street movement featuring hard times for young people in America and around the world. Maybe “free” will always be timely.

For More Information

Check out Chris’ webpage, his really nice publications including this one on the public health benefits of cap and trade, and the OpenIntro project website. Keep your eye open for the paper on cigarette bans Chris mentions in the interview; it is sure to be good. 

Related Posts: Jeff’s interview with Daniela Witten, Rafa on the future of graduate education, Roger on colors in R.


Anthropology of the Tribe of Statisticians

From the BBC, a pretty fascinating radio piece.

…in the same way that a telescope enables you to see things that are too far away to see with the naked eye, a microscope enables you to see things that are too small to see with the naked eye, statistics enables you to see things in masses of data which are too complex for you to see with the naked eye. 


Finding good collaborators

The job of the statistician is almost entirely about collaboration. Sure, there’s theoretical work that we can do by ourselves, but most of the impact that we have on science comes from our work with scientists in other fields. Collaboration is also what makes the field of statistics so much fun.

So one question I get a lot from people is “how do you find good collaborations?” Or, put another way, how do you find good collaborators? It turns out this distinction is more important than it might seem.

My approach to developing collaborations has evolved over time and I consider myself fairly lucky to have developed a few very productive and very enjoyable collaborations. These days my strategy for finding good collaborations is to look for good collaborators. I personally find it important to work with people that I like as well as respect as scientists, because a good collaboration is going to involve a lot of personal interaction. A place like Johns Hopkins has no shortage of very intelligent and very productive researchers that are doing interesting things, but that doesn’t mean you want to work with all of them.

Here’s what I’ve been telling people lately about finding collaborations, which is a mish-mash of a lot of advice I’ve gotten over the years.

  1. Find people you can work with. I sometimes see situations where a statistician will want to work with someone because he/she is working on an important problem. Of course, you want to be working on a problem that interests you, but it’s only partly about the specific project. It’s very much about the person. If you can’t develop a strong working relationship with a collaborator, both sides will suffer. If you don’t feel comfortable asking (stupid) questions, pointing out problems, or making suggestions, then chances are the science won’t be as good as it could be. 
  2. It’s going to take some time. I sometimes half-jokingly tell people that good collaborations are what you’re left with after getting rid of all your bad ones. Part of the reasoning here is that you actually may not know what kinds of people you are most comfortable working with. So it takes time and a series of interactions to learn these things about yourself and to see what works and doesn’t work. Of course, you can’t take forever, particularly in academic settings where the tenure clock might be ticking, but you also can’t rush things either. One rule I heard once was that a collaboration is worth doing if it will likely end up with a published paper. That’s a decent rule of thumb, but see my next comment.
  3. It’s going to take some time. Developing good collaborations will usually take some time, even if you’ve found the right person. You might need to learn the science, get up to speed on the latest methods/techniques, learn the jargon, etc. So it might be a while before you can start having intelligent conversations about the subject matter. Then it takes time to understand how the key scientific questions translate to statistical problems. Then it takes time to figure out how to develop new methods to address these statistical problems. So a good collaboration is a serious long-term investment which has some risk of not working out.  There may not be a lot of papers initially, but the idea is to make the early investment so that truly excellent papers can be published later.
  4. Work with people who are getting things done. Nothing is more frustrating than collaborating on a project with someone who isn’t that interested in bringing it to a close (i.e., a published paper or completed software package). Sometimes there isn’t a strong incentive for the collaborator to finish (e.g., she/he is already tenured) and other times things just fall by the wayside. So finding a collaborator who is continuously getting things done is key. One way to determine this is to check out their CV. Is there a steady stream of productivity? Papers in good journals? Software used by lots of other people? Grants? Web site that’s not in total disrepair?
  5. You’re not like everyone else. One thing that surprised me was discovering that just because someone you know works well with a specific person doesn’t mean that you will work well with that person. This sounds obvious in retrospect, but there were a few situations where a collaborator was recommended to me by a source that I trusted completely, and yet the collaboration didn’t work out. The bottom line is to trust your mentors and friends, but realize that differences in personality and scientific interests may determine a different set of collaborators with whom you work well.

These are just a few of my thoughts on finding good collaborators. I’d be interested in hearing others’ thoughts and experiences along these lines.

Related Posts: Rafa on authorship conventions, finish and publish