10
Jan

## The landscape of data analysis

I have been getting some questions via email, LinkedIn, and Twitter about the content of the Data Analysis class I will be teaching for Coursera. Data Analysis and Data Science mean different things to different people. So I made a video describing how Data Analysis fits into the landscape of other quantitative classes here:

Here is the corresponding presentation. I also made a tentative list of topics we will cover, subject to change at the instructor's whim. Here it is:

• The structure of a data analysis (steps in the process, knowing when to quit, etc.)
• Types of data (census, designed studies, randomized trials)
• Types of data analysis questions (exploratory, inferential, predictive, etc.)
• How to write up a data analysis (compositional style, reproducibility, etc.)
• Plotting data for exploratory purposes (boxplots, scatterplots, etc.)
• Exploratory statistical models (clustering)
• Statistical models for inference (linear models, basic confidence intervals/hypothesis testing)
• Basic model checking (primarily visually)
• The prediction process
• Study design for prediction
• Cross-validation
• A couple of simple prediction models
• Basics of simulation for evaluating models
• Ways you can fool yourself and how to avoid them (confounding, multiple testing, etc.)
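
To make the cross-validation and simulation items above a little more concrete, here is a minimal sketch in Python using only the standard library. It is an illustration, not course material: the data-generating model (y = 2x + 1 plus noise), the choice of 5 folds, and the helper names `k_fold_indices`, `fit_line`, and `cv_mse` are all assumptions made for the example.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def cv_mse(xs, ys, k=5, seed=0):
    """Average held-out mean squared error across k folds."""
    errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit on the training folds only, then score on the held-out fold.
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        squared = [(ys[i] - (intercept + slope * xs[i])) ** 2
                   for i in held_out]
        errors.append(statistics.fmean(squared))
    return statistics.fmean(errors)


# Simulated data (assumed for illustration): y = 2x + 1 plus noise with sd 1.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

print("5-fold cross-validated MSE:", cv_mse(xs, ys))
```

Because the data are simulated with a known noise level, the cross-validated error can be compared with the truth (a noise variance of 1), which is the basic idea behind using simulation to evaluate a model.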

Of course, that is a ton of material for 8 weeks, so we will be covering just the very basics. I think it is really important to remember that being a good Data Analyst is like being a good surgeon or writer. There is no such thing as a prodigy in surgery or writing, because it requires long experience, trying lots of things out, and learning from mistakes. I hope to give people the basic information they need to get started and point to resources where they can learn more. I also hope to give them a chance to practice some of the basics a couple of times and to learn that in data analysis the first goal is to "do no harm".

18
Dec

## The value of re-analysis

I just saw this really nice post over on John Cook's blog. He talks about how it is a valuable exercise to re-type code for examples you find in a book or on a blog. I completely agree that this is a good way to learn through osmosis, learn about debugging, and often pick up the reasons for particular coding tricks (this is how I learned about vectorized calculations in Matlab, by re-typing and running my advisor's code back in my youth).

In a more statistical version of this idea, Gary King has proposed reproducing the analysis in a published paper as a way to get a paper of your own. You can figure out the parts that a person did well and the parts that you would do differently, maybe finding enough insight to come up with your own new paper. But I think this level of replication actually involves two levels of thinking:

1. Can you actually reproduce the code used to perform the analysis?
2. Can you solve the "paper as puzzle" exercise proposed by Ethan Perlstein over at his site? Given the results in the paper, can you come up with the story?

Both of these things require a bit more "higher level thinking" than just re-running the analysis if you have the code. But I think even the seemingly "low-level" task of just retyping and running the code that is used to perform a data analysis can be very enlightening. The problem is that this code, in many cases, does not exist. But that is starting to change. If you check out RPubs or RunMyCode or even the right parts of Figshare, you can find data analyses you can run through and reproduce.

The only downside is that there is currently no measure of quality for these published analyses. It would be great if people could focus their time on re-typing only good data analyses, rather than ones chosen at random. Or, as a guy once (almost) said, "Data analysis practice doesn't make perfect, perfect data analysis practice makes perfect."

09
Dec

## Sunday data/statistics link roundup (12/9/12)

1. Some interesting data/data visualizations about working conditions in the apparel industry. Here is the full report. Whenever I see reports like this, I wish the raw data were more clearly linked. I want to be able to get in, play with the data, and see if I notice something that doesn't appear in the infographics.
2. This is an awesome plain-language discussion of how a bunch of methods (CS and Stats) with fancy names relate to each other. It shows that CS/Machine Learning/Stats are converging in many ways and there isn't much new under the sun. On the other hand, I think the really exciting thing here is to use these methods on new questions, once people drop the stick.
3. If you are a reader of this blog and somehow do not read anything else on the internet, you will have missed Hadley Wickham's Rcpp tutorial. In my mind, this pretty much seals it: Julia isn't going to overtake R anytime soon. In other news, Hadley is coming to visit JHSPH Biostats this week! I'm psyched to meet him.
4. For those of us that live in Baltimore, this interesting set of data visualizations lets you in on the crime hotspots. This is a much fancier/more thorough analysis than Rafa and I did way back when.
5. Check out the new easy stats tool from the Census (via Hilary M.) and read our interview with Tom Louis who is heading over there to the Census to do cool things.
6. Watch out, some TEDx talks may be pseudoscience! More later this week on the politicization/glamourization of science, so stay tuned.

18
Jun

## Pro Tips for Grad Students in Statistics/Biostatistics (Part 1)

I just finished teaching a Ph.D. level applied statistical methods course here at Hopkins. As part of the course, I gave one “pro-tip” a day; something I wish I had learned in graduate school that has helped me in becoming a practicing applied statistician. Here are the first three, more to come soon.
1. A major component of being a researcher is knowing what’s going on in the research community. Set up an RSS feed with journal articles. Google Reader is a good one, but there are others. Here are some good applied stat journals: Biostatistics, Biometrics, Annals of Applied Statistics…
2. Reproducible research is a hot topic, in part because of a couple of high-profile papers that were disastrously non-reproducible (see “Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology”). When you write code for statistical analysis, try to make sure that: (a) it is neat and well-commented (liberal and specific comments are your friend); (b) it can be run by someone other than you to produce the same results that you report.
3. In data analysis - particularly for complex high-dimensional data - it is frequently better to choose simple models for clearly defined parameters. With a lot of data, there is a strong temptation to go overboard with statistically complicated models; the danger of overfitting/over-interpreting is extreme. The most reproducible results are often produced by sensible and statistically “simple” analyses (Note: being sensible and simple does not always lead to higher profile results).
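
As a rough illustration of the overfitting point in tip 3, and of the "same results for someone else" point in tip 2, here is a small Python sketch using only the standard library. Everything in it is an assumption made for the example: the simulated data (y = 2x plus noise), the fixed seed, and the use of a one-nearest-neighbor rule as a stand-in for an overly flexible model. It is not the analysis from any course or paper.

```python
import random
import statistics


def simulate(n, rng, noise_sd=1.0):
    """Draw n points from the assumed truth y = 2x + Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Simple model: least-squares line, returned as a prediction function."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_nearest_neighbor(xs, ys):
    """Flexible model: predict the y of the closest training x (memorizes noise)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared prediction error of a model on a data set."""
    return statistics.fmean((y - model(x)) ** 2 for x, y in zip(xs, ys))


# A fixed seed means anyone re-running this script gets the same numbers.
rng = random.Random(2012)
train_x, train_y = simulate(200, rng)
test_x, test_y = simulate(200, rng)

line = fit_line(train_x, train_y)
nn = fit_nearest_neighbor(train_x, train_y)

for name, model in [("line", line), ("nearest neighbor", nn)]:
    print(name,
          "train MSE:", round(mse(model, train_x, train_y), 3),
          "test MSE:", round(mse(model, test_x, test_y), 3))
```

The nearest-neighbor rule fits the training data perfectly because it has memorized the noise, and that is why it tends to do worse than the simple line on fresh data.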
20
May

## Sunday data/statistics link roundup (5/20)

It’s grant season around here so I’ll be brief:
1. I love this article in the WSJ about the crisis at JP Morgan. The key point it highlights is that looking only at the high-level analysis and summaries can be misleading; you have to look at the raw data to see the potential problems. As data become more complex, I think it’s critical we stay in touch with the raw data, regardless of discipline. At least if I miss something in the raw data I don’t lose a couple billion. Spotted by Leonid K.
2. On the other hand, this article in the Times drives me a little bonkers. It makes it sound like there is one mathematical model that will solve the obesity epidemic. Lines like this are ridiculous: “Because to do this experimentally would take years. You could find out much more quickly if you did the math.” The obesity epidemic is due to a complex interplay of cultural, sociological, economic, and policy factors. The idea that you could “figure it out” with a set of simple equations is laughable. If you check out their model, it is clearly not the answer to the obesity epidemic. Just another example of why statistics is not math. If you don’t want to hopelessly oversimplify the problem, you need careful data collection, analysis, and interpretation. For a broader look at this problem, check out this article on Science vs. PR. Via Andrew J.
3. Some cool applications of the raster package in R. This kind of thing is fun for student projects because analyzing images leads to results that are easy to interpret/visualize.
4. Check out John C.’s really fascinating post on determining when a white-collar worker is great. Inspired by Roger’s post on knowing when someone is good at data analysis.