On the scalability of statistical procedures: why the p-value bashers just don't get it.

Executive Summary

  1. The problem is not p-values it is a fundamental shortage of data analytic skill.
  2. In general it makes sense to reduce researcher degrees of freedom for non-experts, but any choice of statistic, when used by many untrained people, will be flawed.
  3. The long term solution is to require training in both statistics and data analysis for anyone who uses data but particularly journal editors, reviewers, and scientists in molecular biology, medicine, physics, economics, and astronomy.
  4. The Johns Hopkins Specialization in Data Science runs every month and can be easily integrated into any program. Other, more specialized, online courses and short courses make it possible to round this training out in ways that are appropriate for each discipline.

Scalability of Statistical Procedures

The P-value is in the news again. Nature came out with a piece talking about how scientists are naive about the use of P-values among other things. P-values have known flaws which have been regularly discussed. If you want to see some criticisms just Google "NHST". Despite their flaws, from a practical perspective it is and oversimplification to point to the use of P-values as the critical flaw in scientific practice. The problem is not that people use P-values poorly it is that the vast majority of data analysis is not performed by people properly trained to perform data analysis. 

Data are now abundant in nearly every discipline from astrophysics, to biology, to the social sciences, and even in qualitative disciplines like literature. By scientific standards, the growth of data came on at a breakneck pace. Over a period of about 40 years we went from a scenario where data was measured in bytes to terabytes in almost every discipline. Training programs haven't adapted to this new era. This is particularly true in genomics where within one generation we went from a data poor environment to a data rich environment. Many of the people in positions of authority were trained before data were widely available and used.

The result is that the vast majority of people performing statistical and data analysis are people with only one or two statistics classes and little formal data analytic training under their belt. Many of these scientists would happily work with a statistician, but as any applied statistician at a research university will tell you, it is impossible to keep up with the demand from our scientific colleagues. Everyone is collecting major data sets or analyzing public data sets; there just aren't enough hours in the day.

Since most people performing data analysis are not statisticians there is a lot of room for error in the application of statistical methods. This error is magnified enormously when naive analysts are given too many "researcher degrees of freedom". If a naive analyst can pick any of a range of methods and does not understand how they work, they will generally pick the one that gives them maximum benefit.

The short-term solution is to find a balance between researcher degrees of freedom and "recipe book" style approaches that require a specific method to be applied. In general, for naive analysts, it makes sense to lean toward less flexible methods that have been shown to work across a range of settings. The key idea here is to evaluate methods in the hands of naive users and see which ones work best most frequently, an idea we have previously called "evidence based data analysis".

An incredible success story of evidence based data analysis in genomics is the use of the limma package for differential expression analysis of microarray data. Limma can be beat in certain specific scenarios, but it is robust to such a wide number of study designs, sample sizes, and data types that the choice to use something other than limma should only be exercised by experts.

The trouble with criticizing p-values without an alternative

P-values are an obvious target of wrath by people who don't do day to day statistical analysis because the P-value is the most successful statistical procedure ever invented. If every person who used a P-value cited the inventor, P-values would have, very conservatively3 million citations. That's an insane amount of use for one statistic.

Why would such a terrible statistic be used by so many people? The reason is that it is critical that we have some measure of uncertainty we can assign to data analytic results. Without such a measure, the only way to determine if results are real or not is to rely on people's intuition, which is a notoriously unreliable metric when uncertainty is involved. It is pretty clear science would be much worse off if we decided if results were reliable based on peoples' gut feeling about the data.

P-values can and are misinterpreted, misused, and abused both by naive analysts and by statisticians. Sometimes these problems are due to statistical naiveté, sometimes they are due to wishful thinking and career pressure, and sometimes they are malicious. The reason is that P-values are complicated and require training to understand.

Critics of the P-value argue in favor of a large number of the procedures to be used in place of P-values. But when considering the scale at which the methods must be used to address the demands of the current data rich world, many alternatives would result in similar flaws. This is in no way proves the use of P-values is a good idea, but it does prove that coming up with an alternative is hard. Here are a few potential alternatives.

  1. Methods should only be chosen and applied by true data analytic experts. Pros: This is the best case scenario. Cons: Impossible to implement broadly given the level of statistical and data analytic expertise in the community 
  2. The full prior, likelihood and posterior should be detailed and complete sensitivity analysis should be performed. Pros: In cases where this can be done this provides much more information about the model and uncertainty being considered. Cons: The model requires more advanced statistical expertise, is computationally much more demanding, and can not be applied in problems where model based approaches have not been developed. Yes/no decisions about credibility of results still come down to picking a threshold or allowing more researcher degrees of freedom.
  3. A direct Bayesian approach should be used reporting credible intervals and Bayes estimators. Pros: In cases where the model can be fit, can be used by non-experts, provides scientific measures of uncertainty like confidence intervals. Cons: The prior allows a large number of degrees of freedom when not used by an expert, sensitivity analysis is required to determine the effect of the prior, many more complex models can not be implemented, results are still sample size dependent.
  4. Replace P-values with likelihood ratios. Pros: In cases where it is available may reduce some of the conceptual difficulty with the null hypothesis. Cons: Likelihood ratios can usually only be computed exactly for cases with few or no nuisance parameters, likelihood ratios run into trouble for complex alternatives, they are still sample size dependent, the a likelihood ratio threshold is equivalent to a p-value threshold in many cases.
  5. We should use Confidence Intervals exclusively in the place of p-values.  Pros: A measure and variability on the scale of interest will be reported. We can evaluate effect sizes on a scientific scale.  Cons: Confidence intervals are still sample size dependent and can be misleading for large samples, significance levels can be chosen to make intervals artificially wide/small, if used as a decision making tool there is a one-to-one mapping between a confidence interval and a p-value threshold.
  6. We should use Bayes Factors instead of p-values. Pros: They can compare the evidence (loosely defined) for both the null and alternative. They can incorporate prior information. Cons: Priors provide researcher degrees of freedom, cutoffs may still lead to false/true positives, BF's still depend on sample size.

This is not to say that many of these methods have advantages over P-values. But at scale any of these methods will be prone to abuse, misinterpretation and error. For example, none of them by default deals with multiple testing. Reducing researcher degrees of freedom is good when dealing with a lack of training, but the consequence is potential for mistakes and all of these methods would be ferociously criticized if used as frequently as p-values.

The difference between data analysis and statistics

Many disciplines including medicine and molecular biology usually require an introductory statistics or machine learning class during their program. This is a great start, but is not sufficient for the modern data saturated era. The introductory statistics or machine learning class is enough to teach someone the language of data analysis, but not how to use it. For example, you learn about the t-statistic and how to calculate it. You may also learn the asymptotic properties of the statistic. But you rarely learn about what happens to the t-statistic when there is an unmeasured confounder. You also don't learn how to handle non iid data, sample mixups, reproducibility, most of scripting, etc.

It is therefore critical that if you plan to use or understand data analysis you take both the introductory course and at least one data analysis course. The data analysis course should cover study design, more general data analytic reasoning, non-iid data, biased sampling, basics of non-parametrics, training vs test sets, prediction error, sources of likely problems in data sets (like sample mixups), and reproducibility. These are the concepts that appear regularly when analyzing real data that don't usually appear in the first course in statistics that most medical and molecular biology professionals see. There are awesome statistical educators who are trying hard to bring more of this into the introductory stats world, but it is just too much to cram into one class.

What should we do

The thing that is the most frustrating about the frequent and loud criticisms of P-values is that they usually point out what is wrong with P-values, but don't suggest what we should do about it.  When they do make suggestions, they frequently ignore the fundamental problems:

  1. Statistics are complicated and require careful training to understand properly. This is true regardless of the choice of statistic, philosophy, or algorithm.
  2. Data is incredibly abundant in all disciplines and shows no sign of slowing down.
  3. There is a fundamental shortage of training in statistics and data analysis 
  4. Giving untrained analysts extra researcher degrees of freedom is dangerous.

The most direct solution to this problem is increased training in statistics and data analysis. Every major or program in a discipline that regularly analyzes data (molecular biology, medicine, finance, economics, astrophysics, etc.) should require at minimum an introductory statistics class and a data analysis class. If the expertise doesn't exist to create these sorts of courses there are options. For example, we have introduced a series of 9 courses that run every month that cover most of the basic topics that are common across disciplines.



I think of particular interest given the NIH Director's recent comments on reproducibility is our course on Reproducible Research. There are also many more specialized resources that are very good and widely available that will build on the base we created with the data science specialization.

  1. For scientific software engineering/reproducibility: Software Carpentry.
  2. For data analysis in genomics: Rafa's Data Analysis for Genomics Class.
  3. For Python and computing: The Fundamentals of Computing Specialization

Enforcing education and practice in data analysis is the only way to resolve the problems that people usually attribute to P-values. In the short term, we should at minimum require all the editors of journals who regularly handle data analysis to show competency in statistics and data analysis.

Correction: After seeing Katie K.'s comment on Facebook I concur that P-values were not directly referred to as "worse than useless", so to more fairly represent the article, I have deleted that sentence.

Posted in Uncategorized | 19 Comments

loess explained in a GIF

Local regression (loess) is one of the statistical procedures I most use. Here is a movie showing how it works


Posted in Uncategorized | 6 Comments

Monday data/statistics link roundup (2/10/14)

I'm going to try Monday's for the links. Let me know what you think.

  1. The Guardian is reading our blog. A week after Rafa posts that everyone should learn to code for career preparedness, the Guardian gets on the bandwagon.
  2. Nature Methods published a paper on a webtool for creating boxplots (via Simina B.). The nerdrage rivaled the quilt plot. I'm not opposed to papers like this being published, in fact it is an important part of making sure we don't miss out on the good software when it comes. There are two important things to keep in mind though: (a) Nature Methods grades on a heavy "innovative" curve which makes it pretty hard to publish papers there, so publishing papers like this could cause frustration among people who would submit there and (b) if you use the boxplots from using this tool you must cite the relevant software that generated the boxplot.
  3. This story about Databall (via Rafa.) is great, I love the way that it talks about statisticians as the leaders on a new data type. I also enjoyed reading the paper the story is about. One interesting thing about that paper and many of the papers at the Sloan Sports Conference is that the data are proprietary (via Chris V.) so the code/data/methods are actually not available for most papers (including this one). In the short term this isn't a big deal, the papers are fun to read. In the long term, it will dramatically slow progress. It is almost always a bad long term strategy to make data private if the goal is to maximize value.
  4. The P-value curve for fixing publication bias (via Rafa). I think it is an interesting idea, very similar to our approach for the science-wise false discovery rate. People who liked our paper will like the P-value curve paper. People who hated our paper for the uniformity under the null assumption will hate that paper for the same reason (via David S.)
  5. Hopkins discovers bones are the best (via Michael R.).
  6. Awesome scientific diagrams in tex. Some of these are ridiculous.
  7. Mary Carillo goes crazy on backyard badminton. This is awesome. If you love the Olympics and the internet, you will love this (via Hilary P.)
  8. B'more Biostats has been on a tear lately. I've been reading Leo's post on uploading files to Dropbox/Google drive from R, Mandy's post explaining quantitative MRI, Yenny's post on data sciences, John's post on graduate school open houses, and Alyssa's post on vectorization. If you like Simply Stats you should be following them on Twitter here.
Posted in Uncategorized | 1 Comment

Just a thought on peer reviewing - I can't help myself.

Today I was thinking about reviewing, probably because I was handling a couple of papers as AE and doing tasks associated with reviewing several other papers. I know that this is idle thinking, but suppose peer review was just a drop down ranking with these 6 questions.

  1. How close is this paper to your area of expertise?
  2. Does the paper appear to be technically right?
  3. Does the paper use appropriate statistics/computing?
  4. Is the paper interesting to people in your area?
  5. Is the paper interesting to a broad audience?
  6. Are the appropriate data and code available?

Each question would be rated on a 1-5 star scale. 1 stars = completely inadequate, 3 stars = acceptable, 5 stars = excellent. There would be an optional comments box that would only be used for major/interesting thoughts and anything that got above 3 stars for questions 2, 3, and 6 was published. Incidentally, you could do this for free on Github if the papers were written in markdown, that would reduce the substantial costs of open-access publishing.

No doubt peer review would happen faster this way. I was wondering, would it be any worse?



Posted in Uncategorized | 11 Comments

My Online Course Development Workflow

One of the nice things about developing 9 new courses for the JHU Data Science Specialization in a short period of time is that you get to learn all kinds of cool and interesting tools. One of the ways that we were able to push out so much content in just a few months was that we did most of the work ourselves, rather than outsourcing things like video production and editing. You could argue that this results in a poorer quality final product but (a) I disagree; and (b) even if that were true, I think the content is still valuable.

The advantage of learning all the tools was that it allowed for a quick turn-around from the creation of the lecture to the final exporting of the video (often within a single day). For a hectic schedule, it's nice to be able to write slides in the morning, record some video in between two meetings in the afternoon, and the combine/edit all the video in the evening. Then if you realize something doesn't work, you can start over the next day and have another version done in less than 24 hours.

I thought it might be helpful to someone out there to detail the workflow and tools that I use to develop the content for my online courses.

  • I use Camtasia for Mac to do all my screencasting/recording. This is a nice tool and I think has more features than your average screen recorder. That said, if you just want to record your screen on your Mac, you can actually use the built-in Quicktime software. I used to do all of my video editing in Camtasia but now it's pretty much glorified screencasting software for me.
  • For talking head type videos I use my iPhone 5S mounted on a tripod. The iPhone produces surprisingly good 1080p HD 30 fps video and is definitely sufficient for my purposes (see here for a much better example of what can be done). I attach the phone to an Apogee microphone to pick up better sound. For some of the interviews that we do on Simply Statistics I use two iPhones (A 5S and a 4S, my older phone).
  • To record my primary sound (i.e. me talking), I use the Zoom H4N portable recorder. This thing is not cheap but it records very high-quality stereo sound. I can connect it to my computer via USB or it can record to a SD card.
  • For simple sound recording (no video or screen) I use Audacity.
  • All of my lecture videos are run through Final Cut Pro X on my 15-inch MacBook Pro with Retina Display. Videos from Camtasia are exported in Apple ProRes format and then imported into Final Cut. Learning FCPX is not for the faint-of-heart if you're not used to a nonlinear editor (as I was not). I bought this excellent book to help me learn it, but I still probably only use 1% of the features. In the end using a real editor was worth it because it makes merging multiple videos much easier (i.e. multicam shots for screencasts + talking head) and editing out mistakes (e.g. typos on slides) much simpler. The editor in Camtasia is pretty good but if you have more then one camera/microphone it becomes infeasible.
  • I have an 8TB Western Digital Thunderbolt drive to store the raw video for all my classes (and some backups). I also use two 1TB Thunderbolt drives to store video for individual classes (each 4-week class borders on 1TB of raw video). These smaller drives are nice because I can just throw them in my bag and edit video at home or on the weekend if I need to.
  • Finished videos are shared with a Dropbox for Business account so that Jeff, Brian, and I can all look at each other's stuff. Videos are exported to H.264/AAC and uploaded to Coursera.
  • For developing slides, Jeff, Brian, and I have standardized around using Slidify. The beauty of using slidify is that it lets you write everything in Markdown, a super simple text format. It also make it simpler to manage all the course material in Git/GitHub because you don't have to lug around huge PowerPoint files. Everything is  a light-weight text file. And thanks to Ramnath's incredible grit and moxie, we have handy tools to easily export everything to PDF and HTML slides (HTML slides hosted via GitHub Pages).

The first courses for the Data Science Specialization start on April 7th. Don't forget to sign up!

Posted in Uncategorized | Tagged , | 7 Comments

The three tables for genomics collaborations

Collaborations between biologists and statisticians are very common in genomics. For the data analysis to be fruitful, the statistician needs to understand what samples are being analyzed. For the analysis report to make sense to the biologist, it needs to be properly annotated with information such as gene names, genomic location, etc... In a recent post, Jeff laid out his guide for such collaborations, here I describe an approach that has helped me in mine.

In many of my past collaborations, sharing the experiment's key information,  in a way that facilitates data analysis, turned out to be more time consuming than the analysis itself. One reason is that the data producers annotated samples in ways that was imposible to decipher without direct knowledge of the experiment (e.g using lab specific codes in the filenames, or colors in excel files).  In the early days of microarrays, a group of researchers noticed this problem and created a markup language to describe and communicate information about  experiments electronically.

The Bioconductor project took a less ambitious  approach and created classes specifically designed to store the minimal information needed to perform an analysis. These classes can be thought of as three tables, stored as flat text files, all of which are relatively easy to create for biologists.

The first table contains the experimental data with rows representing features (e.g. genes) and the columns representing samples.

The second table contains the sample information. This table contains a row for each column in the experimental data table. This table contains at least two columns. The first contains an identifier that can be used to match the rows of this table to the columns of the first table. The second contains the main outcome of interest, e.g. case or control, cancer or normal. Other commonly included columns are the filename of the original raw data associated with each row, the date the experiment was processed, and other information about the samples.

The third table contains the feature information. This table contains a row for each row in the experimental table. The table contains at least two columns. The first contains an identifier that can be used to match the rows of this table to the row of the first table. The second will contain an annotation that makes sense to biologists, e.g. a gene name. For technologies that are widely used (e.g. Affymetrix gene expression arrays) these table are readily available.

With these three relatively simple files in place less time is spent "figuring out" the data and the statisticians can focus their energy on data analysis while the biologists can focus their energy on interpreting the results. This approach seems to have been the inspiration for the MAGE-TAB format.

Note that with newer technologies, statisticians prefer to get access to the raw data. In this case, instead of an experimental data table (table 1), they will want the original raw data files. The sample information then must contain a column with the filenames so that sample annotation can be properly matched.

NB: These three tables are not a complete description of an experiment and are not intended as an alternative to standards such as MAGE and MIAME. But in many cases, they provide the very minimum information needed to carry out a rudimentary analysis. Note that Bioconductor provides tools to import information from MAGE-ML and other related formats.

Posted in Uncategorized | 1 Comment

Not teaching computing and statistics in our public schools will make upward mobility even harder

In his book Average Is Over, Tyler Cowen predicts that as automatization becomes more common, modern economies will eventually be composed of two groups: 1) a highly educated minority involved in the production of  automated services and 2) a vast majority earning very little but enough to consume some of the low-priced products created by group 1.  Not everybody will agree with this view, but we can't ignore the fact that automatization has already eliminated many middle class jobs in, for example, manufacturing and the automotive industries. New technologies, such as driverless cars and online retailers, will very likely eliminate many more jobs (e.g. drivers and retail clerks) than they create (programmers and engineers).

Computer literacy is essential for working with automatized systems. Programming and learning from data are perhaps the most useful skill for creating these systems. Yet the current default curriculum includes neither computer science nor statistics. At the same time, there are plenty of resources for motivated parents with means to get their children to learn these subjects. Kids whose parents don't have the wherewithal to take advantage of these educational resources will be at an even greater disadvantage than they are today. This disadvantage is made worse by the fact that many of the aforementioned resources are free and open to the world  (CodeacademyKhan AcademyEdX, and Coursera for example) which means that a large pool of students that previously had no access to this learning material will also be competing for group 1 jobs. If we want to level the playing field we should start by updating the public school curriculum so that, in principle, everybody has the opportunity to compete.

Posted in Uncategorized | 3 Comments

Announcing the Release of swirl 2.0

Editor's note: This post was written by Nick Carchedi, a Master's degree student in the Department of Biostatistics at Johns Hopkins. He is working with us to develop the Data Science Specialization as well as software for interactive learning of R and statistics.

Official swirl website: swirlstats.com

On September 27, 2013, I wrote a guest blog post on Simply Statistics to announce the creation of Statistics with Interactive R Learning (swirl), an R package for teaching and learning statistics and R simultaneously and interactively. Over the next several months, I received a tremendous amount of feedback from all over the world. Two things became clear: 1) there were many opportunities for improvement on the original design and 2) there's an incredible demand globally for new and better ways of learning statistics and R.

In the spirit of R and open source software, I shared the source code for swirl on GitHub. As a result, I quickly came in contact with several very talented individuals, without whom none of what I'm about to share with you would have been possible. Armed with invaluable feedback and encouragement from early adopters of swirl 1.0, my new team and I pursued a complete overhaul of the original design.

Today, I'm happy to announce the result of our efforts: swirl 2.0.

Like the first version of the software, swirl 2.0 guides students through interactive tutorials in the R console on a variety of topics related to statistics and R. The user selects from a menu of courses, each of which is broken up by topic into shorter lessons. Lessons, in turn, are a dialog between swirl and the user and are composed of text output, multiple choice and text-based questions, and (most importantly) questions that require the user to enter actual R code at the prompt. Responses are evaluated for correctness based on instructor-specified answer tests and appropriate feedback is given immediately to the user.

It's helpful to think of swirl as the synthesis of two separate parts: content and platform. Content is authored by instructors in R Markdown files. The platform is then responsible for delivering this content to the user and interpreting the user's responses in an interactive and engaging way.

Our primary focus for swirl 2.0 was to build a more robust and extensible platform for delivering content. Here's a (nontechnical) summary of new and revised features:

  • A library of answer tests an instructor can deploy to check user input for correctness
  • If stuck, a user can skip a question, causing swirl to enter the correct answer on their behalf
  • During a lesson, a user can pause instruction to play around or practice something they just learned, then use a special keyword to regain swirl's attention when ready to resume
  • swirl "sees" user input the same way R "sees" it, which allows swirl to understand the composition of a user's input on a much deeper level (thanks, Hadley)
  • User progress is saved between sessions
  • More readable output that adjusts to the width of the user's display (thanks again, Hadley)
  • Extensible framework allows others to easily extend swirl's functionality
  • Instructors can author content in a special flavor of R markdown

(For a more technical understanding of swirl's features and inner workings, we encourage readers to consult our GitHub repository.)

Although improving the platform was our first priority for this release, we've made some improvements to existing content and, more importantly, added the beginnings of a new course: Intro to R. Intro to R is our response to the overwhelming demand for a more accessible and interactive way to learn the R language. We've included the first three lessons of the course and plan to add many more over the coming months as our focus turns to creating more high quality content.

Our ultimate goal is to have the statistics and R communities use swirl as a platform to deliver their own content to students interactively. We've heard from many people who have an interest in creating their own content and we're working hard to make the process of creating content as simple and enjoyable as possible.

The goal of swirl is not to be flashy, but rather to provide the most authentic learning environment possible. We accomplish this by placing students directly on the R prompt, within the very same environment they'll use for data analysis when they are not using swirl. We hope you find swirl to be a valuable tool for learning and teaching statistics and R.

It's important to stress that, as with any new software, we expect there will be bugs. At this point, users should still consider themselves "early adopters". For bug reports, suggested enhancements, or to learn more about swirl, please visit our website.


Many people have contributed to this project, either directly or indirectly, since its inception. I will attempt to list them all here, in no particular order. I'm sincerely grateful to each and everyone one of you.

  • Bill & Gina: swirl is as much theirs as it is mine at this point. Their contributions are the only reason the project has evolved so much since the release of swirl 1.0.
  • Brian: Challenged me to turn my idea for swirl into a working prototype. Coined the "swirl" acronym. swirl would still be an idea in my head without his encouragement.
  • Jeff: Pushes me to think big picture and provides endless encouragement. Reminds me that a great platform is worthless without great content.
  • Roger: Encouraged me to separate platform and content, a key paradigm that allowed swirl to mature from a messy prototype to something of real value. Introduced me to Git and GitHub.
  • Lauren & Ethan: Helped with development of the earliest instructional content.
  • Ramnath: Provided a model for content authoring via slidify "flavor" of R Markdown.
  • Hadley: Made key suggestions for improvement and provided an important proof of concept. His work has had a profound influence on swirl's development.
  • Peter: Our discussions led to a better understanding of some key ideas behind swirl 2.0.
  • Sally & Liz: Beta testers and victims of my endless rants during stats tutoring sessions.
  • Kelly: Most talented graphic designer I know and mastermind behind the swirl logo. First line of defense against bad ideas, poor design, and crappy websites. Visit her website.
  • Mom & Dad: Beta testers and my #1 fans overall.
Posted in Uncategorized | Leave a comment

Marie Curie says stop hating on quilt plots already.

"There are sadistic scientists who hurry to hunt down error instead of establishing the truth." -Marie Curie (http://en.wikiquote.org/wiki/Marie_Curie)

Thanks to Kasper H. for that quote. I think it is a perfect fit for today's culture of academic put down as academic contribution. One perfect example is the explosion of hate against the quilt plot. A quilt plot is a heatmap with several parameters selected in advance; that's it. This simplification of R's heatmap function appeared in the journal PLoS One. They say (though not up front and not clearly enough for my personal taste) that they know it is just a heatmap.

Over the course of the next several weeks quilt plots went viral. Here are a few example tweets. It was also widely trashed on people's blogs and even in the scientist. So I did an experiment. I built a table of frequencies in R like this and applied the heatmap function in R, then the quilt.plot function in fields, then the function written by the authors of the paper with as minimal tweeking as possible.

x = matrix(rbinom(25,size=4,prob=0.5),nrow=5)
pt = prop.table(x)

Here are the results:







It is clear that out of the box and with no tinkering, the new plot makes something nicer/more interpretable. The columns/rows are where I expect and the scale is there and nicely labeled. Everyone who has ever made heatmaps in R has some bit of code that looks like this:


To hack together a heatmap in R that looks like you expect. It is a total pain. Obviously the quilt plot paper has a few flaws:

  1. It tries to introduce the quilt plot as a new idea.
  2. It doesn't just come out and say it is a hack of the heatmap function, but tries to dance around it.
  3. It produces code, but only as images in word files. I had to retype the code to make my plot.

That being said here are a couple of other true things about the paper:

  1. The code works if you type it out and apply it.
  2. They produced code.
  3. The paper is open access.
  4. The paper is correct technically.
  5. The hack is useful for users with few R skills.

So why exactly isn't it a paper? It smacks of academic elitism to claim that this isn't good enough because it isn't a "new idea". Not every paper discovers radium. Some papers are better than others and that is ok. I think the quilt plot being published isn't a problem, maybe I don't like the way it is written exactly, but they do acknowledge the heat map, they do produce correct, relevant code, and it does solve a problem people actually have. That is better than a lot of papers that appear in more prestigious journals. Arsenic life anyone?

I think it is useful to have a forum where people can post correct, useful, but not necessarily ground breaking results and get credit for them, even if the credit is modest. Otherwise we might miss out on useful bits of code. Frank Harrell has a bunch of functions that tons of people use but he doesn't get citations, you probably have heard of the Hmisc package if you use R.

But did you know Karl Broman has a bunch of really useful functions in his personal R package, qqline2 is great. I know Rafa has a bunch of functions he has never published because they seem "too trivial" but I use them all the time. Every scientist who touches code has a personal library like this. I'm not saying the quilt plot is in that category. But I am saying that it is stupid not to have a public forum for making these functions available to other scientists. But that won't happen if the "quilt plot backlash" is what people see when they try to get published credit for simple code that solves real problems.

Hacks like the quilt plot can help people who aren't comfortable with R write reproducible scripts without having to figure out every plotting parameter. Keeping in mind that the vast majority of data analysis is not done by statisticians, it seems like these little hacks are an important part of science. If you believe in figshare, github, open science, and shareable code, you shouldn't be making fun of the quilt plotters.

Marie Curie says so.

Posted in Uncategorized | 6 Comments

The Johns Hopkins Data Science Specialization on Coursera

We are very proud to announce the the Johns Hopkins Data Science Specialization on Coursera. You can see the official announcement from the Coursera folks here. This is the main reason Simply Statistics has been a little quiet lately.

The three of us (Brian Caffo, Roger Peng, and Jeff Leek) along with a couple of incredibly hard working graduate students (Nick Carchedi of swirl fame and Sean Kross) have put together nine new one-month classes to run on the Coursera platform. The classes are:

  1. The Data Scientist's Toolbox  - A basic introduction to data and data science and a  basic guide to R/Rstudio/Github/Command Line Interface.
  2. R Programming  - Introduction to R programming, from installing R to types, to functions, to control structures.
  3. Getting and Cleaning Data - An introduction to getting data from the web, from images, from APIs, and from databases. The course also covers how to go from raw data to tidy data.
  4. Exploratory Data Analysis - This course covers plotting in base graphics, lattice, ggplot2 and clustering and other exploratory techniques. It also covers how to think about exploring data you haven't seen.
  5. Reproducible Research  - This is one of the unique courses to our sequence. It covers how to think about reproducible research, evidence based data analysis, reproducible research checklists and knitr, markdown, R markdown, etc.
  6. Statistical Inference  - This course covers the fundamentals of statistical inference from a practical perspective. The course covers both the technical details and important ideas like confounding.
  7. Regression Models  - This course covers the fundamentals of linear and generalized linear regression modeling. It also serves as an introduction to how to "think about" relating variables to each other quantitatively.
  8. Practical Machine Learning  - This course will cover the basic conceptual ideas in machine learning like in/out of sample errors, cross validation, and training and test sets. It will also cover a range of machine learning algorithms and their practical implementation.
  9. Developing Data Products  - This course will cover how to develop tools for communicating data, methods, and analyses with other people. It will cover building R packages, Shiny, and Slidify, among other things.

There will also be a specialization project - consisting of a 10th class where students will work on projects conducted with industry, government, and academic partners.

The classes represent some of the content we have previously covered in our popular Coursera classes and a ton of brand new content for this specialization. Here are some things that I think make our program stand out:

  • We will roll out 3 classes at a time starting in April. Once a class is running, it will run every single month concurrently.
  • The specialization offers a bunch of unique content, particularly in the courses Getting and Cleaning Data, Reproducible Research, and Developing Data Products.
  • All of the content is being developed open source and open-access on Github. You are welcome to check it out as we develop it and contribute!
  • You can take the first 9 courses of the specialization entirely for free.
  • You can choose to pay a very modest fee to get "Signature Track" certification in every course.

I have also created a little page that summarizes some of the unique aspects of our program. Scroll through it and you'll find sharing links at the bottom. Please share with your friends, we think this is pretty cool: http://jhudatascience.org

Posted in Uncategorized | 22 Comments