Simply Statistics: The 80/20 rule of statistical methods development

Developing statistical methods is hard and often frustrating work. One of the underappreciated rules in statistical methods development is what I call the 80/20 rule (maybe it could even be the 90/10 rule). The basic idea is that the first reasonable thing you can do to a set of data often is 80% of the way to the optimal solution. Everything after that is working on getting the last 20%. (Edit: Rafa points out that once again I’ve reverse-scooped a bunch of people and this is already a thing that has been pointed out many times. See for example the Pareto principle and this post also called the 80:20 rule.)

Sometimes that extra 20% is really important and sometimes it isn’t. In a clinical trial, where each additional patient may cost a large amount of money to recruit and enroll, it is definitely worth the effort. For more exploratory techniques, like those often used when analyzing high-dimensional data, it may not be. This is particularly true because the extra 20% usually comes at a cost of additional assumptions about the way the world works. If your assumptions are right, you get the 20%; if they are wrong, you may lose, and it isn’t always clear how much.

Here is a very simple example of the 80/20 rule from frequentist statistics - in my experience similar ideas hold in machine learning and Bayesian inference as well. Suppose that I collect some observations $X_1, \ldots, X_n$ and want to test whether the mean of the observations is greater than 0. Suppose I know that the data are normal and that the variance is equal to 1. Then the absolute best statistical test (called the uniformly most powerful test) you could do rejects the hypothesis that the mean is zero if $\sqrt{n}\,\bar{X} > z_{1-\alpha}$, where $\bar{X}$ is the sample mean and $z_{1-\alpha}$ is the $1-\alpha$ quantile of the standard normal distribution. Alternatively, you could use the sign test, which tests the hypothesis that $\Pr(X_i > 0) = 0.5$ versus the alternative that the probability is greater than 0.5. Or you could use the one-sided t-test. Or you could use the Wilcoxon test. These are suboptimal if you know the data are Normal with variance one.

I tried each of these tests with a sample of size $n$ at significance level $\alpha$. In the plot below I show the ratio of power between each non-optimal test and the optimal z-test (you could do this theoretically but I’m lazy so did it with simulation, code here, colors by RSkittleBrewer).

The tests get to 80% of the power of the z-test for different sizes of the true mean (0.6 for Wilcoxon, 0.5 for the t-test, and 0.85 for the sign test). Overall, these methods very quickly catch up to the optimal method.
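For two of these tests the power ratio can also be worked out exactly rather than simulated. The sketch below does that for the sign test against the z-test, using only the Python standard library. It is not the code behind the plot: the function names are made up for illustration, and the sample size and level (`n = 20`, `alpha = 0.05`) are assumed values, not necessarily the ones used above. The t-test and the Wilcoxon test are left out because their exact power takes more machinery.

```python
from math import comb, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, variance 1


def z_test_power(mu, n, alpha):
    """Exact power of the one-sided z-test for X_i ~ N(mu, 1).

    The test rejects when sqrt(n) * mean(x) > z_{1 - alpha}.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf(z_crit - sqrt(n) * mu)


def binom_upper_tail(k, n, p):
    """P(S >= k) for S ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def sign_test_power(mu, n, alpha):
    """Exact power of the one-sided sign test for X_i ~ N(mu, 1).

    The test counts S = #{i : x_i > 0} and rejects when S >= k_crit.
    """
    # Smallest cutoff whose tail probability under the null does not exceed alpha.
    k_crit = next(k for k in range(n + 2) if binom_upper_tail(k, n, 0.5) <= alpha)
    p_positive = STD_NORMAL.cdf(mu)  # P(X_i > 0) when X_i ~ N(mu, 1)
    return binom_upper_tail(k_crit, n, p_positive)


# Assumed settings for illustration only.
n, alpha = 20, 0.05
for mu in (0.25, 0.5, 0.75, 1.0):
    ratio = sign_test_power(mu, n, alpha) / z_test_power(mu, n, alpha)
    print(f"true mean = {mu:.2f}   sign/z power ratio = {ratio:.2f}")
```

Because the sign test statistic is discrete, its actual size is a bit below `alpha`, so part of the gap at small true means comes from that conservativeness and not only from throwing away the magnitudes of the observations.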

In this case, the non-optimal methods aren’t much easier to implement than the optimal solution. But in many cases, the optimal method requires significantly more computation, memory, assumptions, theory, or some combination of the four. The hard part of deciding whether to create a new method is judging whether the 20% is worth it. This is obviously application specific.

An important corollary of the 80/20 rule is that you can have a huge impact on new technologies if you are the first to suggest an already known 80% solution. For example, the first person to suggest hierarchical clustering or the singular value decomposition for a new high-dimensional data type will often get a large number of citations. But that is a hard way to make a living - you aren’t the only person who knows about these methods and the person who says it first soaks up a huge fraction of the credit. So the only way to take advantage of this corollary is to spend your time constantly trying to figure out what the next big technology will be. And you know what they say about prediction being hard, especially about the future.