Privacy as a function of sample size

Jeff Leek

The U.S. Supreme Court just made a unanimous ruling in Riley v. California making it clear that police officers must get a warrant before searching through the contents of a cell phone obtained incident to an arrest. The message was put pretty clearly in the decision:

 Our answer to the question of what police must do before searching a cell phone seized incident to an arrest is accordingly simple — get a warrant.

But I was more fascinated by this quote:

The sum of an individual’s private life can be reconstructed through a thousand photographs labeled with dates, locations, and descriptions; the same cannot be said of a photograph or two of loved ones tucked into a wallet.

So n = 2 is not enough to recreate a private life, but n = 1,000 (with associated annotation) is enough. I wonder what the minimum sample size needed is to officially violate someone’s privacy. I’d be curious to get Cathy O’Neil’s opinion on that question; she seems to have thought very hard about the relationship between data and privacy.

This is another case where I think that, to some extent, the Supreme Court made a decision on the basis of a statistical concept. Last time it was correlation; this time it is inference. As I read the opinion, part of the argument hinged on how much information you get by searching a cell phone versus a wallet. Importantly, how much can you infer from those two sets of data?

If any of the Supremes want a primer in statistics, I’m available.