The Simons program on Big Data wrapped up about a month ago with a workshop on Big Data and Differential Privacy. With the possible exception of one talk on the last day after lunch, it was a really fabulous workshop. You would've enjoyed attending even if you were neither interested in Big Data nor Differential Privacy. One reason is that the organizers aimed to address a problem that transcends both machine learning and privacy, a problem relevant to all empirical sciences. The problem is false discovery.

If you've never seen it, you should check out this XKCD comic explaining why green jelly beans cause acne. The idea is that if you evaluate enough hypotheses on the same data set, eventually you will find something significant. The availability of public data sets used by thousands of scientists greatly exacerbates the problem. An important example is the area of Genome Wide Association Studies (GWAS).

A typical study has a few hundred, or perhaps a few thousand, patient DNA sequences. Each sequence contains a few hundred thousand so-called SNPs. Scientists check each SNP for a possible correlation with a particular phenotype. Significance tests and p-values are part of the standard method. But what p-value is reasonable? In GWAS, typical p-values are as small as \({10^{-6}}\) or even \({10^{-7}.}\) They get smaller every year as the SNP resolution of the sequenced genome increases. How many SNPs are incorrectly declared significant? How do we interpret these scientific findings?
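To get a sense of the numbers, here is a tiny simulation, with made-up values for the number of SNPs and the significance threshold, of what happens when every SNP is in fact null. Under the null, p-values are uniform, so a fixed threshold \({\alpha}\) produces about \({m\alpha}\) false discoveries no matter how clean the data are.

```python
import numpy as np

rng = np.random.default_rng(0)

m = 100_000    # number of SNPs tested (made-up)
alpha = 1e-4   # per-test significance threshold (made-up)

# Under the global null, p-values are uniform on [0, 1].
p_values = rng.uniform(size=m)
false_hits = np.sum(p_values <= alpha)

print(f"{false_hits} SNPs declared significant, all of them false")
# In expectation this is m * alpha = 10 spurious SNPs.
```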

A similar discussion arose recently in the FDA-induced shutdown of 23andMe. Lior Pachter started a heated discussion on his blog about why the findings of 23andMe are unreliable. In short: too many hypotheses evaluated against a single genome. Meanwhile, Scott Aaronson asserts his right to misinterpret probabilities.

GWAS is certainly not the only example. Some think a major scientific crisis is upon us.

So, what can we do about it? A person who has spent at least two decades thinking about this problem is Yoav Benjamini. Luckily, he agreed to give a tutorial about his work at the privacy workshop, revolving around the idea of the False Discovery Rate.

False Discovery Rate

The basic setup is this. There are \({m}\) null hypotheses \({h_1,\dots,h_m}\) that we would like to test. A discovery corresponds to a rejected null hypothesis. Let \({V}\) be the random variable counting the number of false discoveries, i.e., rejected null hypotheses that are actually true. We also let \({R}\) denote the total number of rejected hypotheses, i.e., all discoveries, both true and false. The false discovery rate (FDR) is the expectation of the ratio \({V/R,}\) where we define the ratio as \({0}\) if \({R=0.}\)

The idea behind this definition is that if there are many discoveries, it might be tolerable to make some false discoveries. However, if all null hypotheses are true, all discoveries are false. So, even a single discovery brings the FDR up to \({1}\) (the largest possible value).
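To make the definition concrete, here is a small Monte Carlo sketch (my own toy setup, not something from the tutorial) that estimates the FDR when all null hypotheses are true and we naively reject at a fixed per-test threshold. As the previous paragraph suggests, the estimate comes out close to \({1.}\)

```python
import numpy as np

rng = np.random.default_rng(1)

m, alpha, trials = 1000, 0.05, 2000   # made-up parameters
ratios = []
for _ in range(trials):
    # All m null hypotheses are true, so the p-values are uniform.
    p = rng.uniform(size=m)
    R = np.sum(p <= alpha)   # total discoveries
    V = R                    # here every discovery is false
    ratios.append(V / R if R > 0 else 0.0)

print("estimated FDR:", np.mean(ratios))
# With m = 1000 tests, R > 0 in essentially every trial, so the estimate is near 1.
```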

I encourage you to check out Omer Reingold's Musings on False Discovery Rate from a Computer Scientist's point of view.

Now, suppose we want to design an experiment while controlling the FDR. That is, we want to make sure that the FDR is at most some \({\alpha< 1.}\) Benjamini and Hochberg suggested an algorithm for doing exactly that. With more than 20000 citations, it might be one of the most widely cited algorithms (a short code sketch follows the two steps below):

Let \({p_1,\dots,p_m}\) be \({p}\)-values corresponding to our \({m}\) hypotheses. Sort these \({p}\)-values in increasing order \({p_{(1)}\le \dots\le p_{(m)}.}\)

1. Find the largest \({k}\) such that \({p_{(k)} \le \frac{k}{m}\cdot \alpha}\).

2. Reject \({h_{(1)},\dots,h_{(k)}.}\)
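Here is a minimal sketch of the two steps in Python; this is just my own transcription of the procedure above, not a reference implementation.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha):
    """Return the indices of the hypotheses rejected at FDR level alpha."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                         # p_(1) <= ... <= p_(m)
    thresholds = alpha * np.arange(1, m + 1) / m  # (k/m) * alpha for k = 1..m
    below = np.nonzero(p[order] <= thresholds)[0]
    if len(below) == 0:
        return np.array([], dtype=int)            # no discoveries
    k = below[-1] + 1                             # largest k with p_(k) <= (k/m) * alpha
    return order[:k]                              # reject h_(1), ..., h_(k)
```

Note how the rejection threshold adapts to the data: the more small \({p}\)-values there are, the larger \({k}\) becomes and the more hypotheses get rejected.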

An important theorem they proved is that this procedure does indeed control the false discovery rate under certain assumptions. For example, it does so if the test statistics are independent of each other. Of course, independence is a very strong assumption, and a lot of work in statistics aims to weaken such assumptions.

My general feeling about this definition is that it is optimistic in several regards. The first is that we only talk about expectations. It is easy to come up with examples where the expectation is small, but with small yet non-negligible probability there are many false discoveries. Second, we need several fairly strong assumptions to be able to prove that such a procedure actually controls the false discovery rate. Finally, the method puts a lot of faith in careful experimental design and proper execution. In particular, to apply the methodology we need to know in advance which hypotheses we might test.

Nevertheless, this optimism in the definition leads to usable and realistic answers as to how large the sample should be in concrete experimental setups. This must have been a major factor in the success of this methodology.

Differential Privacy

How is any of this related to privacy? The short answer is that Differential Privacy by itself controls false discovery in a certain precise sense---though formally different from the above. Indeed, I will formally show below that Differential Privacy implies small "generalization error".

This is by no means an accident of the definition. There is a deeper reason for it. Intuitively, Differential Privacy ensures that the outcome of a statistical analysis does not depend on the specifics of the data set but rather on the underlying properties of the population. This is why it gives privacy. I may be able to learn from the data that smoking causes cancer, but I wouldn't be able to learn that a particular individual in the database, say Donald Draper, smokes and has cancer. The flip side is that Differential Privacy does not allow you to access the data set arbitrarily often. Instead, it attaches a cost to each operation and an overall budget that you had better not exceed. The most common complaint about Differential Privacy is that it doesn't allow you to do whatever the heck you want with the data. My response is usually that neither does statistics. A data set has limited explanatory power. Use it carelessly and you'll end up with false discovery. I think it's a feature of Differential Privacy---not a shortcoming---that it makes this fundamental fact of nature explicit.
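To make the budget metaphor concrete, here is a toy sketch built on the standard Laplace mechanism; the data, the per-query costs, and the total budget are all made up for illustration. Every counting query costs some \({\epsilon}\), and once the budget is spent the data set simply refuses to answer.

```python
import numpy as np

class PrivateDataset:
    """Toy differentially private interface to a data set."""

    def __init__(self, data, total_budget, rng=None):
        self._data = np.asarray(data)
        self._budget = total_budget
        self._rng = rng or np.random.default_rng()

    def count(self, predicate, epsilon):
        """Answer a counting query with Laplace noise of scale 1/epsilon."""
        if epsilon > self._budget:
            raise RuntimeError("privacy budget exhausted")
        self._budget -= epsilon
        true_count = np.sum(predicate(self._data))
        # A counting query has sensitivity 1, so noise of scale 1/epsilon
        # makes this single answer epsilon-differentially private.
        return true_count + self._rng.laplace(scale=1.0 / epsilon)

# A few queries against made-up data; a third query would exhaust the budget.
ages = [23, 35, 41, 52, 29, 61, 47]
db = PrivateDataset(ages, total_budget=0.5)
print(db.count(lambda x: x > 40, epsilon=0.25))
print(db.count(lambda x: x < 30, epsilon=0.25))
# print(db.count(lambda x: x > 60, epsilon=0.25))  # would raise: budget exhausted
```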

If this philosophical excursion struck you as unconvincing, let me try the formal route. This argument is folklore by the way. I learned it from Guy Rothblum a few years ago who attributed it to Frank McSherry at the time. Let me know if I'm missing the right attribution.

Suppose we want to learn a concept \({c\colon\{0,1\}^d \rightarrow \{0,1\}}\) from labeled examples \({D=\{(x_1,l_1),\dots,(x_m,l_m)\},}\) where each \({x_i}\) is drawn from some underlying distribution \({P}\) and labeled by \({l_i=c(x_i).}\)

Now, suppose we use an \({\epsilon}\)-differentially private learning algorithm \({\cal A}\) for the task. That is, \({\cal A}\) is a randomized algorithm such that changing one example has little effect on the output distribution of the algorithm. Formally, for any set of examples \({D'}\) that differs from \({D}\) in only one pair, we have for any set of output hypotheses \({H}\):

\[\Pr \{ {\cal A}(D) \in H \} \ge e^{-\epsilon}\Pr\{ {\cal A}(D') \in H \}. \]

Furthermore, assume that \({\cal A}\) is \({\alpha}\)-useful in the sense that it always outputs a hypothesis \({h}\) that has error at most \({\alpha}\) on the training examples. Here, \({\epsilon,\alpha \ll 1.}\) (By the way, I'm using the word "hypothesis" here in its learning theory sense, not in the "null hypothesis" sense.)
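For concreteness, one standard way to obtain such a learner, assuming a finite hypothesis class, is the exponential mechanism: score each hypothesis by its number of training errors and sample a hypothesis with probability decaying exponentially in that score. The sketch below is a toy rendering of this idea, not a claim about how \({\cal A}\) must be implemented.

```python
import numpy as np

def private_learner(examples, hypotheses, epsilon, rng=None):
    """Exponential mechanism over a finite hypothesis class.

    examples:   list of (x, label) pairs
    hypotheses: list of callables h(x) -> label
    Changing one example changes each error count by at most 1
    (sensitivity 1), so this selection is epsilon-differentially private.
    """
    rng = rng or np.random.default_rng()
    errors = np.array([
        sum(h(x) != label for x, label in examples) for h in hypotheses
    ])
    # Standard exponential mechanism weights, shifted for numerical stability.
    weights = np.exp(-epsilon * (errors - errors.min()) / 2.0)
    probs = weights / weights.sum()
    return hypotheses[rng.choice(len(hypotheses), p=probs)]
```

Strictly speaking, this learner is \({\alpha}\)-useful only with high probability once the sample is large enough (roughly \({\log|H|/(\epsilon\alpha)}\) examples in the realizable case), rather than always, but it is the kind of algorithm the argument below has in mind.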

I claim that the output of \({\cal A}\) must have error at most \({O(\epsilon + \alpha)}\) on the underlying distribution \({P.}\) This means precisely that the output generalizes. Importantly, we didn't assume anything about the hypothesis class! We only assumed that the learner is differentially private.

Proof sketch: Pick a fresh example \({(x^*,l^*)}\) from the distribution \({P.}\) Let \({H^*= \{ h \colon h(x^*) = c(x^*) \}}\) be the set of hypotheses that agree with the concept \({c}\) on this example \({x^*.}\) Let \({D'}\) be the example set obtained by removing a random element from \({D}\) and replacing it with \({(x^*,l^*).}\) Since \({\cal A}\) is useful, the hypothesis \({h' = {\cal A}(D')}\) has error at most \({\alpha}\) on \({D'}\). Because all examples in \({D'}\) are identically distributed and \({x^*}\) sits in a uniformly random position among them, \({h'}\) classifies \({x^*}\) correctly with probability at least \({1-\alpha.}\) Formally, \({\Pr\{ {\cal A}(D') \in H^* \}\ge 1-\alpha.}\) Note that the randomness is now over both \({\cal A}\) and the replacement we did. Appealing to the definition of Differential Privacy above, it follows that

\(\displaystyle \Pr\{ {\cal A}(D) \in H^*\} \ge e^{-\epsilon}(1-\alpha) \ge (1-\epsilon)(1-\alpha) \ge 1 - O(\epsilon + \alpha). \)

That means \({{\cal A}(D)}\) is correct with high probability on an *unseen* example drawn from \({P.}\)

The above is just a simple example of a broader phenomenon showing that stability of a learning algorithm implies generalization bounds. Differential Privacy is a strong form of stability that implies most notions of stability studied in learning theory.

What's next?

False discovery and privacy loss are two persistent issues we're facing in the era of data analysis. The two problems share some fundamental characteristics. It could be very fruitful to think about both problems in relation to each other, or even as facets of the same underlying problem. Technically, there seems to be some unexplored middle ground between FDR and Differential Privacy that I'd like to understand better.

The discussion here is also closely related to my first post of the semester, in which I asked: what should a theory of big data do? I argued that besides posing concrete technical challenges, big data poses a deep conceptual problem. We seem to have difficulty capturing the way people interact with data today. This makes it hard for us to get control over the things that can go wrong. I seem to have come full circle, asking the same question again. Of course, I knew that this was too broad a question to be resolved in the span of four months.