I just returned from NIPS 2015, a joyful week of corporate parties featuring deep learning themed cocktails, money talk, recruiting events, and some scientific activities on the side. In the latter category, I co-organized a workshop on adaptive data analysis with Vitaly Feldman, Aaron Roth and Adam Smith.

Our workshop responds to an increasingly pressing issue in machine learning and statistics. The classic view of these fields has it that we choose our method independently of the data to which we intend to apply the method. For example, a hypothesis test must be fixed in advance before we see the data. I sometimes call this static or confirmatory data analysis. You need to know exactly what you want to do before you collect the data and run your experiment. In contrast, in practice we typically choose our methods as a function of the data to which we apply them. In other words, we adapt our method to the data.

[Figure: static vs. adaptive data analysis]

Adaptivity is both powerful and dangerous. Working adaptively gives us greater flexibility to make unanticipated discoveries, as it allows us to execute more complex analysis workflows. But it can also lead to false discovery and misleading conclusions far more easily than static data analysis. While there are many other issues to keep in mind, it is not unreasonable to blame our lack of understanding of adaptivity in part for exacerbating a range of problems, from false discovery in the empirical sciences to overfitting in machine learning competitions.

I was hugely impressed with the incredibly exciting group of participants and audience members from both computer science and statistics that our workshop brought together. I felt that we made actual progress on beginning to understand the diverse perspectives on this complex issue. The goal of this post is to convey my excitement for this emerging research area as I attempt to summarize the different perspectives we saw.

The frequentist statistics perspective

Null hypothesis tests are still widely used across the empirical sciences to gauge the validity of findings. Scientists routinely calculate p-values with the hope of being able to reject a null hypothesis and thus claim a “statistically significant” finding. If we carry out multiple hypothesis tests, we need to adjust our p-values for the fact that we made multiple tests. A safe way of correcting for multiple tests is the Bonferroni correction which amounts to multiplying all p-values by the number of tests. Computer scientists call this the union bound. While Bonferroni is safe, it makes discoveries difficult in the common situation where we have lots of tests and no individual signal is particularly strong. A more powerful alternative to Bonferroni is to control the False Discovery Rate (FDR) proposed in a celebrated paper by Benjamini and Hochberg. Intuitively, controlling FDR amounts to putting a bound on the expected ratio of all false discoveries (number of rejected true null hypotheses) to all discoveries (number of rejected nulls). The famous Benjamini-Hochberg procedure gives one beautiful way of doing this.
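
To make the two corrections concrete, here is a minimal sketch in Python (the function names and simulated p-values are my own, not from the talks): Bonferroni tests every hypothesis at level \(\alpha/m\), while Benjamini-Hochberg rejects the \(k\) smallest p-values for the largest \(k\) satisfying \(p_{(k)} \le qk/m\).

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Reject H_i if p_i <= alpha / m; controls the family-wise error rate."""
    p = np.asarray(p_values)
    return p <= alpha / len(p)

def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure; controls FDR at level q
    when the p-values are independent (or positively dependent)."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                        # ranks of p-values, ascending
    thresholds = q * np.arange(1, m + 1) / m     # q * k/m for k = 1, ..., m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])         # largest k with p_(k) <= qk/m
        reject[order[:k + 1]] = True             # reject the k smallest p-values
    return reject

# Simulated example: 900 true nulls and 100 non-nulls with small p-values.
rng = np.random.default_rng(0)
p = np.concatenate([rng.uniform(size=900),
                    rng.uniform(0, 0.001, size=100)])
print(bonferroni(p).sum(), benjamini_hochberg(p).sum())  # BH makes far more discoveries
```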

Online False Discovery Rate

Andrea Montanari and Dean Foster discussed more recent work on online variants of FDR control. Here, the goal is to control FDR not only at the end of the testing process, but at all points along the way. In particular, the scientist must decide whether or not to reject a hypothesis at each point in time without knowing the outcome of future tests. The word online should not be confused with adaptive. Although we could in principle choose hypotheses adaptively in this framework, we still need the traditional assumptions that all p-values are independent of each other and distributed the way they should be (i.e., uniform if the null hypothesis is true). If the selection of hypothesis tests were truly adaptive, these assumptions would be unlikely to hold and hard to verify in any case.
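
To illustrate the online constraint, here is a toy rule in the spirit of Foster and Stine’s alpha-investing idea: each test is funded out of a wealth that rejections replenish and non-rejections deplete, and no decision looks at future p-values. The bet sizes and constants below are my own simplifications, not the procedures presented in the talks.

```python
def alpha_investing(p_values, wealth=0.05, payout=0.05):
    """Toy online testing rule in the spirit of alpha-investing.

    Each test is funded out of the current wealth; a rejection earns
    wealth back, a non-rejection spends it. Every decision is made
    before later p-values are seen. The bet size is an arbitrary
    simplification, chosen so that wealth never goes negative.
    """
    decisions = []
    for p in p_values:
        alpha_j = min(wealth / 2, 0.25)            # bet a fraction of current wealth
        reject = p <= alpha_j
        if reject:
            wealth += payout                       # a discovery replenishes wealth
        else:
            wealth -= alpha_j / (1 - alpha_j)      # a failed bet costs wealth
        decisions.append(reject)
    return decisions

# An online stream of p-values; each decision ignores everything that comes later.
print(alpha_investing([0.001, 0.2, 0.8, 0.004, 0.03]))
```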

Inference after selection

But what if our hypothesis tests are chosen as a function of the data? For example, what if we first choose a set of promising data attributes and test only these attributes for significance? This natural two-step procedure is called inference after selection in statistics. Rob Tibshirani gave an entire keynote about this topic. Will Fithian went into further detail in his talk at our workshop. Several recent works in this area show how we can first perform variable selection using an algorithm such as Lasso, followed by hypothesis testing on the selected variables. What makes these results possible is a careful analysis of the distribution of the p-values conditional on the selection by Lasso. For example, if the test statistic followed a normal distribution before selection, it will follow a certain truncated normal distribution after selection. This approach leads to very accurate characterizations and often tight confidence intervals. However, it has the shortcoming that we need to commit to a particular selection and inference procedure, as the analysis crucially exploits their structure.
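
Here is a toy version of the conditioning idea, with a simple threshold rule standing in for the Lasso (my simplification): if we only test coordinates whose z-statistic exceeds a cutoff \(c\), then under the null the statistic follows a normal distribution truncated to \((c, \infty)\), and the selective p-value divides the usual tail probability by the probability of selection.

```python
from scipy.stats import norm

def naive_p_value(z):
    """One-sided p-value that ignores how the coordinate was selected."""
    return norm.sf(z)

def selective_p_value(z, c):
    """One-sided p-value conditional on the selection event {Z > c}.

    Under the null, Z ~ N(0, 1); conditioning on selection truncates the
    distribution to (c, infinity), so P(Z >= z | Z > c) = sf(z) / sf(c).
    """
    return norm.sf(z) / norm.sf(c)

# A statistic that barely cleared the selection cutoff looks far less
# significant once we account for how it was chosen.
z, c = 2.2, 2.0
print(naive_p_value(z))         # about 0.014: looks significant
print(selective_p_value(z, c))  # about 0.61: not significant after selection
```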

Rina Foygel Barber expanded on the theme of adaptivity and false discovery rate by showing how to control FDR when we first order our hypothesis tests in a data-dependent manner to allow for making important discoveries sooner.

The Bayesian view

Andrew Gelman’s talk (see his slides) contributed many illustrative examples of flawed empirical studies. His suggested cure for the woes of adaptivity was to do more of it: specifically, he advocated Bayesian hierarchical modeling to explicitly account for all the possible outcomes of an adaptive study. When asked whether this wouldn’t put too much burden on the modeler, he replied that the modeling should be seen as a feature, since it forces people to explicitly discuss the experimental setup.

The stability approach

An intriguing approach to generalization in machine learning is the idea of stability. An algorithm is stable if its output changes only slightly (in some formal sense) under an arbitrary substitution of a single input data point. A seminal work of Bousquet and Elisseeff showed that stability implies generalization. That is, patterns observed by a stable algorithm on a sample must also exist in the underlying population from which the sample was drawn. Nati Srebro explained in his talk how stability is a universal property in the sense that it is both sufficient and necessary for learnability.

[Figure: stability]
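
As a toy numerical illustration of the definition (my own example, not from the talks): the empirical mean of data in \([0,1]\) is uniformly stable, since substituting any single point moves the output by at most \(1/n\).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
sample = rng.uniform(size=n)              # data points in [0, 1]

# Substitute each point in turn by an extreme value and record how much
# the output (here: the empirical mean) can move.
changes = []
for i in range(n):
    for replacement in (0.0, 1.0):
        perturbed = sample.copy()
        perturbed[i] = replacement
        changes.append(abs(perturbed.mean() - sample.mean()))

print(max(changes), "<=", 1 / n)          # stability parameter of the mean is 1/n
```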

The issue is that stability by itself doesn’t address adaptivity. Indeed, the classic works on stability apply to the typical non-adaptive setting of learning where the sample is independent of the learning algorithm.

Cynthia Dwork talked about recent work that addresses this shortcoming. Specifically, differential privacy is a stability notion that applies even in the setting of adaptive data analysis. Hence, differential privacy implies validity in adaptive data analysis. Drawing on the many powerful algorithms developed in differential privacy, this connection gives a range of statistically valid tools for adaptive data analysis. See, for example, my blog post on the reusable holdout method that came out of this line of work.
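
To give a feel for what such a tool looks like, here is a rough sketch in the spirit of the reusable holdout (Thresholdout) mechanism; the noise scales and budget accounting are placeholders of my choosing, not the constants from the paper.

```python
import numpy as np

class ReusableHoldout:
    """Rough sketch in the spirit of the Thresholdout mechanism.

    Queries (functions mapping a data point to [0, 1]) are answered from
    the training set whenever training and holdout roughly agree; only
    when they disagree noticeably does the mechanism consult the holdout,
    and then it adds noise and charges a budget.
    """

    def __init__(self, train, holdout, threshold=0.04, sigma=0.01,
                 budget=100, seed=0):
        self.train, self.holdout = train, holdout
        self.threshold, self.sigma, self.budget = threshold, sigma, budget
        self.rng = np.random.default_rng(seed)
        self.noisy_threshold = threshold + self.rng.laplace(scale=2 * sigma)

    def query(self, phi):
        train_val = np.mean([phi(x) for x in self.train])
        holdout_val = np.mean([phi(x) for x in self.holdout])
        gap = abs(train_val - holdout_val)
        if gap <= self.noisy_threshold + self.rng.laplace(scale=4 * self.sigma):
            return train_val              # train and holdout agree: answer from training data
        if self.budget <= 0:
            raise RuntimeError("holdout budget exhausted")
        self.budget -= 1                  # pay only when the holdout is actually consulted
        self.noisy_threshold = self.threshold + self.rng.laplace(scale=2 * self.sigma)
        return holdout_val + self.rng.laplace(scale=self.sigma)
```

The design point is that queries which do not overfit never touch the holdout, so the budget is only spent when the analyst actually gains information about the holdout data.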

Information-theoretic measures

The intuition behind differential privacy is that if an analysis does not reveal too much about the specifics of a data set, then it is impossible to overfit to the data set based on the available information. This suggests an information-theoretic approach that quantifies the mutual information between the sample and the output of the algorithm. A recent work shows that indeed certain strengthenings of mutual information prevent overfitting. Being completely general, this viewpoint allows us, for example, to discuss deterministic algorithms, whereas all differentially private algorithms are randomized. James Zou discussed the information-theoretic approach further and mentioned his recent work on this topic.
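
A representative bound of this flavor (stated generically; I am not quoting the precise theorem from the work mentioned above) reads as follows. Suppose the analyst selects a statistic indexed by \(T = T(S)\) from a family \(\{\phi_t\}\), where each \(\phi_t(S)\) is \(\sigma\)-subgaussian around its population value \(\phi_t(\mathcal{P})\); for an empirical average of a bounded function over \(n\) samples, \(\sigma\) scales like \(1/\sqrt{n}\). Then

\[
\Bigl|\,\mathbb{E}\bigl[\phi_T(S)\bigr] - \mathbb{E}\bigl[\phi_T(\mathcal{P})\bigr]\,\Bigr| \;\le\; \sqrt{2\,\sigma^2\, I(S;T)},
\]

where \(I(S;T)\) is the mutual information between the sample and the selection. If the analysis reveals little about the sample, then \(I(S;T)\) is small and the selected statistic cannot be badly biased on the population.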

Complexity-theoretic obstructions

In a dramatic turn of events, Jon Ullman told us why preventing false discovery in adaptive data analysis can be computationally intractable.

To understand the result, think of adaptive data analysis for a moment as an interactive protocol between the data analyst and the algorithm. The algorithm has access to a sample of size \(n\) and the analyst wants to learn about the underlying population by asking the algorithm a sequence of adaptively chosen questions.

Jon showed that there is a computationally efficient adaptive data analyst who can force any computationally efficient algorithm on a sample of size \(n\) to return a completely invalid estimate after no more than \(n^2\) adaptively chosen questions. Non-adaptively, this would require exponentially many questions. While this is a worst-case hardness result, it points to a computational barrier for what might have looked like a purely statistical problem. The \(n^2\) bound turns out to be tight in light of the differential privacy based positive results I mentioned earlier.

Panel discussion

Toward the end of our workshop, we had a fantastic panel discussion on open problems and conceptual questions. I don’t remember all of it, unfortunately. Below are some topics that stuck in my mind. If you remember more, please leave a comment.

Exact modeling versus worst-case assumptions

The approaches we saw cluster around two extremes. One is the case of exact analysis or modeling, for example, in the work on inference after selection. In these cases, we are able to exactly analyze the conditional distributions arising as a result of adaptivity and adjust our methods accordingly. The work on differential privacy in contrast makes no assumptions on the analyst. Both approaches have advantages and disadvantages. One is more exact and quantitatively less wasteful, but applies only to specific procedures. The other is more general and less restrictive to the analyst, but this generality also leads to hardness results. A goal for future work is to find middle ground.

Human adaptivity versus algorithmic adaptivity

There is a curious distinction we need to draw. Algorithms can be adaptive. Gradient descent, for example, is adaptive because it probes the data at a sequence of adaptively chosen points. Lasso followed by inference is another example of a single algorithm that exhibits some adaptive behavior. However, adaptivity also arises through the way humans work adaptively with the data. It even arises when a group of researchers study the same data set through a sequence of publications. Each publication builds on previous insights and is hence adaptive. It is much harder to analyze human adaptivity than algorithmic adaptivity. How can we even quantify the extent to which this is a problem? Take MNIST, for example. For almost two decades this data set has been used as a benchmark with the same test set. Although no individual algorithm is directly trained against the test set, it is quite likely that the sequence of proposed algorithms over the years strongly overfits to the test data. It seems that this is an issue we should take more seriously.

Building better intuition in practice

Any practical application of the algorithms we discussed will likely violate either the modeling assumptions we make or the quantitative regime to which our theory applies. This shouldn’t be surprising. Even in non-adaptive data analysis we make assumptions (such as normal approximations) that are not completely true in practice. What is different is that we have far less intuition for what works in practice in the adaptive setting.

What’s next?

After this workshop, I’m even more convinced that adaptive data analysis is here to stay. I expect lots of work in this area with many new theoretical insights. I also hope that these developments will lead to new tools that make the practice of machine learning and statistics more reliable.