Moody Rd
A blog about machine learning.
http://blog.mrtz.org/
Tue, 18 Apr 2017 02:45:39 +0000
Gradient descent learns linear dynamical systems
<p>Cross-posted at <a href="http://www.offconvex.org/2016/10/13/gradient-descent-learns-dynamical-systems/">offconvex.org</a>.</p>
<p>From text translation to video captioning, learning to map one sequence to another is an increasingly active research area in machine learning. Fueled by the success of recurrent neural networks in its many variants, the field has seen rapid advances over the last few years. Recurrent neural networks are typically trained using some form of stochastic gradient descent combined with backpropagation for computing derivatives. The fact that gradient descent finds a useful set of parameters is by no means obvious. The training objective is typically non-convex. The fact that the model is allowed to maintain state is an additional obstacle that makes training of recurrent neural networks challenging.</p>
<p>In this post, we take a step back to reflect on the mathematics of recurrent neural networks. Interpreting recurrent neural networks as dynamical systems, we will show that stochastic gradient descent successfully learns the parameters of an unknown <em>linear</em> dynamical system even though the training objective is non-convex. Along the way, we’ll discuss several useful concepts from control theory, a field that has studied linear dynamical systems for decades. Investigating stochastic gradient descent for learning linear dynamical systems not only bears out interesting connections between machine learning and control theory, it might also provide a useful stepping stone for a deeper understanding of recurrent neural networks more broadly.</p>
<h2 id="linear-dynamical-systems">Linear dynamical systems</h2>
<p>We focus on time-invariant single-input single-output systems. For an input sequence of real numbers $x_1,\dots, x_T\in \mathbb{R}$, the system maintains a sequence of hidden states $h_1,\dots, h_T\in \mathbb{R}^n$, and produces a sequence of outputs $y_1,\dots, y_T\in \mathbb{R}$ according to the following rules:</p>
<script type="math/tex; mode=display">h_{t+1} = Ah_t + Bx_t~~~~~~~~~~~~~~~~~~~~~</script>
<script type="math/tex; mode=display">\quad \quad\quad~y_t = Ch_t+Dx_t+\xi_t ~~~~~~~~~~~~~~~~(1)</script>
<p>Here $A,B,C,D$ are linear transformations with compatible dimensions, and $\xi_t$ is Gaussian noise added to the output at each time step. In the learning problem, often called system identification in control theory, we observe samples of input-output pairs $((x_1,\dots, x_T),(y_1,\dots, y_T))$ and aim to recover the parameters of the underlying linear system.</p>
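<p>To make the update rules in equation (1) concrete, here is a minimal sketch of a simulator in plain Python. The function name <code>simulate</code>, the zero initial state, the <code>noise_std</code> parameter, and the two-state example system are assumptions made for illustration; they do not come from the post.</p>

```python
import random


def simulate(A, B, C, D, xs, noise_std=0.0, seed=0):
    """Run h_{t+1} = A h_t + B x_t and y_t = C h_t + D x_t + xi_t.

    A is an n x n list of lists, B and C are length-n lists, D is a scalar.
    The initial state h_1 = 0 is an assumption of this sketch.
    """
    rng = random.Random(seed)
    n = len(B)
    h = [0.0] * n
    ys = []
    for x in xs:
        noise = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
        # Output is computed from the current state, then the state advances.
        ys.append(sum(C[i] * h[i] for i in range(n)) + D * x + noise)
        h = [sum(A[i][j] * h[j] for j in range(n)) + B[i] * x for i in range(n)]
    return ys


# Hypothetical two-state system, chosen only for illustration.
A = [[0.0, 1.0], [-0.1, 0.5]]
B = [0.0, 1.0]
C = [1.0, 0.5]
D = 0.2

# Feeding a unit impulse returns D, CB, CAB, CA^2B, ...
print(simulate(A, B, C, D, [1.0, 0.0, 0.0, 0.0]))
```

<p>With a unit impulse as input and no noise, the outputs are exactly the sequence $D, CB, CAB, CA^2B, \dots$, which is the quantity the frequency-domain analysis later in the post works with.</p>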
<p>Although control theory provides a rich set of techniques for identifying and manipulating linear systems, maximum likelihood estimation with stochastic gradient descent remains a popular heuristic.</p>
<p>We denote by $\Theta = (A,B,C,D)$ the parameters of the true system. We parametrize our model with $\widehat{\Theta} = (\hat{A},\hat{B},\hat{C},\hat{D})$, and the trained model maintains hidden states $\hat{h}_t$ and outputs $\hat{y}_t$ exactly as in equation (1). For each given example $(x,y) = ((x_1,\dots,x_T), (y_1,\dots, y_T))$, the negative log-likelihood of model $\widehat{\Theta}$ is, up to scaling and an additive constant, the squared loss
<script type="math/tex">f(\widehat{\Theta}, (x,y)) = \frac{1}{T}\sum_{t=1}^{T}\left\|y_t-\hat{y}_t\right\|^2</script>. The population risk is defined as the expected loss,</p>
<script type="math/tex; mode=display">f(\widehat{\Theta}) = \mathbb{E}_{(x,y)} \left[f(\widehat{\Theta}, (x,y))\right]</script>
<p>Stochastic gradients of the population risk can be computed in time $O(Tn)$ via backpropagation given random samples. We can therefore directly minimize population risk using stochastic gradient descent. The question is just whether the algorithm actually converges. Even though the state transformations are linear, the objective function we defined is not convex. Luckily, we will see that the objective is still <em>close enough</em> to convex for stochastic gradient descent to make steady progress towards the global minimum.</p>
<h2 id="hair-dryers-and-quasi-convex-functions">Hair dryers and quasi-convex functions</h2>
<p>Before we go into the math, let’s illustrate the algorithm with a pressing example that we all run into every morning: hair drying. Imagine you have a hair dryer with a <em>low</em> temperature setting and a <em>high</em> temperature setting. Neither setting is ideal. So every morning you switch between the settings frantically in an attempt to modulate to the ideal temperature. Measuring the resulting temperature (red line below) as a function of the input setting (green dots below), the picture you’ll see is something like <a href="https://www.mathworks.com/help/ident/examples/estimating-simple-models-from-real-laboratory-process-data.html?prodcode=ML">this</a>:</p>
<div style="text-align:center;">
<img style="width:800px;" src="/assets/sysid/dryer/dryer-0.svg" />
</div>
<p>You can see that the output temperature is related to the inputs. If you set the temperature to high for long enough, you’ll eventually get a high output temperature. But the system has state. Briefly lowering the temperature has little effect on the outputs. Intuition suggests that this kind of effect should be captured by a system with two or three hidden states. So, let’s see how SGD would go about finding the parameters of the system. We’ll initialize a system with three hidden states such that before training its predictions are just the inputs of the system. We then run SGD with a fixed learning rate on the same sequence for 400 steps.</p>
<p><!-- begin animation --></p>
<div style="text-align:center;">
<img style="width:800px;" id="imganim" src="/assets/sysid/dryer/dryer-1.svg" onclick="forward_image()" />
</div>
<script type="text/javascript">//<![CDATA[
var images = [
"/assets/sysid/dryer/dryer-1.svg",
"/assets/sysid/dryer/dryer-2.svg",
"/assets/sysid/dryer/dryer-3.svg",
"/assets/sysid/dryer/dryer-4.svg",
"/assets/sysid/dryer/dryer-5.svg",
"/assets/sysid/dryer/dryer-6.svg",
"/assets/sysid/dryer/dryer-7.svg",
"/assets/sysid/dryer/dryer-8.svg",
]
var iC = 0
function forward_image(){
iC = iC + 1;
document.getElementById('imganim').src = images[iC%8];
document.getElementById('counter').textContent = 50* (iC%8);
}
//]]>
</script>
<p><!-- end animation --></p>
<p><em>The blue line shows the predictions of SGD after <span style="font-family:monospace;"><span id="counter">0</span>/400</span> gradient updates. Click to advance.</em></p>
<p>Evidently, gradient descent converges just fine on this example. Let’s look at the hair dryer objective function along the line segment between two random points in the domain.</p>
<div style="text-align:center;">
<img src="/assets/sysid/dryer-segment.svg" />
</div>
<p>The function is clearly not convex, but it doesn’t look too bad either. In particular, from the picture, it could be that the objective function is <em>quasi-convex</em>:</p>
<blockquote>
<p><strong>Definition:</strong> For $\tau > 0$, a function $f(\theta)$ is $\tau$-quasi-convex with respect to a global minimum $\theta ^ * $ if for every $\theta$,
<script type="math/tex">\langle \nabla f(\theta), \theta - \theta^* \rangle \ge \tau (f(\theta)-f(\theta^*)).</script></p>
</blockquote>
<p>Intuitively, quasi-convexity states that the descent direction $-\nabla f(\theta)$ is positively correlated with the ideal moving direction $\theta^* -\theta$. This implies that the potential function $\left\|\theta-\theta^*\right\|^2$ decreases in expectation at each step of stochastic gradient descent. This observation plugs nicely into the standard SGD analysis, leading to the following result:</p>
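<p>The one-step calculation behind this remark can be sketched as follows. The step size $\eta$ and the unbiased stochastic gradient $g_k$ are notation introduced here for illustration and are not used elsewhere in the post.</p>

```latex
\begin{aligned}
\theta_{k+1} &= \theta_k - \eta\, g_k,
  \qquad \mathbb{E}\left[g_k \mid \theta_k\right] = \nabla f(\theta_k), \\
\mathbb{E}\left[\|\theta_{k+1}-\theta^*\|^2 \mid \theta_k\right]
  &= \|\theta_k-\theta^*\|^2
     - 2\eta\,\langle \nabla f(\theta_k),\, \theta_k-\theta^* \rangle
     + \eta^2\,\mathbb{E}\left[\|g_k\|^2 \mid \theta_k\right] \\
  &\le \|\theta_k-\theta^*\|^2
     - 2\eta\tau\,\bigl(f(\theta_k)-f(\theta^*)\bigr)
     + \eta^2\,\mathbb{E}\left[\|g_k\|^2 \mid \theta_k\right].
\end{aligned}
```

<p>The inequality is exactly the definition of $\tau$-quasi-convexity. The potential therefore shrinks in expectation whenever $2\tau\,(f(\theta_k)-f(\theta^*))$ exceeds $\eta\,\mathbb{E}[\|g_k\|^2 \mid \theta_k]$, which is why a suitably small learning rate is needed.</p>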
<blockquote>
<p><strong>Proposition:</strong> (informal) If the population risk $f(\theta)$ is $\tau$-quasi-convex, then stochastic gradient descent (with fresh samples at each iteration and a proper learning rate) converges to a point $\theta_K$ in $K$ iterations with error bounded by
$f(\theta_K) - f(\theta^*) \leq O(1/(\tau \sqrt{K}))$.</p>
</blockquote>
<p>The key challenge for us is to understand under what conditions we can prove that the population risk objective is in fact quasi-convex. This requires some background.</p>
<h2 id="control-theory-polynomial-roots-and-pac-man">Control theory, polynomial roots, and Pac-Man</h2>
<p>A linear dynamical system $(A,B,C,D)$ is equivalent to the system $(TAT^{-1}, TB, CT^{-1}, D)$ for any invertible matrix $T$ in terms of the behavior of the outputs. A little thought shows therefore that in its unrestricted parameterization the objective function cannot have a unique optimum. A common way of removing this redundancy is to impose a canonical form. Almost all non-degenerate systems admit the <em>controllable canonical form</em>, defined as</p>
<script type="math/tex; mode=display">% <![CDATA[
A\; = \;
\left[ \begin{array}{ccccc} 0 & 1 & 0 & \cdots & 0 \newline 0 & 0 & 1 & \cdots & 0 \newline
\vdots & \vdots & \vdots & \ddots & \vdots \newline 0 & 0 & 0 & \cdots & 1 \newline
-a_n & -a_{n-1} & -a_{n-2} & \cdots & -a_1 \end{array} \right]
\qquad
B = \left[ \begin{array}{c} 0\newline 0 \newline\vdots \newline 0 \newline 1 \end{array} \right] %]]></script>
<script type="math/tex; mode=display">% <![CDATA[
C\;~= \;
\left[ \begin{array}{ccccc} c_1~~~~& c_2~~~~ & c_3~~~~& ~~\cdots\cdots~~~~& c_n \end{array} \right]
\qquad
D =~~ \left[ \begin{array}{c} d\end{array} \right] %]]></script>
<p>We will also parametrize our training model using these forms. One of its nice properties is that the coefficients of the characteristic polynomial of the <em>state transition matrix</em> $A$ can be read off from the last row of $A$. That is,
<script type="math/tex">\det(zI-A) = p_a(z) := z^n+a_1z^{n-1}+\dots + a_n.</script></p>
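<p>As a quick sanity check of this property, the sketch below builds the state transition matrix from a coefficient vector and verifies that each root $z$ of $p_a$ is an eigenvalue of $A$ with eigenvector $(1, z, \dots, z^{n-1})$. The helper names and the example polynomial are assumptions chosen for illustration.</p>

```python
def companion(a):
    """State transition matrix in controllable canonical form.

    a = [a_1, ..., a_n]; the last row of the result is (-a_n, ..., -a_1).
    """
    n = len(a)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        A[i][i + 1] = 1.0  # ones on the superdiagonal
    A[n - 1] = [-a[n - 1 - j] for j in range(n)]
    return A


def char_poly(a, z):
    """p_a(z) = z^n + a_1 z^(n-1) + ... + a_n."""
    n = len(a)
    return z ** n + sum(a[k] * z ** (n - 1 - k) for k in range(n))


def matvec(A, v):
    """Plain matrix-vector product."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in A]


# Example: p_a(z) = (z - 0.5)(z + 0.25) = z^2 - 0.25 z - 0.125
a = [-0.25, -0.125]
A = companion(a)
for root in (0.5, -0.25):
    v = [root ** i for i in range(len(a))]  # (1, z, ..., z^(n-1))
    print(root, char_poly(a, root), matvec(A, v), [root * x for x in v])
```

<p>For each root, the last two printed vectors agree, so the roots of $p_a$ are eigenvalues of $A$.</p>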
<p>Even in controllable canonical form, it still seems rather difficult to learn arbitrary linear dynamical systems. A natural restriction would be <em>stability</em>, that is, to require that the eigenvalues of $A$ are all bounded by $1$ in magnitude. Equivalently, the roots of the characteristic polynomial should all be contained in the complex unit disc. Without stability, the state of the system could blow up exponentially, making robust learning difficult. But the set of all stable systems forms a non-convex domain. It seems daunting to guarantee that stochastic gradient descent would converge from an arbitrary starting point in this domain without ever leaving the domain.</p>
<p>We will therefore impose a stronger restriction on the roots of the characteristic polynomial. We call this the Pac-Man condition. You can think of it as a strengthening of stability.</p>
<blockquote>
<p><strong>Pac-Man condition</strong>: A linear dynamical system in controllable canonical form satisfies the Pac-Man condition if the coefficient vector $a$ defining the state transition matrix satisfies
<script type="math/tex">Re(q_a(z)) > |Im(q_a(z))|</script> for all complex numbers $z$ of modulus $|z| = 1$, where $q_a(z) = p_a(z)/z^n = 1+a_1z^{-1}+\dots + a_nz^{-n}$.</p>
</blockquote>
<div style="text-align:center;">
<img style="width:350px;margin-bottom:50px;" src="/assets/sysid/pacman.png" />
<img style="width:400px;" src="/assets/sysid/trace-degree4.png" />
</div>
<p><em>Above, we illustrate this condition for a degree 4 system plotting the value of $q_a(z)$ on complex plane for all complex numbers $z$ on the unit circle.</em></p>
<p>We note that the Pac-Man condition is satisfied by vectors $a$ with $\|a\|_1\le \sqrt{2}/2$. Moreover, if $a$ is a random Gaussian vector with expected $\ell_2$ norm bounded by $o(1/\sqrt{\log n})$, then it will satisfy the Pac-Man condition with probability $1-o(1)$. Roughly speaking, the assumption requires that the roots of the characteristic polynomial $p_a(z)$ be relatively dispersed inside the unit circle.</p>
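<p>The condition is easy to probe numerically. The sketch below samples the unit circle and tests whether $q_a(z)$ stays inside the right-facing 90-degree wedge, that is, whether its real part exceeds the absolute value of its imaginary part. This reading of the condition is the one the wedge argument later in the post relies on. The function names and grid size are assumptions, and a sampled check is evidence rather than a proof.</p>

```python
import cmath
import math


def q(a, z):
    """q_a(z) = 1 + a_1 z^-1 + ... + a_n z^-n for a = [a_1, ..., a_n]."""
    return 1 + sum(a_k * z ** (-(k + 1)) for k, a_k in enumerate(a))


def satisfies_pacman(a, num_points=2000):
    """Test Re q_a(z) > |Im q_a(z)| on an evenly spaced grid of the unit circle."""
    for m in range(num_points):
        z = cmath.exp(2j * math.pi * m / num_points)
        w = q(a, z)
        if w.real <= abs(w.imag):
            return False
    return True


print(satisfies_pacman([0.3, 0.2]))  # l1 norm 0.5, below sqrt(2)/2
print(satisfies_pacman([0.9]))       # stable (root at -0.9) but outside the wedge
print(satisfies_pacman([2.0]))       # unstable (root at -2)
```

<p>The second example illustrates that the condition is strictly stronger than stability: the system with $p_a(z) = z + 0.9$ is stable, yet $q_a$ leaves the wedge on part of the unit circle.</p>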
<p>The Pac-Man condition has three important implications:</p>
<ol>
<li>
<p>It implies via <a href="https://en.wikipedia.org/wiki/Rouch%C3%A9%27s_theorem">Rouché’s theorem</a> that the spectral radius of $A$ is smaller than 1 and therefore ensures stability of the system.</p>
</li>
<li>
<p>The vectors satisfying it form a convex set in $\mathbb{R}^n$.</p>
</li>
<li>
<p>Finally, it ensures that the objective function is <em>quasi-convex</em>.</p>
</li>
</ol>
<h2 id="main-result">Main result</h2>
<p>Relying on the Pac-Man condition, we can show:</p>
<blockquote>
<p><strong>Main theorem (Hardt, Ma, Recht, 2016)</strong>: Under the Pac-Man condition, the projected gradient descent algorithm, given $N$ sample sequences of length $T$, returns parameters $\widehat{\Theta}$ with population risk
<script type="math/tex">f(\widehat{\Theta}) \le f(\Theta) + \mathrm{poly}(n)/\sqrt{NT}.</script></p>
</blockquote>
<p>The theorem sorts out the right dependence on $N$ and $T$. Even if there is only one sequence, we can learn the system provided that the sequence is long enough. Similarly, even if sequences are really short, we can learn provided that there are enough sequences.</p>
<h2 id="quasi-convexity-in-the-frequency-domain">Quasi-convexity in the frequency domain</h2>
<p>To establish quasi-convexity under the Pac-Man condition, we will first develop an explicit formula for the population risk in the frequency domain. In doing so, we assume that $x_1,\dots, x_T$ are pairwise independent with mean 0 and variance 1. For simplicity, we also consider the population risk as $T\rightarrow \infty$ in this post.</p>
<p>A simple algebraic manipulation simplifies the population risk with infinite sequence length to</p>
<script type="math/tex; mode=display">\lim_{T \rightarrow \infty} f(\widehat{\Theta}) = (\hat{D}-D)^2 + \sum_{k=0}^{\infty} (\hat{C}\hat{A}^kB-CA^k B)^2.</script>
<p>The first term, $(\hat D - D)^2$, is convex and appears nowhere else. We can safely ignore it and focus on the remaining expression instead, which we call the <em>idealized risk</em>:</p>
<script type="math/tex; mode=display">g(\widehat{\Theta}) = \sum_{k=0}^{\infty} (\hat{C}\hat{A}^kB-CA^k B)^2</script>
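<p>The idealized risk is a squared distance between two impulse responses, so it can be approximated by truncating the sum. The sketch below does this for systems in controllable canonical form; the helper names, the truncation length <code>K</code>, and the scalar example are assumptions for illustration.</p>

```python
def impulse_response(a, c, K):
    """Return [C A^k B for k = 0, ..., K-1] in controllable canonical form.

    a = [a_1, ..., a_n] defines the last row (-a_n, ..., -a_1) of A,
    c = [c_1, ..., c_n] is the row vector C, and B is the last unit vector.
    """
    n = len(a)
    h = [0.0] * (n - 1) + [1.0]  # start from h = B
    out = []
    for _ in range(K):
        out.append(sum(c[i] * h[i] for i in range(n)))
        # Multiply by A: shift the entries up, then apply the last row.
        last = -sum(a[n - 1 - j] * h[j] for j in range(n))
        h = h[1:] + [last]
    return out


def idealized_risk(a_hat, c_hat, a, c, K=200):
    """Truncated version of sum_k (C_hat A_hat^k B - C A^k B)^2."""
    r_hat = impulse_response(a_hat, c_hat, K)
    r = impulse_response(a, c, K)
    return sum((x - y) ** 2 for x, y in zip(r_hat, r))


# Scalar example: the true system has A = 0.5, the model has A_hat = 0.
print(impulse_response([-0.5], [1.0], 4))
print(idealized_risk([0.0], [1.0], [-0.5], [1.0]))
```

<p>For stable systems the impulse response decays geometrically, so a moderate truncation length already captures the sum to high accuracy.</p>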
<p>To deal with the sequence $\hat{C}\hat{A}^kB$, we take its Fourier transform and obtain that</p>
<script type="math/tex; mode=display">\hat{C}\hat{A}^kB, k\ge 0 ~~~~\longrightarrow ~~~~~~~\widehat{G}_{\lambda} = \frac{\hat{c}_1e^{i(n-1)\lambda}+\dots+ \hat{c}_n}{e^{in\lambda} + \hat{a}_1e^{i(n-1)\lambda}+\dots+\hat{a}_n}, \lambda\in [0,2\pi]</script>
<p>Similarly, we take the Fourier transform of $CA^kB$, denoted by $G_{\lambda}$. Then by Parseval’s Theorem, we obtain the following alternative representation of the idealized risk,</p>
<script type="math/tex; mode=display">g(\widehat{\Theta}) = \frac{1}{2\pi}\int_{0}^{2\pi} |G_{\lambda}-\widehat{G}_{\lambda}|^2 d\lambda.</script>
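<p>Parseval’s theorem can be checked numerically on truncated impulse responses. The sketch below compares the time-domain sum of squared differences with the frequency-domain integral, using the normalization $\frac{1}{2\pi}\int_0^{2\pi}$ and dropping an overall phase factor that does not affect the modulus. The two geometric sequences are hypothetical scalar systems picked for illustration.</p>

```python
import cmath
import math


def transfer_on_circle(r, lam):
    """Truncated Fourier series sum_k r_k e^{-i lam k} of an impulse response r."""
    return sum(r_k * cmath.exp(-1j * lam * k) for k, r_k in enumerate(r))


def risk_time_domain(r_hat, r):
    """Sum of squared differences between two impulse responses."""
    return sum((x - y) ** 2 for x, y in zip(r_hat, r))


def risk_frequency_domain(r_hat, r, M=128):
    """(1 / (2 pi)) * integral of |G - G_hat|^2, computed as an M-point Riemann sum.

    The Riemann sum is exact when M exceeds the length of the sequences.
    """
    total = 0.0
    for m in range(M):
        lam = 2 * math.pi * m / M
        total += abs(transfer_on_circle(r, lam) - transfer_on_circle(r_hat, lam)) ** 2
    return total / M


K = 40
r = [0.5 ** k for k in range(K)]            # hypothetical target: C A^k B = 0.5^k
r_hat = [0.8 * 0.3 ** k for k in range(K)]  # hypothetical model response

print(risk_time_domain(r_hat, r))
print(risk_frequency_domain(r_hat, r))
```

<p>The two printed values agree up to floating-point error, which is the content of Parseval’s theorem for finite sequences.</p>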
<p>Mapping out $G_\lambda$ and $\widehat G_\lambda$ for all $\lambda\in [0, 2\pi]$ gives the following picture:</p>
<div style="text-align:center;">
<img style="width:400px;" src="/assets/sysid/transfer/approx-10.png" onclick="forward_transfer_image()" />
<img style="width:400px;" id="transfer-img" src="/assets/sysid/transfer/approx-00.png" onclick="forward_transfer_image()" />
</div>
<script type="text/javascript">//<![CDATA[
var transfer_images = [
"/assets/sysid/transfer/approx-00.png",
"/assets/sysid/transfer/approx-01.png",
"/assets/sysid/transfer/approx-02.png",
"/assets/sysid/transfer/approx-03.png",
"/assets/sysid/transfer/approx-04.png",
"/assets/sysid/transfer/approx-05.png",
"/assets/sysid/transfer/approx-06.png",
"/assets/sysid/transfer/approx-07.png",
"/assets/sysid/transfer/approx-08.png",
"/assets/sysid/transfer/approx-09.png",
"/assets/sysid/transfer/approx-10.png",
]
var iA = 0
function forward_transfer_image(){
iA = iA + 1;
document.getElementById('transfer-img').src = transfer_images[iA%11];
document.getElementById('transfer-counter').textContent = (iA%11);
}
//]]>
</script>
<p><em>Left: Target transfer function $G$. Right: Approximation $\widehat G$ at step <span style="font-family:monospace" id="transfer-counter">0</span>/10. Click to advance.</em></p>
<p>Given this pretty representation of the idealized risk objective, we can finally prove our main lemma.</p>
<blockquote>
<p><strong>Lemma:</strong> Suppose $\Theta$ satisfies the Pac-Man condition. Then,
for every $0\le \lambda\le 2\pi$, $|G_{\lambda}-\widehat{G}_{\lambda}|^2$,
as a function of $\hat{A},\hat{C}$ is quasi-convex in the Pac-Man region.</p>
</blockquote>
<p>The lemma reduces to the following simple claim.</p>
<blockquote>
<p><strong>Claim:</strong> The function $h(\hat{u},\hat{v}) = |\hat{u}/\hat{v} - u/v|^2$ is quasi-convex in the region where $Re(\hat{v}/v) > 0$.</p>
</blockquote>
<p>The proof simply involves computing the gradients and checking the conditions for quasi-convexity by elementary algebra. We omit a formal proof, but instead show a plot of the function $h(\hat{u}, \hat{v}) = (\hat{u}/\hat{v}- 1)^2$ over the reals:</p>
<p><!-- begin animation --></p>
<div style="text-align:center;">
<img style="height:600px" id="3dplot-img" src="/assets/sysid/3dplot/3dplot-30.jpg" onclick="forward_3dplot_image()" />
<p style="text-align:center;"> Click to rotate.</p>
</div>
<script type="text/javascript">//<![CDATA[
var plot3d_images = [
"/assets/sysid/3dplot/3dplot-0.jpg",
"/assets/sysid/3dplot/3dplot-10.jpg",
"/assets/sysid/3dplot/3dplot-20.jpg",
"/assets/sysid/3dplot/3dplot-30.jpg",
"/assets/sysid/3dplot/3dplot-40.jpg",
"/assets/sysid/3dplot/3dplot-50.jpg",
"/assets/sysid/3dplot/3dplot-60.jpg",
"/assets/sysid/3dplot/3dplot-70.jpg",
"/assets/sysid/3dplot/3dplot-80.jpg",
"/assets/sysid/3dplot/3dplot-90.jpg",
]
var iB = 3
var inc_sign = 1
function forward_3dplot_image(){
iB = iB + inc_sign;
if (iB == 9) {
inc_sign = -1;
}
if (iB == 0) {
inc_sign = 1;
}
document.getElementById('3dplot-img').src = plot3d_images[iB];
}
//]]>
</script>
<p><!-- end animation --></p>
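<p>For the real-valued function plotted above, the quasi-convexity inequality can also be checked numerically. Taking the minimizer $\theta^* = (1,1)$, a short calculation gives $\langle \nabla h, \theta - \theta^* \rangle = (2/\hat{v})\, h(\hat{u},\hat{v})$, which is nonnegative exactly when $\hat{v}$ is positive. The sketch below verifies this identity with finite differences; the function names, the choice of minimizer, and the test grid are assumptions for illustration.</p>

```python
def h(u_hat, v_hat):
    """h(u_hat, v_hat) = (u_hat / v_hat - 1)^2 over the reals."""
    return (u_hat / v_hat - 1.0) ** 2


def grad(u_hat, v_hat, eps=1e-6):
    """Central finite-difference approximation of the gradient of h."""
    du = (h(u_hat + eps, v_hat) - h(u_hat - eps, v_hat)) / (2 * eps)
    dv = (h(u_hat, v_hat + eps) - h(u_hat, v_hat - eps)) / (2 * eps)
    return du, dv


def correlation(u_hat, v_hat, target=(1.0, 1.0)):
    """Inner product of grad h with theta - theta*, for the minimizer theta* = target."""
    du, dv = grad(u_hat, v_hat)
    return du * (u_hat - target[0]) + dv * (v_hat - target[1])


for u_hat in (-2.0, -0.5, 0.3, 1.0, 2.0):
    for v_hat in (0.5, 1.0, 1.5, 3.0):
        lhs = correlation(u_hat, v_hat)
        rhs = 2.0 / v_hat * h(u_hat, v_hat)
        print(u_hat, v_hat, round(lhs, 6), round(rhs, 6))
```

<p>On the half-plane where $\hat{v}$ is positive and bounded above, the inequality therefore holds with $\tau = 2/\hat{v}$. For negative $\hat{v}$ the inner product changes sign, which matches the restriction in the claim.</p>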
<p>To see how the lemma follows from the previous claim, we note that quasi-convexity is preserved under composition with any linear transformation. Specifically, if $h(z)$ is quasi-convex, then $h(R x)$ is also quasi-convex for any linear map $R$. So, consider the linear map:</p>
<script type="math/tex; mode=display">(\hat{a},\hat{c})\mapsto (\hat u, \hat v) = (\hat{c}_1e^{i(n-1)\lambda}+\dots+ \hat{c}_n, e^{in\lambda}
+\hat{a}_1e^{i(n-1)\lambda}+\dots+\hat{a}_n)</script>
<p>With this linear transformation, our simple claim about a bivariate function extends to show that $|G_{\lambda}-\widehat{G}_{\lambda}|^2$ is quasi-convex when $Re(\hat{v}/v) > 0$. In particular, when $\hat{a}$ and $a$ both satisfy the Pac-Man condition, $\hat{v}$ and $v$ both reside in the 90-degree wedge, so the angle between them is smaller than 90 degrees. This implies that $Re(\hat{v}/v) > 0$.</p>
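<p>The last step is a statement about angles: two complex numbers whose arguments both lie strictly within 45 degrees of the positive real axis differ in angle by less than 90 degrees, so their ratio has positive real part. A small randomized check, with sampling ranges that are assumptions of this sketch:</p>

```python
import cmath
import math
import random


def random_wedge_point(rng):
    """Random complex number strictly inside the wedge Re w > |Im w|."""
    radius = rng.uniform(0.1, 2.0)
    angle = rng.uniform(-0.99 * math.pi / 4, 0.99 * math.pi / 4)
    return cmath.rect(radius, angle)


def ratio_has_positive_real_part(trials=1000, seed=0):
    """Check Re(v_hat / v) > 0 for random pairs of wedge points."""
    rng = random.Random(seed)
    for _ in range(trials):
        v_hat = random_wedge_point(rng)
        v = random_wedge_point(rng)
        if (v_hat / v).real <= 0:
            return False
    return True


print(ratio_has_positive_real_part())
```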
<h2 id="conclusion">Conclusion</h2>
<p>We saw conditions under which stochastic gradient descent successfully learns a linear dynamical system. In <a href="https://arxiv.org/abs/1609.05191">our paper</a>, we further show that allowing our learned system to have more parameters than the target system makes the problem dramatically easier. In particular, at the expense of slight over-parameterization we can weaken the Pac-Man condition to a mild separation condition on the roots of the characteristic polynomial. This is consistent with empirical observations both in machine learning and control theory that highlight the effectiveness of additional model parameters.</p>
<p>More broadly, we hope that our techniques will be a first stepping stone toward a better theoretical understanding of recurrent neural networks.</p>
Thu, 13 Oct 2016 10:00:00 +0000
http://blog.mrtz.org/2016/10/13/gradient-descent-learns-dynamical-systems.html
Approaching fairness in machine learning
<p>As machine learning increasingly affects domains protected by anti-discrimination law, there is much interest in the problem of algorithmically measuring and ensuring fairness in machine learning. Across academia and industry, experts are finally embracing this important research direction that has long been marred by sensationalist clickbait overshadowing scientific efforts.</p>
<p>This sequence of posts is a sober take on the subtleties and difficulties in engaging productively with the issue of fairness in machine learning. Prudence is necessary, since a poor regulatory proposal could easily do more harm than doing nothing at all.</p>
<p>In this first post, I will focus on a sticky idea I call <em>demographic parity</em> that through its many variants has been proposed as a fairness criterion in dozens of papers. I will argue that demographic parity not only cripples machine learning, it also fails to guarantee fairness.</p>
<p>In a second post, I will introduce you to a measure of fairness put forward in a recent joint work with <a href="http://www.cs.utexas.edu/~ecprice/">Price</a> and <a href="http://ttic.uchicago.edu/~nati/">Srebro</a> that addresses the main conceptual shortcomings of demographic parity, while being fairly easy to apply and to interpret. A third post will use our framework to interpret the recent controversy on COMPAS scores. Finally, I’ll have an entire post on limitations of our work, and avenues for future research. My claims of future posts have been mostly wrong in the past except these posts are actually already written.</p>
<p>So, if you’re interested in the topic, but less so in seeing pictures of Terminator alongside vague claims about AI, stay on for this blog post series and join the discussion.</p>
<h2 id="a-call-to-action">A call to action</h2>
<p>Domains such as advertising, credit, education, and employment can all hugely benefit from modern machine learning techniques, but some are concerned that algorithms might introduce new biases or perpetuate existing ones. Indeed, the Obama Administration’s Big Data Working Group <a href="https://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf">argued in 2014</a> that discrimination may "be the inadvertent outcome of the way big
data technologies are structured and used" and pointed toward "the potential
of encoding discrimination in automated decisions".</p>
<p>It’s important to understand on technical grounds how decisions based on machine learning might wind up being unfair without any explicit wrongdoing. I wrote a post on this <a href="https://medium.com/@mrtz/how-big-data-is-unfair-9aa544d739de#.llzo69u3p">a while back</a>. Some will be disappointed to find that the core issues have little to do with Silicon Valley billionaires or the singularity, and more to do with fundamental conceptual and technical challenges.</p>
<p>But it’s also important to understand that often the use of algorithms exposes biases present in society rather than creating new ones. There is an exciting opportunity of reducing bias as we move to an algorithmic ecosystem.</p>
<p>Despite the need, a vetted methodology for measuring fairness in machine learning is lacking. Historically, the naive approach to fairness has been to assert that the algorithm simply doesn’t look at <em>protected attributes</em> such as race, color, religion, gender, disability, or family status. So, how could it discriminate? This idea of <em>fairness through blindness</em>, however, fails due to the existence of <em>redundant encodings</em>. There are almost always ways of predicting unknown protected attributes from other seemingly innocuous features.</p>
<h2 id="demographic-parity">Demographic parity</h2>
<p>After ruling out <em>fairness through blindness</em>, the next idea that springs to mind is <em>demographic parity</em>. Demographic parity requires that a decision—such as accepting or denying a loan application—be independent of the protected attribute. In the case of a binary decision <script type="math/tex">C\in\{0,1\}</script> and a binary protected attribute <script type="math/tex">A\in\{0,1\}</script>, this constraint can be formalized by asking that</p>
<script type="math/tex; mode=display">\mathbb{P}\{C=1 \mid A=0\}=\mathbb{P}\{C=1 \mid A=1\}.</script>
<p>In other words, membership in a protected class should have no correlation with the decision. Through its various equivalent formalizations and other variants this idea appears in numerous papers. For example, in the context of representation learning, it is tempting to ask that the
learned representation has zero <a href="https://en.wikipedia.org/wiki/Mutual_information">mutual information</a> with the protected attribute. Any classifier based on the learned representation will then inevitably satisfy demographic parity.</p>
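<p>Measuring demographic parity on data amounts to comparing two conditional acceptance rates. Here is a minimal sketch; the function names and the toy data are assumptions for illustration.</p>

```python
def acceptance_rate(decisions, groups, group):
    """Empirical estimate of P{C = 1 | A = group}."""
    selected = [c for c, a in zip(decisions, groups) if a == group]
    return sum(selected) / len(selected)


def parity_gap(decisions, groups):
    """Absolute difference between the two groups' acceptance rates."""
    return abs(acceptance_rate(decisions, groups, 0)
               - acceptance_rate(decisions, groups, 1))


# Toy data: protected attribute A and binary decision C for eight individuals.
groups = [0, 0, 0, 0, 1, 1, 1, 1]
decisions = [1, 1, 0, 0, 1, 0, 0, 0]

print(acceptance_rate(decisions, groups, 0))
print(acceptance_rate(decisions, groups, 1))
print(parity_gap(decisions, groups))
```

<p>Demographic parity holds on the sample exactly when the gap is zero. Note that the computation never looks at the target variable, which is the root of both problems discussed next.</p>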
<p>Unfortunately, the notion is seriously flawed on two counts.</p>
<h3 id="demographic-parity-doesnt-ensure-fairness">Demographic parity doesn’t ensure fairness</h3>
<p>The notion permits that a classifier selects qualified applicants in the demographic <script type="math/tex">A=0</script>, but unqualified individuals in <script type="math/tex">A=1</script>, so long as the percentages of acceptance match. Consider, for example, a luxury hotel chain that renders a promotion to a subset of wealthy whites (who are likely to visit the hotel) and a subset of less affluent blacks (who are unlikely to visit the hotel). The situation is obviously quite icky, but demographic parity is completely fine with it so long as the same fraction of people in each group see the promotion.</p>
<p>You might argue that it’s never in the advertiser’s interest to lose potential customers. But the above scenario can arise naturally when there is less training data available about a minority group. As a result, the advertiser might have a much better understanding of who to target in the majority group, while essentially random guessing within the minority.</p>
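<p>A toy dataset makes this failure mode concrete. In the sketch below, the classifier accepts only qualified individuals in one group and only unqualified individuals in the other, yet the acceptance rates match. The records and helper names are invented for illustration.</p>

```python
# Each record is (group A, qualified Y, decision C); the data is made up.
records = [
    (0, 1, 1), (0, 1, 1), (0, 0, 0), (0, 0, 0),  # group 0: qualified accepted
    (1, 1, 0), (1, 1, 0), (1, 0, 1), (1, 0, 1),  # group 1: unqualified accepted
]


def acceptance_rate(records, group):
    """Fraction of the group that receives a positive decision."""
    decisions = [c for a, _, c in records if a == group]
    return sum(decisions) / len(decisions)


def precision(records, group):
    """Fraction of accepted individuals in the group who are qualified."""
    accepted = [y for a, y, c in records if a == group and c == 1]
    return sum(accepted) / len(accepted)


for group in (0, 1):
    print(group, acceptance_rate(records, group), precision(records, group))
```

<p>Both groups are accepted at the same rate, so demographic parity is satisfied, while the quality of the decisions differs as much as it possibly can.</p>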
<h3 id="demographic-parity-cripples-machine-learning">Demographic parity cripples machine learning</h3>
<p>Imagine we could actually see into the future, or, equivalently we had a perfect predictor of future events. Formally, this predictor <script type="math/tex">C</script> would equal the <em>target variable</em> <script type="math/tex">Y</script> that we’re trying to predict with probability one. Oversimplifying the example of advertising, this predictor would tell us precisely who will actually purchase the product and who won’t. Assuming for a moment that such a perfect predictor exists, using <script type="math/tex">C</script> for targeted advertising can then hardly be considered discriminatory as it reflects actual purchase intent. But in advertising, and many other problems, the variable <script type="math/tex">Y</script> usually has some positive or negative correlation with membership in the protected group <script type="math/tex">A</script>. This isn’t by itself a cause for concern as interests naturally vary from one group to another. But in any such setting, demographic parity would rule out the ideal predictor <script type="math/tex">C = Y</script>. As a result, the loss in utility of imposing demographic parity can be substantial <em>for no good reason</em>. Demographic parity is simply misaligned with the fundamental goal of achieving higher prediction accuracy.</p>
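<p>The same point in numbers: when the base rate of the target differs between groups, the perfect predictor cannot satisfy demographic parity. The group sizes and base rates below are assumptions chosen for illustration.</p>

```python
def positive_rate(values, groups, group):
    """Fraction of positives among members of the given group."""
    members = [v for v, a in zip(values, groups) if a == group]
    return sum(members) / len(members)


# Hypothetical population: 60% positives in group 0, 30% in group 1.
groups = [0] * 10 + [1] * 10
y = [1] * 6 + [0] * 4 + [1] * 3 + [0] * 7

c = list(y)  # the ideal predictor C = Y

accuracy = sum(ci == yi for ci, yi in zip(c, y)) / len(y)
gap = abs(positive_rate(c, groups, 0) - positive_rate(c, groups, 1))

print(accuracy)
print(gap)
```

<p>The predictor is perfectly accurate, yet its parity gap equals the difference in base rates. Any classifier that closes the gap must make mistakes.</p>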
<p>To be sure, there is a set of applications for which demographic parity is not unreasonable, but this seems to be a subtle case to make. Any paper adopting demographic parity as a general measure of fairness is fundamentally flawed.</p>
<h2 id="moving-forward">Moving forward</h2>
<p>As I argued, demographic parity is both too strong and too weak. Hence, if you find a different condition that’s strictly stronger, it’ll still be flawed. If you come up with something strictly weaker, it’ll still be flawed. You also won’t salvage demographic parity by finding more impressive ways of achieving it such as backpropping through a 1200 layer neural net with spatial attention and unconscious sigmund activations.</p>
<p>If demographic parity is broken, why did I belabor its failure at such length?</p>
<p>One reason is to discourage people from writing more and more papers about it without a compelling explanation of why it makes sense for their application. The more important reason, however, is that the failure of demographic parity is actually quite an instructive lesson to learn. In particular, we will see how it suggests a much more reasonable notion that I will introduce in my next post.</p>
<p><em>Stay on top of future posts. Subscribe to the <a href="http://blog.mrtz.org/feed.xml">RSS feed</a>, or follow me on <a href="https://twitter.com/mrtz">Twitter</a>.</em></p>
Tue, 06 Sep 2016 15:30:00 +0000
http://blog.mrtz.org/2016/09/06/approaching-fairness.html
fairness, machine learning, statistics
Stability as a foundation of machine learning
<p><em>Cross-posted at <a href="http://www.offconvex.org/2016/03/14/stability/">offconvex.org</a>.</em></p>
<p>Central to machine learning is our ability to relate how a learning algorithm fares on a sample to its performance on unseen instances. This is called <em>generalization</em>.</p>
<p>In this post, I will describe a purely algorithmic approach to generalization. The property that makes this possible is <em>stability</em>. An algorithm is <em>stable</em>, intuitively speaking, if its output doesn’t change much if we perturb the input sample in a single point. We will see that this property by itself is necessary and sufficient for generalization.</p>
<h2 id="example-stability-of-the-perceptron-algorithm">Example: Stability of the Perceptron algorithm</h2>
<p>Before we jump into the formal details, let’s consider a simple example of a stable algorithm: The <a href="https://en.wikipedia.org/wiki/Perceptron">Perceptron</a>, aka stochastic gradient descent for learning linear separators! The algorithm aims to separate two classes of points (here circles and triangles) with a linear separator. The algorithm starts with an arbitrary hyperplane. It then repeatedly selects a single example from its input set and updates its hyperplane using the gradient of a certain loss function on the chosen example. How badly might the algorithm screw up if we move around a single example? Let’s find out.</p>
<p><!-- begin animation --></p>
<div style="text-align:center;">
<img id="imganim" src="/assets/sgd/00.png" onclick="forward_image()" />
<p style="text-align:center;"><em>Step <span style="font-family:monospace;"><span id="counter">1</span>/30</span>. Click to advance.<br /> The animation shows two runs of the Perceptron algorithm for learning a linear separator on two data sets that differ in the one point marked green in one data set and purple in the other. The perturbation is indicated by an arrow. The shaded green region shows the difference in the resulting two hyperplanes after some number of steps. </em></p>
</div>
<script type="text/javascript">//<![CDATA[
var images = [
"/assets/sgd/00.png",
"/assets/sgd/01.png",
"/assets/sgd/02.png",
"/assets/sgd/03.png",
"/assets/sgd/04.png",
"/assets/sgd/05.png",
"/assets/sgd/06.png",
"/assets/sgd/07.png",
"/assets/sgd/08.png",
"/assets/sgd/09.png",
"/assets/sgd/10.png",
"/assets/sgd/11.png",
"/assets/sgd/12.png",
"/assets/sgd/13.png",
"/assets/sgd/14.png",
"/assets/sgd/15.png",
"/assets/sgd/16.png",
"/assets/sgd/17.png",
"/assets/sgd/18.png",
"/assets/sgd/19.png",
"/assets/sgd/20.png",
"/assets/sgd/21.png",
"/assets/sgd/22.png",
"/assets/sgd/23.png",
"/assets/sgd/24.png",
"/assets/sgd/25.png",
"/assets/sgd/26.png",
"/assets/sgd/27.png",
"/assets/sgd/28.png",
"/assets/sgd/29.png" ]
var i = 0
function forward_image(){
i = i + 1;
document.getElementById('imganim').src = images[i%30];
document.getElementById('counter').textContent = (i%30) + 1;
}
//]]>
</script>
<p><!-- end animation --></p>
<p>As we can see by clicking impatiently through the example, the algorithm seems pretty stable. Even if we substantially move the first example it encounters, the hyperplane computed by the algorithm changes only slightly. Neat. (You can check out the code <a href="https://gist.github.com/mrtzh/266c37d3a274376134a6">here</a>.)</p>
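<p>For readers who want to experiment without the animation, here is a small self-contained sketch of the same experiment. It is not the code from the linked gist. The update rule is the classical Perceptron step, and the data set, the number of passes, and the fixed visiting order are assumptions made for illustration.</p>

```python
import math


def perceptron(data, passes=10):
    """Classical Perceptron through the origin on 2D points with labels +1 / -1."""
    w = [0.0, 0.0]
    for _ in range(passes):
        for (x1, x2), y in data:
            # Update only on examples that are misclassified or on the boundary.
            if y * (w[0] * x1 + w[1] * x2) <= 0:
                w[0] += y * x1
                w[1] += y * x2
    return w


def distance(w, v):
    """Euclidean distance between two weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w, v)))


# Two data sets that differ only in the first example.
sample = [((1.0, 1.0), 1), ((2.0, 1.0), 1), ((-1.0, -1.0), -1), ((-2.0, -1.0), -1)]
perturbed = [((1.0, 2.0), 1)] + sample[1:]

w = perceptron(sample)
w_perturbed = perceptron(perturbed)
print(w, w_perturbed, distance(w, w_perturbed))
```

<p>In this toy run the only update happens on the first example, so the two weight vectors differ by exactly the perturbation of that example and no more.</p>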
<h2 id="empirical-risk-jargon">Empirical risk jargon</h2>
<p>Let’s introduce some terminology to relate the behavior of an algorithm on a sample to its behavior on unseen instances. Imagine we have a sample $S=(z_1,\dots,z_n)$ drawn i.i.d. from some unknown distribution $D$. There’s a learning algorithm $A(S)$ that takes $S$ and produces some model (e.g., the hyperplane in the above picture). To quantify the quality of the model we crank out a <em>loss function</em> $\ell$ with the idea that $\ell(A(S), z)$ describes the <em>loss</em> of the model $A(S)$ on one instance $z$. The <em>empirical risk</em> or <em>training error</em> of the algorithm is defined as:</p>
<script type="math/tex; mode=display">R_S = \frac1n \sum_{i=1}^n \ell(A(S), z_i)</script>
<p>This captures the average loss of the algorithm on the sample on which it was trained. To quantify <em>out-of-sample</em> performance, we define the <em>risk</em> of the algorithm as:</p>
<script type="math/tex; mode=display">R = \mathop{\mathbb{E}}_{z\sim D}\left[ \ell(A(S), z) \right]</script>
<p>The difference between risk and empirical risk $R - R_S$ is called <em>generalization error</em>. You will sometimes encounter that term as a synonym for risk, but I find that confusing. We already have a perfectly short and good name for the risk $R$. Always keep in mind the following tautology:</p>
<script type="math/tex; mode=display">R = R_S + (R-R_S).</script>
<p>Operationally, it states that if we manage to minimize empirical risk, all that matters is generalization error.</p>
<h2 id="a-fundamental-theorem-of-machine-learning">A fundamental theorem of machine learning</h2>
<p>I probably shouldn’t propose fundamental theorems for anything really. But if I had to, this would be the one I’d suggest for machine learning:</p>
<p><strong>In expectation, generalization equals stability.</strong></p>
<p>Somewhat more formally, we will encounter a natural measure of stability, denoted $\Delta$ such that the difference between risk and empirical risk in expectation equals $\Delta.$ Formally,</p>
<script type="math/tex; mode=display">\mathbb{E}[R - R_S] = \Delta</script>
<p>Deferring the exact definition of $\Delta$ to the proof, let’s think about this for a second.
What I find so remarkable about this theorem is that it turns a statistical problem into a purely algorithmic one: All we need for generalization is an algorithmic notion of robustness. Our algorithm’s output shouldn’t change much if we perturb one of the data points. It’s almost like a sanity check. Had you coded up an algorithm and this wasn’t the case, you’d probably go look for a bug.</p>
<h3 id="proof">Proof</h3>
<p>Consider two data sets of size $n$ drawn independently of each other:
<script type="math/tex; mode=display">S = (z_1,\dots,z_n), \qquad S'=(z_1',\dots,z_n')</script>
The idea of taking such a <em>ghost sample</em> $S'$ is quite old and already arises in the context of <em>symmetrization</em> in empirical process theory.
We’re going to couple these two samples in one point by defining
<script type="math/tex; mode=display">S^i = (z_1,\dots,z_{i-1},z_i',z_{i+1},\dots,z_n),\qquad i = 1,\dots, n.</script>
It’s certainly no coincidence that $S$ and $S^i$ differ in exactly one element. We’re going to use this in just a moment.</p>
<p>By definition, the <em>expected empirical risk</em> equals</p>
<script type="math/tex; mode=display">\mathbb{E}[R_S] = \mathbb{E}\left[ \frac1n \sum_{i=1}^n \ell(A(S), z_i) \right].</script>
<p>Contrasting this to how the algorithm fares on unseen examples, we can rewrite the <em>expected risk</em> using our ghost sample as:</p>
<script type="math/tex; mode=display">\mathbb{E}[R] = \mathbb{E}\left[ \frac1n \sum_{i=1}^n \ell(A(S), \color{red}{z_i'}) \right]</script>
<p>All expectations we encounter are over both $S$ and $S'$. By linearity of expectation, the difference between expected risk and expected empirical risk equals</p>
<script type="math/tex; mode=display">\mathbb{E}[R - R_S]
= \frac1n \sum_{i=1}^n
\mathbb{E}\left[\ell(A(S), \color{red}{z_i'})-\ell(A(S), z_i)\right].</script>
<p>It is tempting now to relate the two terms inside the expectation to the stability of the algorithm. We’re going to do exactly that using mathematics’ most trusted proof strategy: <em>pattern matching</em>. Indeed, since $z_i$ and $z_i'$ are exchangeable, we have</p>
<script type="math/tex; mode=display">\mathbb{E}[\ell(A(S), z_i)]
= \mathbb{E}[\ell(A(S^i), z_i')]
= \mathbb{E}[\ell(A(S), z_i')] - \delta_i,</script>
<p>where $\delta_i$ is defined to make the second equality true:</p>
<script type="math/tex; mode=display">\delta_i = \mathbb{E}[\ell(A(\color{red}S), z_i')- \ell(A(\color{red}{S^i}), z_i')]</script>
<p>Summing up $\Delta = (1/n)\sum_i \delta_i$, we have</p>
<script type="math/tex; mode=display">\mathbb{E}[ R - R_S ] = \Delta.</script>
<p>The only thing left to do is to interpret the right hand side in terms of stability. Convince yourself that $\delta_i$ measures how differently the algorithm behaves on two data sets $S$ and $S^i$ that differ in only one element.</p>
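<p>The identity can also be checked numerically. The sketch below is a Monte Carlo simulation under assumptions that are not part of the proof: the algorithm $A(S)$ is taken to be the sample mean, the loss is the squared error, and the data are standard normal. In this special case both sides can be computed by hand, since $\mathbb{E}[R] = \sigma^2(1+1/n)$ and $\mathbb{E}[R_S] = \sigma^2(1-1/n)$, so both estimates should land near $2\sigma^2/n$, which is $0.4$ for $n=5$ and $\sigma=1$. The names <code>mean_model</code>, <code>sq_loss</code> and <code>simulate</code> are made up for this example.</p>

```python
import random

def mean_model(sample):
    """A(S): the sample mean, used as a constant predictor."""
    return sum(sample) / len(sample)

def sq_loss(model, z):
    """l(A(S), z): squared error of the constant prediction on instance z."""
    return (model - z) ** 2

def simulate(n=5, trials=20000, seed=0):
    """Estimate E[R - R_S] and Delta from the same draws of S and the ghost sample."""
    rng = random.Random(seed)
    gap_sum = 0.0    # (1/n) * sum_i [ l(A(S), z_i') - l(A(S), z_i) ]
    delta_sum = 0.0  # (1/n) * sum_i [ l(A(S), z_i') - l(A(S^i), z_i') ]
    for _ in range(trials):
        S = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ghost = [rng.gauss(0.0, 1.0) for _ in range(n)]
        model = mean_model(S)
        for i in range(n):
            S_i = S[:i] + [ghost[i]] + S[i + 1:]  # S with z_i replaced by z_i'
            loss_on_ghost = sq_loss(model, ghost[i])
            gap_sum += (loss_on_ghost - sq_loss(model, S[i])) / n
            delta_sum += (loss_on_ghost - sq_loss(mean_model(S_i), ghost[i])) / n
    return gap_sum / trials, delta_sum / trials

gap, delta = simulate()
print("estimated E[R - R_S]:", gap)
print("estimated Delta:     ", delta)
```

<p>The two estimates agree only up to sampling noise, because the theorem is a statement about expectations and not about individual draws.</p>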
<h3 id="uniform-stability">Uniform stability</h3>
<p>It can be difficult to analyze the expectation in the definition of $\Delta$ precisely. Fortunately, it is often enough to resolve the expectation by upper bounding it with suprema:</p>
<script type="math/tex; mode=display">|\Delta| \le \sup_{S,S'} \sup_{z} \left|\ell(A(S),z)-\ell(A(S'),z)\right|.</script>
<p>The supremum runs over all valid data sets differing in only one element and all valid sample points $z$. This stronger notion of stability, called <em>uniform stability</em>,
goes back to a seminal paper by Bousquet and Elisseeff.</p>
<p>I should say that you can find the above proof in the essential stability paper by Shalev-Shwartz, Shamir, Srebro and Sridharan <a href="http://jmlr.csail.mit.edu/papers/volume11/shalev-shwartz10a/shalev-shwartz10a.pdf">here</a>.</p>
<h3 id="concentration-from-stability">Concentration from stability</h3>
<p>The theorem we saw shows that <em>expected</em> empirical risk equals expected risk up to a correction that involves the stability of the algorithm. Can we also show that empirical risk is close to its expectation with high probability? Interestingly, we can, by appealing to stability once again. I won’t spell out the details, but we can use the <a href="https://en.wikipedia.org/wiki/Doob_martingale#McDiarmid.27s_inequality">method of bounded differences</a> to obtain strong concentration bounds. To apply the method we need a <em>bounded difference</em> condition which is just another word for <em>stability</em>. So, we’re really killing two birds with one stone by using stability not only to show that the first moment of the empirical risk is correct but also that it concentrates. The only wrinkle is that, as far as I know, the weak stability notion expressed by $\Delta$ is not enough to get concentration, but uniform stability (for sufficiently small difference) will do.</p>
<h2 id="applications-of-stability">Applications of stability</h2>
<p>There is much more that stability can do for us. We’ve only scratched the surface. Here are some of the many applications of stability.</p>
<ul>
<li>
<p><a href="http://www.jmlr.org/papers/volume2/bousquet02a/bousquet02a.pdf">Regularization implies stability</a>. Specifically, the minimizer of the empirical risk subject to an $\ell_2$-penalty is uniformly stable.</p>
</li>
<li>
<p><a href="http://arxiv.org/abs/1509.01240">Stochastic gradient descent is stable</a> provided that we don’t make too many steps.</p>
</li>
<li>
<p>Differential privacy is nothing but a strong stability guarantee. Any result ever proved about differential privacy is fundamentally about stability.</p>
</li>
<li>
<p>Differential privacy in turn has applications to preventing overfitting in <a href="http://blog.mrtz.org/2015/12/14/adaptive-data-analysis.html">adaptive data analysis</a>.</p>
</li>
<li>
<p>Stability also has many beautiful applications and connections in statistics. I strongly encourage you to read Bin Yu’s beautiful <a href="https://www.stat.berkeley.edu/~binyu/ps/papers2013/Yu13.pdf">overview paper</a> on the topic.</p>
</li>
</ul>
<p>Looking ahead, I’ve got at least two more posts planned on this.</p>
<p>In my next post I will go into the stability of stochastic gradient descent in detail. We will see a simple argument to show that stochastic gradient descent is uniformly stable. I will then work towards applying these ideas to the area of deep learning. We will see that stability can help us explain why even huge models sometimes generalize well and how we can make them generalize even better.</p>
<p>In a second post I will reflect on stability as a paradigm for reliable machine learning. The focus will be on how ideas from stability can help avoid overfitting and false discovery.</p>
Mon, 14 Mar 2016 08:00:00 +0000
http://blog.mrtz.org/2016/03/14/stability.html
http://blog.mrtz.org/2016/03/14/stability.htmlmachine learningAdaptive data analysis<p>I just returned from <a href="https://nips.cc/Conferences/2015/">NIPS 2015</a>, a joyful week
of corporate parties featuring deep learning themed cocktails, <a href="http://www.nytimes.com/2015/12/12/science/artificial-intelligence-research-center-is-founded-by-silicon-valley-investors.html?_r=0">money
talk</a>,
recruiting events, and some scientific activities on the side. In the latter
category, I co-organized a <a href="http://wadapt.org">workshop on adaptive data
analysis</a> with Vitaly Feldman, Aaron Roth and Adam Smith.</p>
<p>Our workshop responds to an increasingly pressing issue in machine learning and
statistics. The classic view of these fields has it that we choose our method
independently of the data to which we intend to apply the method. For example,
a hypothesis test must be fixed in advance before we see the data. I sometimes
call this <em>static</em> or <em>confirmatory</em> data analysis. You need to know exactly
what you want to do before you collect the data and run your experiment. In
contrast, in practice we typically choose our methods as a function of the data
to which we apply them. In other words, we <em>adapt</em> our method to the data.</p>
<p><img src="/assets/static-adaptive.jpg" alt="static vs adaptive" /></p>
<p>Adaptivity is both powerful and dangerous. Working adaptively gives us greater
flexibility to make unanticipated discoveries as it allows us to execute more
complex analysis work flows. But it can also lead to false discovery and
misleading conclusions far more easily than static data analysis. While there
are many other issues to keep in mind, it is not unreasonable to blame our lack
of understanding adaptivity in part for exacerbating a range of problems from
<a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">false discovery in the empirical
sciences</a>
to <a href="http://blog.mrtz.org/2015/03/09/competition.html">overfitting in machine learning
competitions</a>.</p>
<p>I was hugely impressed with the incredibly exciting group of participants and
audiences from both computer science and statistics that our workshop brought
together. I felt that we made actual progress on beginning to understand the
diverse perspectives on this complex issue. The goal of this post is to convey
my excitement for this emerging research area as I attempt to summarize the
different perspectives we saw.</p>
<h2 id="the-frequentist-statistics-perspective">The frequentist statistics perspective</h2>
<p>Null hypothesis tests are still widely used across the empirical sciences to
gauge the validity of findings. Scientists routinely calculate
<a href="https://en.wikipedia.org/wiki/P-value">p-values</a> with the hope of being able
to reject a null hypothesis and thus claim a “statistically significant”
finding. If we carry out multiple hypothesis tests, we need to adjust our
p-values for the fact that we made multiple tests. A safe way of correcting for
multiple tests is the <a href="https://en.wikipedia.org/wiki/Bonferroni_correction">Bonferroni
correction</a> which amounts
to multiplying all p-values by the number of tests. Computer scientists call this
the union bound. While Bonferroni is safe, it makes discoveries difficult in
the common situation where we have lots of tests and no individual signal is
particularly strong. A more
<a href="https://en.wikipedia.org/wiki/Statistical_power">powerful</a> alternative to
Bonferroni is to control the <a href="https://en.wikipedia.org/wiki/False_discovery_rate">False Discovery
Rate</a> (FDR) proposed in a
celebrated paper by Benjamini and Hochberg. Intuitively, controlling FDR
amounts to putting a bound on the expected ratio of all false discoveries
(number of rejected true null hypotheses) to all discoveries (number of
rejected nulls). The famous Benjamini-Hochberg procedure gives one beautiful
way of doing this.</p>
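<p>To make the contrast between the two corrections concrete, here is a short Python sketch of both. It follows the standard textbook descriptions of the Bonferroni correction and the Benjamini-Hochberg step-up procedure; the function names and the example p-values are made up, and the FDR guarantee for Benjamini-Hochberg relies on assumptions such as independent p-values.</p>

```python
def bonferroni(p_values, alpha=0.05):
    """Reject hypothesis i if its p-value times the number of tests is at most alpha."""
    m = len(p_values)
    return [p * m <= alpha for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure: find the largest rank k with p_(k) <= k * alpha / m,
    then reject the k hypotheses with the smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by increasing p-value
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

# Hypothetical p-values from ten tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print("Bonferroni rejections:        ", sum(bonferroni(p_values)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg(p_values)))
```

<p>On these made-up numbers Bonferroni rejects only the smallest p-value, while Benjamini-Hochberg also rejects the second smallest, which is the sense in which it is the more powerful of the two.</p>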
<h3 id="online-false-discovery-rate">Online False Discovery Rate</h3>
<p>Andrea Montanari and Dean Foster discussed more recent works on <a href="http://arxiv.org/abs/1502.06197">online
variants of FDR</a>. Here, the goal is to control
FDR not at the end of the testing process, but rather at all points along the
way. In particular, the scientist must choose whether or not to reject a
hypothesis at any time point without knowing the outcome of future tests. The
word <em>online</em> should not be confused with <em>adaptive</em>. Although we could in
principle choose hypotheses adaptively in this framework, we still need the
traditional assumptions that all p-values are independent of each other and
distributed the way they should be (i.e., uniform if the null hypothesis is
true). If the selection of hypothesis tests were truly adaptive, these
assumptions are unlikely to be satisfied and hard to verify in any case.</p>
<h3 id="inference-after-selection">Inference after selection</h3>
<p>But what if our hypothesis tests are chosen as a function of the data? For
example, what if we first choose a set of promising data attributes and test
only these attributes for significance? This natural two-step procedure is
called <em>inference after selection</em> in statistics. Rob Tibshirani gave an entire
keynote about this topic. Will Fithian went into further detail in his talk at
our workshop. Several recent works in this area show how we can first perform
variable selection using an algorithm such as
<a href="http://statweb.stanford.edu/~tibs/lasso.html">Lasso</a>, followed by hypothesis
testing on the selected variables. What makes these results possible is a
careful analysis of the distribution of the p-values conditional on the
selection by Lasso. For example, if the test statistic followed a normal
distribution before selection, it will follow a certain truncated normal
distribution after selection. This approach leads to very accurate
characterizations and often tight confidence intervals. However, it has the
shortcoming that we need to commit to a particular selection and inference
procedure as the analysis crucially exploits these.</p>
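<p>The truncated normal effect is easiest to see in a toy setting that is much simpler than Lasso. In the sketch below, which is an illustration and not one of the methods presented at the workshop, we draw $k$ independent standard normal test statistics under the null, select the largest one, and test it. Ignoring the selection step gives a false positive rate near $1 - 0.95^{10} \approx 0.40$ for $k=10$ at level $0.05$. Conditional on being selected, the winning statistic is a standard normal truncated to lie above the runner-up, and using that distribution restores the nominal level. All names in the code are hypothetical.</p>

```python
import math
import random

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rejection_rates(k=10, trials=20000, alpha=0.05, seed=0):
    """False positive rates when testing the largest of k null statistics."""
    rng = random.Random(seed)
    naive_hits = 0
    adjusted_hits = 0
    for _ in range(trials):
        z = sorted(rng.gauss(0.0, 1.0) for _ in range(k))
        z_max, runner_up = z[-1], z[-2]
        tail_max = 1.0 - normal_cdf(z_max)
        # Naive p-value: pretends the statistic was chosen before seeing the data.
        p_naive = tail_max
        # Selective p-value: normal truncated to [runner_up, infinity).
        p_adjusted = tail_max / (1.0 - normal_cdf(runner_up))
        naive_hits += p_naive <= alpha
        adjusted_hits += p_adjusted <= alpha
    return naive_hits / trials, adjusted_hits / trials

naive_rate, adjusted_rate = rejection_rates()
print("naive false positive rate:   ", naive_rate)
print("adjusted false positive rate:", adjusted_rate)
```

<p>The adjustment is tied to this particular selection rule, which mirrors the shortcoming mentioned above: change how the statistic is selected and the conditional distribution has to be worked out again.</p>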
<p>Rina Foygel Barber expanded on the theme of adaptivity and false discovery rate
by showing how to control FDR when we first <a href="http://arxiv.org/abs/1505.07352">order our hypothesis
tests</a> in a data-dependent manner to allow for
making important discoveries sooner.</p>
<h2 id="the-bayesian-view">The Bayesian view</h2>
<p>Andrew Gelman’s talk (see <a href="http://wadapt.org/slides/gelman.pdf">his slides</a>)
contributed many illustrative examples of flawed empirical studies. His
suggested cure for the woes of adaptivity was to do <em>more</em> of it. However, he
advocated <a href="https://en.wikipedia.org/wiki/Bayesian_hierarchical_modeling">Bayesian hierarchical
modeling</a> to
explicitly account for all the possible outcomes of an adaptive study. When
asked if this wouldn’t put too much burden on the side of modeling, he replied
that modeling should be seen as a feature as it forces people to explicitly
discuss the experimental setup.</p>
<h2 id="the-stability-approach">The stability approach</h2>
<p>An intriguing approach to generalization in machine learning is the idea of
<em>stability</em>. An algorithm is <em>stable</em> if its output changes only slightly (in
some formal sense) under an arbitrary substitution of a single input data
point. A seminal work of Bousquet and Elisseeff showed that <a href="http://www.jmlr.org/papers/volume2/bousquet02a/bousquet02a.pdf">stability implies
generalization</a>.
That is, patterns observed by a <em>stable</em> algorithm on a sample must also exist
in the underlying population from which the sample was drawn. Nati Srebro
explained in his talk how stability is a universal property in the sense that
it is both <a href="http://jmlr.csail.mit.edu/papers/volume11/shalev-shwartz10a/shalev-shwartz10a.pdf">sufficient and necessary for
learnability</a>.</p>
<p><img src="/assets/stability.png" alt="stability" /></p>
<p>The issue is that stability by itself doesn’t address adaptivity. Indeed, the
classic works on stability apply to the typical <em>non-adaptive</em> setting of
learning where the sample is independent of the learning algorithm.</p>
<p>Cynthia Dwork talked about recent works that address this shortcoming.
Specifically, differential privacy is a stability notion that applies even in
the setting of adaptive data analysis. Hence, <a href="http://arxiv.org/abs/1411.2664">differential privacy implies
validity in adaptive data analysis</a>. Drawing on
many powerful adaptive data analysis algorithms from differential privacy, this
gives a range of statistically valid tools for adaptive data analysis. See, for
example, my blog post on the <a href="http://googleresearch.blogspot.com/2015/08/the-reusable-holdout-preserving.html">reusable holdout
method</a>
that came out of this line of work.</p>
<h2 id="information-theoretic-measures">Information-theoretic measures</h2>
<p>The intuition behind differential privacy is that if an analysis does not
reveal too much about the specifics of a data set, then it is impossible to
overfit to the data set based on the available information. This suggests an
information-theoretic approach that quantifies the <em>mutual information</em> between
the sample and the output of the algorithm. A <a href="http://arxiv.org/abs/1506.02629">recent
work</a> shows that indeed certain strengthenings
of mutual information prevent overfitting. Being completely general, this
viewpoint allows us, for example, to discuss deterministic algorithms, whereas
all differentially private algorithms are randomized. James Zou discussed the
information-theoretic approach further and mentioned his <a href="http://arxiv.org/abs/1511.05219">recent
work</a> on this topic.</p>
<h2 id="complexity-theoretic-obstructions">Complexity-theoretic obstructions</h2>
<p>In a dramatic turn of events, Jon Ullman told us why <a href="http://arxiv.org/abs/1408.1655">preventing false
discovery in adaptive data analysis can be computationally
intractable</a>.</p>
<p>To understand the result, think of adaptive data analysis for a moment as an
interactive protocol between the data analyst and the algorithm. The algorithm
has access to a sample of size \(n\) and the analyst wants to learn about the
underlying population by asking the algorithm a sequence of adaptively chosen
questions.</p>
<p>What Jon talked about is that there is a computationally efficient adaptive
data analyst that can force any computationally efficient algorithm on a sample
of size \(n\) to return a completely invalid estimate after no more than
\(n^2\) adaptively chosen questions. Non-adaptively, this would require
exponentially many questions. While this is a worst-case hardness result, it
points to a computational barrier for what might have looked like a purely
statistical problem. The \(n^2\) bound turns out to be tight in light of the
differential privacy based positive results I mentioned earlier.</p>
<h2 id="panel-discussion">Panel discussion</h2>
<p>Toward the end of our workshop, we had a fantastic panel discussion on open
problems and conceptual questions. I don’t remember all of it, unfortunately.
Below are some topics that stuck in my mind. If you remember more, please
leave a comment.</p>
<h3 id="exact-modeling-versus-worst-case-assumptions">Exact modeling versus worst-case assumptions</h3>
<p>The approaches we saw cluster around two extremes. One is the case of exact
analysis or modeling, for example, in the work on inference after selection. In
these cases, we are able to exactly analyze the conditional distributions
arising as a result of adaptivity and adjust our methods accordingly. The work
on differential privacy in contrast makes no assumptions on the analyst. Both
approaches have advantages and disadvantages. One is more exact and
quantitatively less wasteful, but applies only to specific procedures. The
other is more general and less restrictive to the analyst, but this generality
also leads to hardness results. A goal for future work is to find middle
ground.</p>
<h3 id="human-adaptivity-versus-algorithmic-adaptivity">Human adaptivity versus algorithmic adaptivity</h3>
<p>There is a curious distinction we need to draw. Algorithms can be adaptive.
<a href="http://blog.mrtz.org/2013/09/07/the-zen-of-gradient-descent.html">Gradient
descent</a>, for
example, is adaptive because it probes the data at a sequence of adaptively
chosen points. Lasso followed by inference is another example of a single
algorithm that exhibits some adaptive behavior. However, adaptivity also arises
through the way humans work adaptively with the data. It even arises when a
group of researchers study the same data set through a sequence of
publications. Each publication builds on previous insights and is hence
adaptive. It is much harder to analyze human adaptivity than algorithmic
adaptivity. How can we even quantify the extent to which this is a problem?
Take <a href="http://yann.lecun.com/exdb/mnist/">MNIST</a>, for example. For almost two
decades this data set has been used as a benchmark with the <em>same</em> test set.
Although no individual algorithm is directly trained against the test set, it
is quite likely that the sequence of proposed algorithms over the years
strongly overfits to the test data. It seems that this is an issue we should
take more seriously.</p>
<h3 id="building-better-intuition-in-practice">Building better intuition in practice</h3>
<p>Any practical application of the algorithms we discussed will likely violate
either the modeling assumptions we make or the quantitative regime to which our
theory applies. This shouldn’t be surprising. Even in non-adaptive data
analysis we make assumptions (such as normal approximations) that are not
completely true in practice. What is different is that we have far less
intuition for what works in practice in the adaptive setting.</p>
<h3 id="whats-next">What’s next?</h3>
<p>After this workshop, I’m even more convinced that adaptive data analysis
is here to stay. I expect lots of work in this area with many new theoretical
insights. I also hope that these developments will lead to new tools that make
the practice of machine learning and statistics more reliable.</p>
Mon, 14 Dec 2015 02:00:00 +0000
http://blog.mrtz.org/2015/12/14/adaptive-data-analysis.html
http://blog.mrtz.org/2015/12/14/adaptive-data-analysis.htmltcsstatisticsNavigate the garden of the forking paths<p>Head on over to the Google Research blog for bleeding edge coverage of
<a href="http://googleresearch.blogspot.com/2015/08/the-reusable-holdout-preserving.html">The reusable holdout: Preserving validity in adaptive data
analysis</a>, a joint work with Cynthia Dwork, Vitaly Feldman, Toni Pitassi,
Aaron Roth and Omer Reingold, to appear in <em>Science</em>, tomorrow.</p>
Thu, 06 Aug 2015 20:55:00 +0000
http://blog.mrtz.org/2015/08/06/reusable.html
http://blog.mrtz.org/2015/08/06/reusable.htmltcsstatisticsfalse discoverydataprivacyTowards practicing differential privacy<p>More than a year ago I wrote an article with the provocative title: <a href="http://blog.mrtz.org/2013/08/21/dp-practical.html">Is
Differential Privacy practical?</a>
The post was essentially one big buildup for
an epic follow-up post that I simply never wrote. Since then dozens
have asked me for an answer to this urgent question. Recently, after the post
hit the front page of Hacker News, half a dozen emails
inquired about the follow-up post that I had promised. Some <a href="https://news.ycombinator.com/item?id=9184479">speculated</a> that owing to <a href="http://en.wikipedia.org/wiki/Betteridge%27s_law_of_headlines">Betteridge’s law of
headlines</a>, the
answer was simply <em>no</em>.</p>
<p>Despite my venerable history of failing on various commitments and my apparent
peace with it, this situation went too far even by my own low standards.
So, I decided to write a not so epic version of that promised blog
post.</p>
<h2 id="the-california-public-utilities-commission">The California Public Utilities Commission</h2>
<p>I’ll arrange my thoughts around the case of the <a href="http://www.cpuc.ca.gov/puc/">California Public
Utilities Commission</a> (CPUC). The CPUC is a
regulatory agency that regulates privately owned public utilities in
California. In recent years there has been political pressure on the utilities
to give third parties access to smart meter data. As discussed in <a href="http://blog.mrtz.org/2013/08/21/dp-practical.html">my previous
post</a>, smart meter data is
of enormous value to many, but comes with serious privacy challenges.</p>
<p>To settle these issues the CPUC organized a major legal proceeding with the
goal of creating rules that provide access to energy usage data to local
government entities, researchers, and state and federal agencies while
establishing procedures that protect the privacy of consumer data.</p>
<p>I served as a privacy expert within the proceeding together with Cynthia Dwork, Lee Tien from the
<a href="http://www.eff.org">EFF</a>, and Jennifer Urban and her team from Berkeley.
Our goal was to inform various parties about the pitfalls
of insufficient privacy mechanisms and to propose better ones. Our proposed
solution focused on differential privacy for the use cases in which it made
sense. There were a number of use cases that the CPUC considered. Not all of
them were well suited for differential privacy to begin with.</p>
<h3 id="a-proposed-decision">A proposed decision</h3>
<p>My involvement with the case ended in 2014 after a <a href="http://docs.cpuc.ca.gov/PublishedDocs/Efile/G000/M088/K947/88947979.PDF">proposed
decision</a>
of the administrative judge. To summarize a 120 page document in one sentence,
the ruling did not endorse differential privacy strongly enough for me to further pursue the case actively.
Nevertheless, there was still significant interest in differential privacy
from some of the utilities. I believe that one utility company
engaged with Microsoft with the goal of building a prototype of a
differentially private solution for their data sharing needs.</p>
<p>The ruling was disappointing from my perspective in that it did not advocate
the use of differential privacy in any of the use cases. Meanwhile it shot
down several use cases, essentially not giving the use
case sponsors meaningful access to energy data at all. In those cases
differential privacy could’ve provided an obviously better trade-off for everyone.</p>
<p>The ruling didn’t so much reflect a technical verdict about differential privacy. Rather it reflected our inability to successfully anticipate and maneuver the highly complex political and legal environment in which the decision was made.</p>
<h3 id="a-post-mortem">A post mortem</h3>
<p>Our proposal based on differential privacy met with resoundingly positive
responses when we first presented it to the administrative judge and various
parties present in the meeting. We did however face bitter opposition from a
group of researchers who sponsored one use case. Those researchers, who had
been working with raw smart meter data in the past, were worried that differential privacy
would create an obstacle for them. We quickly realized that it would be
difficult to agree with them on the extent to which their research practices
are compatible with differential privacy. So, we specifically excluded their
use case from the scope of our proposal focusing on some of the remaining use cases instead. This
didn’t stop the researchers from lobbying relentlessly against differential
privacy. In particular, they filed a last minute comment in which they
attacked differential privacy sharply based on many profound factual misunderstandings
of the privacy notion. Due to the perfect timing of their comment, we were
unable to submit a rebuttal. In the end, I believe this alone was enough for the
administrative judge to conclude that the use of differential privacy was at
present too controversial to be proposed as a solution in the ruling.</p>
<p>My point is not to criticize this group of researchers. I’m sympathetic with
them. They’ve been working with energy data for many years. They’re doing important work which is probably already difficult enough as it is. We respected their
position and did not want to interfere with their research. My guess
is that their research practices are actually largely consistent with what’s
possible under differential privacy, but that’s an entirely separate
discussion.</p>
<p>What’s tragic is that their opposition ended up hurting a consumer advocacy
group who could’ve used differential privacy as a means to gain <em>more</em> access to
energy data than they were able to get in the end (essentially nothing). There was a lot of miscommunication throughout the proceeding that clearly didn’t help. For instance, initially the consumer advocacy group proposed their own ad-hoc privacy solution (which we didn’t support). Only later did we find some common ground. In hindsight, we should’ve agreed on and jointly represented the same solution from the beginning. In my understanding, the use case didn’t require more than the kind of aggregate usage statistics that we could’ve easily produced while preserving differential privacy without any major engineering efforts.</p>
<h2 id="towards-practicing-differential-privacy">Towards practicing differential privacy</h2>
<p>Drawing on my experience with the CPUC case, I want to end with some concrete
suggestions and questions hoping that they will help others when applying
differential privacy. When I speak of “the community”, I will make some very broad generalizations knowing full well that in each instance there are certainly exceptions to what I claim. The discussion below is by no means a survey as it contains very few links to the rich literature on differential privacy. I strongly encourage you to fill in relevant missing pointers in the comments.</p>
<h3 id="focus-on-win-win-applications">Focus on win-win applications</h3>
<p>Apply differential privacy as a tool to provide access to data where
currently access is problematic due to privacy regulations. Don’t fight the
data analyst. Don’t play the moral police. Imagine you are the analyst.</p>
<p>As a privacy expert,
you will find yourself having to shoot down inadequate solutions all the time.
Why can’t we just omit those 18 sensitive attributes like in the HIPAA safe harbor provision?
Why isn’t it safe to release any statistic that is aggregated over at least 15 households in which no
single household contributes 15% of the total number (i.e., the “15/15” rule)?
Such ad-hoc rules sound intuitively appealing to non-experts. Refuting them is time-consuming and makes
you look defensive.</p>
<p>Rather than shooting down what doesn’t work, point out why differential privacy is better than those solutions not just from a privacy perspective but rather from a <em>utility</em> perspective. Unlike these solutions,
differential privacy does not alter your data set at all. In particular, from a statistical perspective
you do not change the distribution from which the data were drawn. This is an incredibly powerful
proposition. I think that data analysis with differential privacy can be vastly more useful than
what you get after applying, for instance, the HIPAA safe harbor mechanism.</p>
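<p>To make the “does not alter your data set” point concrete, here is a minimal Python sketch of the Laplace mechanism answering a single aggregate query. The function names, the clipping bounds and the synthetic usage numbers are illustrative assumptions, not anything taken from the CPUC proceeding. The sampler is naive floating-point code, so it is exposed to the floating point vulnerabilities mentioned later in this post.</p>

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_mean(values, lower, upper, epsilon, rng):
    """Release the mean of `values` with epsilon-differential privacy.

    Assumptions: the number of records is public, and two data sets are
    neighbors if they differ in one record. After clipping to
    [lower, upper], changing one record moves the mean by at most
    (upper - lower) / len(values), which is the sensitivity used below.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / len(clipped)
    exact = sum(clipped) / len(clipped)
    return exact + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
# Hypothetical daily usage (kWh) for 10,000 households.
usage = [rng.uniform(5.0, 60.0) for _ in range(10_000)]
true_mean = sum(usage) / len(usage)
released = private_mean(usage, lower=0.0, upper=100.0, epsilon=0.1, rng=rng)
print(f"true mean:     {true_mean:.3f}")
print(f"released mean: {released:.3f}")
```

<p>The noise is added to the released answer only; the records in <code>usage</code> are never modified, suppressed or coarsened, which is the contrast with attribute-removal rules drawn above.</p>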
<p>My point is that there are many “win/win” applications of differential privacy where it simultaneously can give better utility and better privacy than its alternatives. As the CPUC case showed, sometimes the choice is even between no access at all and differentially private access. It’s really a no-brainer. We should start with such applications instead of arguing about completely unrestricted access versus differentially private access.</p>
<h3 id="dont-empower-the-naysayers">Don’t empower the naysayers</h3>
<p>In my opinion, for differential privacy to be a major success in practice it would suffice for it to succeed in <em>some</em> applications, certainly not in <em>all</em> and not even in most. There’s a culture of criticizing differential privacy based on the perfectly correct observation that some differentially private algorithm (say, the Laplace mechanism) didn’t give enough utility in some application. These kinds of observations, valid as they may be, say very little about the potential of differential privacy in practice. First of all, they only evaluate one algorithm while there could be much better algorithms. Second, they commit to one specific application and, more importantly, one particular modeling of the problem. Perhaps there’s a different approach to the same problem that’s more compatible with differential privacy. It’s simply impossible to rule out differential privacy as a solution through these kinds of straw man experiments.</p>
<p>The differential privacy community is partly to blame for empowering the naysayers, since it has advertised differential privacy as a <em>universal</em> solution concept to the privacy problem. This is theoretically true in some sense, but the situation in practice is much more delicate. So, stop feeding the naysayers. Start presenting differential privacy as a promising technology for <em>some</em> applications but certainly not <em>all</em>.</p>
<h3 id="change-your-narrative">Change your narrative</h3>
<p>Don’t present differential privacy as a fear-inducing crypto hammer
designed to obfuscate data access. That’s not what it is.
Differential privacy is a rigorous way of doing machine learning, not a way of
preventing machine learning from being done. We understand perfectly well now
that differential privacy is a stability guarantee which is fundamentally
aligned with the central goal of statistics, namely, to learn from data about the population
as a whole and not about specific individuals. This understanding perhaps wasn’t quite there
in the beginning, but it is now. Academics should from time to time come up
with a new page 1 for their papers.</p>
<h3 id="build-reliable-code-repositories">Build reliable code repositories</h3>
<p>A weakness of the differential privacy community has been the scarcity of
available high-quality code. There are many academic code pieces available
by emailing someone, but we don’t have many visible repositories on GitHub
or elsewhere that provide robust implementations of common differentially
private algorithms. Frank McSherry’s <a href="http://research.microsoft.com/en-us/projects/pinq/">PINQ</a> was a really wonderful step in the right direction,
but it is no longer maintained and by now out of date. Written in C#, it hasn’t been easy for many to build on and extend PINQ. A more recent notable effort is the <a href="https://github.com/ejgallego/dualquery">Dual Query</a> code though it requires CPLEX to run.</p>
<p>What scares me a bit is that even a project as solidly designed and carefully executed as PINQ
did not address low-level implementation issues such as <a href="http://dl.acm.org/citation.cfm?id=2382264">floating point vulnerabilities</a> in differential privacy.</p>
<p>I’m guilty myself. Many have used or tried to use <a href="http://papers.nips.cc/paper/4548-a-simple-and-practical-algorithm-for-differentially-private-data-release.pdf">MWEM</a>, an algorithm Katrina Ligett, Frank McSherry and I presented at NIPS a few years ago. Yet we don’t have a great implementation publicly available. You can email us for a decent C# implementation (alas!), but instead a lot of people have
produced their own implementations of our algorithm over the years. I regularly have the urge to start an open source project for it, but then I realize it’s a bit of a bottomless pit. In order to have a solid implementation of MWEM, I’d first need to have a solid implementation of all the primitives with all the low-level issues that come up. In any case, if somebody braver than me took the first step on an open source effort (preferably not in C#), I’d be very eager to contribute.</p>
<p>Taking a more modest step, I feel compelled to compile a list of available code repositories.
If you have any pointers, please leave a comment!</p>
<h3 id="be-less-general-and-more-domain-specific">Be less general and more domain-specific</h3>
<p>Much of the academic research on differential privacy has focused on generality. That makes sense theoretically, but it means that reading the scientific literature on differential privacy from the point of
view of a domain expert can be very frustrating. Most papers start with toy
examples that make perfect sense on a theoretical level, but will appear
alarmingly naïve to a domain expert.</p>
<p>The community is at a point where we need to transition <em>from generality to specificity</em>.
For example, what’s needed are domain-specific tutorials
that walk practitioners through real examples. One reason why such
tutorials don’t exist is that they take a lot of time and writing them isn’t
incentivized by academia. One way out of this is for journal editors and
conferences to specifically invite such tutorials. Similarly, the community should at this point have very high regard for positive results and case studies in specific application domains even if they are limited in scope and don’t contribute technically new solutions.</p>
<h3 id="be-more-entrepreneurial">Be more entrepreneurial</h3>
<p>The CPUC case highlighted that the application of differential privacy in
practice can fail as a result of many non-technical issues. These important
issues are often not on the radar of academic researchers. We spent an awful
lot of time talking about the technical strengths or limitations of
differential privacy, while missing out on some very real challenges. It’s quite reasonable to argue
that these challenges should be outside the scope of academia. On the other hand, academics are currently the
only available experts on differential privacy and there’s obvious demand for it.
Where should we draw the line?</p>
<p>To be blunt, I think an important ingredient that’s missing in the current differential
privacy ecosystem is <em>money</em>. There is only so much that academic researchers can do to promote a technology.
Beyond a certain point businesses have to commercialize the technology for it to be successful. The CPUC
case was much better suited as the full-time job for a group of paid professionals
rather than a volunteering effort. I’m surprised none of the researchers working on differential privacy
have devoted a sabbatical to running a privacy startup. It’s needed and the potential upside is big.
Why not give it a shot? I hear tenured jobs are meant for running startups.</p>
<h3 id="so-is-differential-privacy-practical">So, is differential privacy practical?</h3>
<p>I like the answer Aaron Roth gave when I asked him:</p>
<div style="text-align:center;">
<em>It's within striking distance.</em>
</div>
Fri, 13 Mar 2015 22:30:00 +0000
http://blog.mrtz.org/2015/03/13/practicing-differential-privacy.html
tcs, differential privacy, practice
Competing in a data science contest without reading the data
<p>Machine learning competitions have become an extremely popular format for
solving prediction and classification problems of all sorts. The most famous
example is perhaps the Netflix prize. An even better example is
<a href="http://www.kaggle.com">Kaggle</a>, an awesome startup that’s
organized more than a hundred competitions over the past few years.</p>
<p>The central component of any competition is the public leaderboard. Competitors can repeatedly submit a list of predictions and see how their predictions perform on a set of <em>holdout labels</em> not available to them. The leaderboard ranks all teams according to their prediction accuracy on the holdout labels. Once the competition closes, all teams are scored on a final test set not used so far. The resulting ranking, called the private leaderboard, determines the winner.</p>
<p><img src="/assets/heritage-pub.jpg" alt="Heritage Prize public leaderboard" /></p>
<div style="text-align:center;margin-bottom:10px">
Public leaderboard of the Heritage Health Prize (<a href="http://www.heritagehealthprize.com/c/hhp/leaderboard/public">Source</a>)
</div>
<p>In this post, I will describe a method to climb the public leaderboard <em>without even looking at the data</em>. The algorithm is so simple and natural that an unwitting analyst might just run it. We will see that in Kaggle’s famous Heritage Health Prize competition this might have propelled a participant from rank around 150 into the top 10 on the public leaderboard without making progress on the actual problem. The Heritage Health Prize competition ran for two years and had a prize pool of 3 million dollars. Keep in mind though that the standings on the public leaderboard do not affect who gets the money.</p>
<p>The point of this post is to illustrate why maintaining a leaderboard that accurately reflects the true performance of each team is a difficult and deep problem. While there are decades of work on estimating the true performance of a model (or set of models) from a finite sample, the leaderboard application highlights some
challenges that, while fundamental, have only recently seen increased attention. A follow-up post will describe a <a href="http://arxiv.org/abs/1502.04585">recent paper</a> with Avrim Blum that gives an algorithm for maintaining a (provably) accurate public leaderboard.</p>
<p>Let me be very clear that my point is <em>not</em> to criticize Kaggle or anyone else organizing machine learning competitions. On the contrary, I’m amazed by how well Kaggle competitions work. In my opinion, they have contributed a tremendous amount of value to both industry and education. I also know that Kaggle has some very smart people thinking hard about how to anticipate problems with competitions.</p>
<h2 id="the-kaggle-leaderboard-mechanism">The Kaggle leaderboard mechanism</h2>
<p>At first sight, the Kaggle mechanism looks like the classic <em>holdout method</em>. Kaggle partitions the data into two sets: a training set and a holdout set. The training set is publicly available with both the individual instances and their corresponding class labels. The instances of the holdout set are publicly available as well, but the class labels are withheld. Predicting these missing class labels is the goal of the participant and a valid submission is a list of labels—one for each point in the holdout set.</p>
<p>Kaggle specifies a score function that maps a submission consisting of \(N\) labels to a numerical score, which we assume to be in \([0,1]\). Think of the score as prediction error (smaller is better). For concreteness, let’s fix it to be the <em>misclassification rate</em>. That is, a prediction incurs loss 0 if it matches the corresponding unknown label and loss 1 if it does not. We divide the total loss by the number of predictions to get a score in \([0,1]\).</p>
<p>Kaggle further splits its \(N\) private labels randomly into \(n\) holdout labels and \(N-n\) test labels. Typically, \(n=0.3N\). The public leaderboard is a sorting of all teams according to their score computed only on the \(n\) holdout labels (without using the test labels), while the private leaderboard is the ranking induced by the test labels. I will let \(s_H(y)\) denote the public score of a submission \(y\), i.e., the score according to the public leaderboard. Typically, Kaggle rounds all scores to 5 or 6 digits of precision.</p>
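<p>The mechanism described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Kaggle’s actual implementation: the function names, the tiny \(N=10\) and the default of 5 rounding digits are made up for illustration.</p>

```python
import random

def misclassification_rate(predictions, labels, indices, digits=5):
    """Fraction of positions in `indices` where the prediction is wrong,
    rounded to `digits` decimal places like a leaderboard score."""
    errors = sum(1 for i in indices if predictions[i] != labels[i])
    return round(errors / len(indices), digits)

def split_private_labels(N, holdout_fraction, rng):
    """Randomly split positions 0..N-1 into n holdout and N-n test positions."""
    indices = list(range(N))
    rng.shuffle(indices)
    n = round(holdout_fraction * N)
    return indices[:n], indices[n:]

rng = random.Random(0)
N = 10
labels = [rng.randint(0, 1) for _ in range(N)]      # hidden from participants
holdout, test = split_private_labels(N, 0.3, rng)   # n = 0.3 N
submission = [0] * N                                # one participant's guess

public_score = misclassification_rate(submission, labels, holdout)
private_score = misclassification_rate(submission, labels, test)
print("public score: ", public_score)
print("private score:", private_score)
```

<p>A participant only ever observes <code>public_score</code>; <code>private_score</code> is computed once, after the competition closes.</p>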
<h2 id="the-cautionary-tale-of-wacky-boosting">The cautionary tale of wacky boosting</h2>
<p>Imagine your humble blogger in a parallel universe: I’m new to this whole machine learning craze. So, I sign up for a Kaggle competition to get some skills. Kaggle tells me that there’s an unknown set of labels \(y\in\{0,1\}^N\) that I need to predict. Well, I know nothing about \(y\). So here’s what I’m going to do. I try out a bunch of random vectors and keep all those that give me a slightly better than expected score. If we’re talking about misclassification rate, the expected score of a random binary vector is 0.5. So, I’m keeping all the vectors with score less than 0.5. Then I recall something about boosting. It tells me that I can boost my accuracy by aggregating all predictors into a single predictor using the majority function. Slightly more formally, here’s what I do:</p>
<p><strong>Algorithm</strong> (Wacky Boosting):</p>
<ol>
<li>Choose \(y_1,\dots,y_k\in\{0,1\}^N\) uniformly at random.</li>
<li>Let \(I = \{ i\in[k] \colon s_H(y_i) < 0.5 \}\).</li>
<li>Output \(\hat y=\mathrm{majority} \{ y_i \colon i \in I \} \), where the majority is component-wise.</li>
</ol>
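<p>For readers who want to try the three steps above, here is a small self-contained Python simulation. It is a sketch under assumptions of my choosing: the labels are uniformly random, the sizes \(N=2000\), \(n=600\), \(k=300\) are arbitrary, ties in the majority vote go to 0, and the score is rounded to 5 digits to mimic a leaderboard.</p>

```python
import random

def score_on(y, labels, indices, digits=5):
    """Misclassification rate of y on the given positions, rounded."""
    errors = sum(1 for i in indices if y[i] != labels[i])
    return round(errors / len(indices), digits)

def wacky_boosting(N, k, s_H, rng):
    """Steps 1-3 of the algorithm; `s_H` is the only access to the labels."""
    # Step 1: k uniformly random binary vectors.
    candidates = [[rng.randint(0, 1) for _ in range(N)] for _ in range(k)]
    # Step 2: keep the vectors that score better than random guessing.
    selected = [y for y in candidates if s_H(y) < 0.5]
    # Step 3: component-wise majority vote (ties go to 0).
    return [1 if 2 * sum(y[j] for y in selected) > len(selected) else 0
            for j in range(N)]

rng = random.Random(1)
N, n, k = 2000, 600, 300
labels = [rng.randint(0, 1) for _ in range(N)]
holdout = range(n)      # labels are i.i.d., so a fixed split is as good as a random one
test = range(n, N)

y_hat = wacky_boosting(N, k, lambda y: score_on(y, labels, holdout), rng)
public = score_on(y_hat, labels, holdout)
private = score_on(y_hat, labels, test)
print(f"public score:  {public:.3f}")
print(f"private score: {private:.3f}")
```

<p>The algorithm never reads <code>labels</code> directly. Everything it learns arrives through the \(k\) public scores.</p>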
<p>Lo and behold, this is what happens:</p>
<div style="text-align:center">
<object data="/assets/boosting.svg" type="image/svg+xml">
<param name="src" value="/assets/boosting.svg" />
<img src="/assets/boosting.png" />
</object>
<p style="text-align:center">In this plot, \(n=4000\) and all numbers are averaged over 5 independent repetitions.</p>
</div>
<p>As I’m only seeing the public score (bottom red line), I get super excited. I keep climbing the leaderboard! Who would’ve thought that this machine learning thing was so easy? So, I go write a blog post on Medium about Big Data and score a job at DeepCompeting.ly, the latest data science startup in the city. Life is pretty sweet. I pick up indoor rock climbing, sign up for woodworking classes; I read Proust and books about espresso. Two months later the competition closes and Kaggle releases the final score. What an embarrassment! Wacky boosting did nothing whatsoever on the final test set. I get fired from DeepCompeting.ly days before the buyout. My spouse dumps me. The lease expires. I get evicted from my apartment in the Mission. Inevitably, I hike the Pacific Crest Trail and write a novel about it.</p>
<h3 id="what-just-happened">What just happened</h3>
<p>Let’s understand what went wrong and how you can avoid hiking the Pacific Crest Trail. To start out with, each \(y_i\) has loss around \(1/2\pm1/\sqrt{n}\). We’re selecting the ones whose loss falls below a half. This selection introduces a bias in the score: the conditional expected loss of each selected vector \(y_i\) is roughly \(1/2-c/\sqrt{n}\) for some constant \(c>0\). Put differently, each selected \(y_i\) is giving us a guess about each label in the unknown holdout set \(H\subseteq [N]\) that’s correct with probability \(1/2 + \Omega(1/\sqrt{n})\). Since the public score doesn’t depend on labels outside of \(H\), the conditioning does not affect the final test set. The labels outside of \(H\) are still unbiased. Finally, we need to argue that the majority vote “boosts” our slightly biased coin tosses into a stronger bias. More formally, we can show that \(\hat y\) gives us a guess for each label in \(H\) that’s correct with probability
\[
\frac12 + \Omega\left(\sqrt{k/n}\right).
\]
Hence, the public score of \(\hat y\) satisfies
\[
s_H(\hat y) < \frac12 - \Omega\left(\sqrt{k/n}\right).
\]
Outside of \(H\), however, we’re just random guessing with no advantage.
To summarize, wacky boosting gives us <em>a bias of \(\sqrt{k}\) standard deviations on the public score with \(k\) submissions</em>.</p>
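<p>Here is a heuristic sketch of where the \(\sqrt{k/n}\) comes from. It is a back-of-the-envelope calculation rather than a proof: it treats the selected votes on a fixed holdout label as independent, and the symbol \(m=|I|\approx k/2\) for the number of selected vectors is notation introduced only for this sketch. Each of the \(m\) votes is correct with probability about \(1/2+c/\sqrt{n}\), so the number of correct votes has mean \(m/2+mc/\sqrt{n}\) and standard deviation about \(\sqrt{m}/2\). By a normal approximation, with \(\Phi\) the standard normal distribution function,
\[
\Pr\left[\hat y_j \text{ is correct}\right]
\approx \Phi\left(\frac{mc/\sqrt{n}}{\sqrt{m}/2}\right)
= \Phi\left(2c\sqrt{m/n}\right)
\approx \frac12 + \frac{2c}{\sqrt{2\pi}}\sqrt{\frac{m}{n}},
\]
where the last step uses \(\Phi(z)\approx 1/2+z/\sqrt{2\pi}\) for small \(z\). With \(m\approx k/2\) this is \(1/2+\Omega(\sqrt{k/n})\) as long as \(k\) is not much larger than \(n\). Since a single score has standard deviation of order \(1/\sqrt{n}\), an advantage of order \(\sqrt{k/n}\) amounts to about \(\sqrt{k}\) standard deviations.</p>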
<p>What’s important is that the same algorithm still “works” even if we don’t get exact answers. All we need are answers that are accurate to an additive error of \(1/\sqrt{n}\). This is important since Kaggle rounds its answers to 5 digits of precision. In particular, this attack will work so long as \(n< 10^{10}\).</p>
<h3 id="why-the-holdout-method-breaks-down">Why the holdout method breaks down</h3>
<p>The idea behind the holdout method is that the holdout data serve as a fresh sample providing an unbiased and well-concentrated estimate of the true loss of the classifier on the underlying distribution. Why then didn’t the holdout method detect that our wacky boosting algorithm was overfitting? The short answer is that the holdout method is simply not valid in the way it’s used in a competition.</p>
<p>One point of departure from the classic method is that the participants actually do see the data points corresponding to holdout labels, which can lead to some problems. But that’s not the issue here: even if we don’t look at the holdout data points at all, there’s a fundamental reason why the validity of the classic holdout method breaks down.</p>
<p>The problem is that a submission in general incorporates information about the holdout labels previously released through the leaderboard mechanism. As a result, <strong>there is a statistical dependence between the holdout data and the submission</strong>. Due to this feedback loop, the public score is in general no longer an unbiased estimate of the true score. There is no reason not to expect the submissions to eventually overfit to the holdout set.</p>
<p>The problem of overfitting to the holdout set is well known. Kaggle’s forums are full of anecdotal evidence reported by various competitors. The primary way Kaggle deals with this problem is by limiting the rate of re-submission and (to some extent) the bit precision of the answers. Of course, this is also the reason why the winners are determined on a separate test set.</p>
<h3 id="static-vs-interactive-data-analysis">Static vs interactive data analysis</h3>
<p>Kaggle’s liberal use of the holdout method is just one example of a widespread disconnect between the theory of <strong>static data analysis</strong> and the practice of <strong>interactive data analysis</strong>. The holdout method is a static method in that it assumes the model to be independent of the holdout data on which it is evaluated. However, machine learning competitions are interactive, because submissions generally incorporate information from the holdout set.</p>
<p><img src="/assets/staticvsint.jpg" alt="Static vs Interactive" /></p>
<p>I contend that most real world data analysis is interactive. Unfortunately, most of the theory on model validation and statistical estimation falls into the static setting requiring independence between method and holdout data. This divide is <em>not</em> inherent though. Indeed, my next post deals with some useful theory for the interactive setting.</p>
<h2 id="the-heritage-health-prize-leaderboard">The Heritage Health Prize leaderboard</h2>
<p>Let’s see how this could’ve been applied to an actual competition. Of course, the <a href="http://www.heritagehealthprize.com/c/hhp">Heritage Health Prize</a> competition is long over. We’re about two years too late to the party. Besides, we don’t have the solution file for that competition. Without it there’s no sure way of knowing how well this approach would’ve worked. Nevertheless, we can build a reasonable model of what the holdout labels might look like using information that was released by Kaggle and see how well we’d be doing against that random model.</p>
<h3 id="generalized-wacky-boosting">Generalized wacky boosting</h3>
<p>Before we can apply wacky boosting to the Heritage prize, we need to clear two obstacles.
First, wacky boosting required the domain to be Boolean, whereas the Heritage labels can be arbitrary positive real numbers. Second, the algorithm only gave an advantage over random guessing, which might be too far from the top of the leaderboard to start out with. It turns out that both of these issues can be resolved nicely with a simple generalization of the previous algorithm. What was really happening in the algorithm is that we had two candidate solutions, the all ones vector and the all zeros vector, and we tried out random coordinate-wise combinations of these vectors. The algorithm ends up finding a coordinate-wise combination of the two vectors that improves upon their mean loss, i.e., one half. This way of looking at it generalizes nicely. The resulting algorithm is just a few lines of Julia code.</p>
<figure class="highlight"><pre><code class="language-julia" data-lang="julia"><span class="k">using</span> <span class="n">Statistics</span>  <span class="c"># provides mean in Julia 1.x</span>
<span class="c"># select coordinate from v1 where v is 1 and from v2 where v is 0</span>
<span class="n">combine</span><span class="x">(</span><span class="n">v</span><span class="x">,</span><span class="n">v1</span><span class="x">,</span><span class="n">v2</span><span class="x">)</span> <span class="o">=</span> <span class="n">v1</span> <span class="o">.*</span> <span class="n">v</span> <span class="o">+</span> <span class="n">v2</span> <span class="o">.*</span> <span class="x">(</span><span class="mi">1</span> <span class="o">.-</span> <span class="n">v</span><span class="x">)</span>
<span class="k">function</span><span class="nf"> wackyboost</span><span class="x">(</span><span class="n">v1</span><span class="x">,</span><span class="n">v2</span><span class="x">,</span><span class="n">k</span><span class="x">,</span><span class="n">score</span><span class="x">)</span>
    <span class="n">m</span> <span class="o">=</span> <span class="n">mean</span><span class="x">([</span><span class="n">score</span><span class="x">(</span><span class="n">v1</span><span class="x">),</span><span class="n">score</span><span class="x">(</span><span class="n">v2</span><span class="x">)])</span>
    <span class="c"># each column of A is a random 0/1 mask over the coordinates</span>
    <span class="n">A</span> <span class="o">=</span> <span class="n">rand</span><span class="x">(</span><span class="mi">0</span><span class="x">:</span><span class="mi">1</span><span class="x">,(</span><span class="n">length</span><span class="x">(</span><span class="n">v1</span><span class="x">),</span><span class="n">k</span><span class="x">))</span>
    <span class="c"># select columns of A that give better than mean score</span>
    <span class="n">a</span> <span class="o">=</span> <span class="n">filter</span><span class="x">(</span><span class="n">i</span> <span class="o">-></span> <span class="n">score</span><span class="x">(</span><span class="n">combine</span><span class="x">(</span><span class="n">A</span><span class="x">[:,</span><span class="n">i</span><span class="x">],</span><span class="n">v1</span><span class="x">,</span><span class="n">v2</span><span class="x">))</span> <span class="o"><</span> <span class="n">m</span><span class="x">,</span> <span class="mi">1</span><span class="x">:</span><span class="n">k</span><span class="x">)</span>
    <span class="c"># take majority vote over all selected columns</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">float</span><span class="x">(</span><span class="n">A</span><span class="x">[:,</span><span class="n">a</span><span class="x">]</span> <span class="o">*</span> <span class="n">ones</span><span class="x">(</span><span class="n">length</span><span class="x">(</span><span class="n">a</span><span class="x">))</span> <span class="o">.></span> <span class="n">length</span><span class="x">(</span><span class="n">a</span><span class="x">)</span><span class="o">/</span><span class="mi">2</span><span class="x">)</span>
    <span class="k">return</span> <span class="n">combine</span><span class="x">(</span><span class="n">v</span><span class="x">,</span><span class="n">v1</span><span class="x">,</span><span class="n">v2</span><span class="x">)</span>
<span class="k">end</span></code></pre></figure>
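<p>The listing above comes without a usage example, so here is a hedged one: a Python port of the same idea run on synthetic data. Everything about the setup is an assumption made for illustration (real-valued labels drawn uniformly, two candidate submissions that are the truth plus Gaussian noise, mean squared error as the score, and the sizes \(N=1500\), \(n=500\), \(k=300\)); it is not the modeling used for the Heritage numbers below.</p>

```python
import random

def combine(v, v1, v2):
    """Take the coordinate from v1 where v is 1 and from v2 where v is 0."""
    return [a if bit else b for bit, a, b in zip(v, v1, v2)]

def wackyboost(v1, v2, k, score, rng):
    m = (score(v1) + score(v2)) / 2
    masks = [[rng.randint(0, 1) for _ in v1] for _ in range(k)]
    # keep the masks whose combination beats the mean score
    good = [mask for mask in masks if score(combine(mask, v1, v2)) < m]
    # component-wise majority vote over the selected masks (ties go to 0)
    vote = [1 if 2 * sum(mask[j] for mask in good) > len(good) else 0
            for j in range(len(v1))]
    return combine(vote, v1, v2)

rng = random.Random(2)
N, n, k = 1500, 500, 300
truth = [rng.uniform(0.0, 15.0) for _ in range(N)]   # hidden labels
v1 = [t + rng.gauss(0.0, 1.0) for t in truth]        # candidate submission 1
v2 = [t + rng.gauss(0.0, 1.0) for t in truth]        # candidate submission 2
holdout, test = range(n), range(n, N)

def mse(pred, indices):
    return sum((pred[i] - truth[i]) ** 2 for i in indices) / len(indices)

def public(pred):
    """The only feedback the algorithm gets: the score on the holdout labels."""
    return mse(pred, holdout)

boosted = wackyboost(v1, v2, k, public, rng)
public_before = (public(v1) + public(v2)) / 2
private_before = (mse(v1, test) + mse(v2, test)) / 2
print(f"public:  {public_before:.3f} -> {public(boosted):.3f}")
print(f"private: {private_before:.3f} -> {mse(boosted, test):.3f}")
```

<p>Mean squared error is used here because it averages per-coordinate losses, so a random mask scores exactly the mean of the two candidates in expectation. With a score that is not such an average, the threshold <code>m</code> may need adjusting.</p>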
<p>I worked through the fun exercise of applying this algorithm in a separate <a href="http://nbviewer.ipython.org/gist/mrtzh/c41fd4c5897fc114a0d6">Julia notebook</a>. Here’s one picture that came out of it. Don’t treat the numbers as definitive as they all depend on the modeling assumptions I made.</p>
<div style="text-align:center">
<object data="/assets/heritage.svg" type="image/svg+xml">
<img src="/assets/heritage.png" />
</object>
<p style="text-align:center">We see an improvement from 0.462311 (rank 146) to 0.451868 (rank 6).</p>
</div>
<p>The bottom line is: It seems to work reasonably well (under various semi-principled modeling assumptions I made). From the looks of it this might have given you an improvement <em>from rank 150ish to 6ish</em> within 700 submissions. Note there was a single team with 671 submissions. There’s a pretty good gap between number one on the <a href="http://www.heritagehealthprize.com/c/hhp/leaderboard/public">public leaderboard</a> and the rest. Closing that gap is possible in principle, but it took me a bunch more submissions to get to the top. I should say though that I used the completely generic code from above without any optimizations specific to the competition. I didn’t even look at the data points (as I don’t have them). It’s possible that using the data and domain knowledge could improve things much further. I chose the Heritage Health Prize because it had the largest prize of any Kaggle competition ever (3 million dollars) and it ran for two years with a substantial number of submissions.</p>
<h2 id="how-robust-is-your-benchmark">How robust is your benchmark?</h2>
<p>There’s a broad lesson to be learned from this example. As computer scientists we love numerical benchmarks and rankings of all sorts. They look so objective and scientific that we easily forget how any benchmark is just a proxy for a more complex question. Every once in a while we should step back and ask: How robust is the benchmark? Do improvements in our benchmark really correspond to progress on the original problem? What I’d love to see in all empirical branches of computer science are adversarial robustness evaluations of various benchmarks. How far can we get by <em>gaming</em> rather than by actually making progress towards solving the problem?</p>
<p>Let me end on a positive note. What excites me is that the serious problems we saw in this post actually do have a fix (both in theory and in practice)! So, stay tuned for my next post.</p>
<p><em>Subscribe to the <a href="http://blog.mrtz.org/feed.xml">RSS feed</a>
or follow me on <a href="https://www.twitter.com/mrtz">Twitter</a>.</em></p>
Mon, 09 Mar 2015 17:30:00 +0000
http://blog.mrtz.org/2015/03/09/competition.html
tcs
Goodbye Wordpress, never again
<p>Some may have noticed that this blog was an utter mess for about two weeks.
This was due to an exploited vulnerability in Wordpress leaving my blog
in a state most curiously difficult to recover from. The only thing not
corrupted by this hack was an SQL database dump of my blog
that turned out to be inconsistent with a fresh Wordpress install.</p>
<p>The Wordpress support team thankfully pointed me to a number of videos
carefully explaining how to move my mouse
cursor from one point to another. I feel a lot more confident now with the mouse
cursor. In the
end, I more or less manually moved my blog to <a href="http://jekyllrb.com">Jekyll</a> hosted by
<a href="https://pages.github.com">GitHub</a> with <a href="http://disqus.com">Disqus</a> comments.
It’s a pretty decent combination that a lot of people seem to be transitioning
to.</p>
<p>If you ever think about starting a blog, <em>do not</em> go with Wordpress.
If you absolutely
cannot resist the temptation, by all means do not maintain your own Wordpress
install.</p>
<blockquote>
<p>Wordpress is slow, clunky and insecure.</p>
</blockquote>
<p>I had to install updates
multiple times a month and even that was evidently not enough to stay on top.</p>
<p>On a positive note, I appear to be on paternity leave right now—for another
five weeks or so—and I hope to write a blog post or two in those moments
when I’m not spending quality time with my bicycles, ehm, and my daughter, of
course.</p>
<p><em>To stay on top of future posts, subscribe to the (new) <a href="http://blog.mrtz.org/feed.xml">RSS feed</a>
or follow me on <a href="https://www.twitter.com/mrtz">Twitter</a>.</em></p>
Sun, 08 Feb 2015 19:45:39 +0000
http://blog.mrtz.org/2015/02/08/wordpress.html
tcs
The NIPS experiment
<p><em>This is a guest post by <a href="http://cs.utexas.edu/~ecprice/">Eric Price</a>.</em></p>
<p>I was at NIPS (one of the two main machine learning conferences) in Montreal last week, which has a really advanced format relative to the theory conferences. The double blind reviewing, rebuttal phase, and poster+lightning talk system all seem like improvements on the standard in my normal area (theoretical computer science), and having 2400 attendees is impressive and overwhelming. But the most amazing thing about the conference organization this year was the <a href="http://inverseprobability.com/2014/12/16/the-nips-experiment/">NIPS consistency experiment</a>.</p>
<p>A perennial question for academics is how accurate the conference review and acceptance process is. Getting papers into top conferences is hugely important for our careers, yet we all have papers rejected that we think should have gotten in. One of my papers was rejected three times before getting into SODA – as the best student paper. After rejections, we console ourselves that the reviewing process is random; yet we take acceptances as confirmation that our papers are good. So just how random <em>is</em> the reviewing process? The NIPS organizers decided to find out.</p>
<h2 id="the-nips-experiment">The NIPS Experiment</h2>
<p>The NIPS consistency experiment was an amazing, courageous move by the organizers this year to quantify the randomness in the review process. They split the program committee down the middle, effectively forming two independent program committees. Most submitted papers were assigned to a single side, but 10% of submissions (166) were reviewed by <em>both</em> halves of the committee. This let them observe how consistent the two committees were on which papers to accept. (For fairness, they ultimately accepted any paper that was accepted by either committee.)</p>
<p>The results were revealed this week: of the 166 papers, the two committees disagreed on the fates of 25.9% of them: 43. [Update: the original post said 42 here, but I misremembered.] But this “25%” number is misleading, and most people I’ve talked to have misunderstood it: it actually means that the two committees <em>disagreed more than they agreed</em> on which papers to accept. Let me explain.</p>
<p>The two committees were each tasked with a 22.5% acceptance rate. This would mean choosing about 37 or 38 of the 166 papers to accept<sup><a href="#footnotes">1</a></sup>. Since they disagreed on 43 papers total, this means one committee accepted 21 papers that the other committee rejected and the other committee accepted 22 papers the first rejected, for 21 + 22 = 43 total papers with different outcomes. Since they accepted 37 or 38 papers, this means they disagreed on 21/37 or 22/38 ≈ 57% of the list of accepted papers.</p>
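<p>The arithmetic in the paragraph above is easy to check. The snippet below only recomputes numbers already stated in the text; the variable names are mine.</p>

```python
papers = 166
acceptance_rate = 0.225
disagreements = 43

# Each committee accepts about 37 or 38 of the shared papers.
accepted_per_committee = papers * acceptance_rate

# Both committees accept about the same number of papers, so the
# disagreements split almost evenly between the two directions.
split = (disagreements // 2, disagreements - disagreements // 2)

# Fraction of each committee's accept list that the other one rejected.
fractions = (split[0] / 37, split[1] / 38)

print(f"accepted per committee: about {accepted_per_committee:.2f}")
print(f"overall disagreement:   {disagreements / papers:.3f}")
print(f"split of disagreements: {split}")
print(f"rejected by the other:  {fractions[0]:.3f} and {fractions[1]:.3f}")
```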
<p>In particular, about 57% of the papers accepted by the first committee were rejected by the second one and vice versa. In other words, most papers at NIPS would be rejected if one reran the conference review process (with a 95% confidence interval of 40-75%):</p>
<p><img src="/assets/nips-pie11.png" alt="Nips pie chart" /></p>
<p style="text-align: center">Most papers accepted by one committee were rejected by the other, and vice versa.</p>
<p>This result was surprisingly large to most people I’ve talked to; they generally expected something like 30% instead of 57%. Relative to what people expected, 57% is actually closer to a purely random committee, which would only disagree on 77.5% of the accepted papers on average:</p>
<p><img src="/assets/randomgraph2.png" alt="Random model" /></p>
<p style="text-align: center">
If the committees were purely random, at a 22.5%
acceptance rate<br /> they would disagree on 77.5% of their acceptance lists on average.</p>
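<p>The purely random baseline is easy to simulate. The sketch below assumes each committee picks a fixed number of papers uniformly at random, independently of the other; the function name and the number of trials are illustrative choices, not part of the experiment. Because 22.5% of 166 is rounded to 37 papers, the simulated value lands near 1 - 37/166, a hair above 77.5%.</p>

```python
import random


def random_committee_disagreement(n_papers=166, accept_rate=0.225,
                                  trials=2000, seed=0):
    """Average fraction of committee A's accepts that committee B rejects
    when both committees choose their accept lists uniformly at random."""
    rng = random.Random(seed)
    n_accept = round(n_papers * accept_rate)  # 37 papers out of 166
    papers = range(n_papers)
    total = 0.0
    for _ in range(trials):
        accepted_a = set(rng.sample(papers, n_accept))
        accepted_b = set(rng.sample(papers, n_accept))
        # Papers accepted by A that B did not accept.
        total += len(accepted_a - accepted_b) / n_accept
    return total / trials


print(f"simulated disagreement: {random_committee_disagreement():.3f}")
print(f"1 - acceptance rate:    {1 - 0.225:.3f}")
```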
<p>In the next section, I’ll discuss a couple of simple models for the conference review process that give the observed level of randomness.</p>
<h2 id="models-for-conference-acceptance">Models for conference acceptance</h2>
<p>One rough model for paper acceptance, consistent with the experiment, is as follows:</p>
<ol>
<li>Half the submissions are found to be poor and reliably rejected.</li>
<li>The other half are accepted based on an unbiased coin flip.</li>
</ol>
<p>This might be a decent rule of thumb, but it’s clearly missing something: some really good papers do have a chance of acceptance larger than one half.</p>
<h3 id="the-messy-middle-model">The “messy middle” model</h3>
<p>One simple extension to the above model is the “messy middle” model, where some papers are clear accepts; some papers are clear rejects; and the papers in the middle are largely random. We can compute what kinds of parameters are consistent with the NIPS experiment. Options include:</p>
<ol>
<li><strong>The above model.</strong> Half the papers are clear rejects, and everything else is random.</li>
<li><strong>The opposite.</strong> 7% of all papers (i.e. 30% of accepted papers) are clear accepts, and the other 93% are random.</li>
<li><strong>Somewhere in the middle.</strong> For example 6% of all papers (i.e. 25% of accepted papers) are clear accepts, 25% of submitted papers are clear rejects, and the rest are random.</li>
</ol>
<p><img src="/assets/messymiddle.png" alt="Messy middle" /></p>
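<p>One way to see why these three parameter choices are all consistent with the experiment is to compute the disagreement each one implies. The sketch below is my own simplification, not necessarily the calculation behind the figure: it assumes each committee accepts every "middle" paper independently with the same probability, chosen so that the overall acceptance rate is 22.5%. The function name is hypothetical.</p>

```python
def messy_middle_disagreement(clear_accept, clear_reject, accept_rate=0.225):
    """Expected fraction of one committee's accepts that the other rejects.

    clear_accept, clear_reject: fractions of all submissions that are
    always accepted / always rejected. Everything else is the random middle.
    """
    middle = 1.0 - clear_accept - clear_reject
    if middle <= 0.0:
        raise ValueError("no random middle left")
    # Acceptance probability for a middle paper that yields the target rate.
    p = (accept_rate - clear_accept) / middle
    if not 0.0 <= p <= 1.0:
        raise ValueError("parameters are inconsistent with the acceptance rate")
    # Only middle papers can be accepted by one committee and rejected by
    # the other; that happens with probability p * (1 - p) per paper.
    accepted_by_a_only = middle * p * (1.0 - p)
    return accepted_by_a_only / accept_rate


for label, a, r in [("half clear rejects", 0.00, 0.50),
                    ("7% clear accepts", 0.07, 0.00),
                    ("somewhere in the middle", 0.06, 0.25)]:
    print(f"{label}: {messy_middle_disagreement(a, r):.1%}")
```

<p>Under this assumption all three options give a disagreement of roughly 55-57%, in line with the observed figure.</p>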
<h3 id="the-noisy-scoring-model">The “noisy scoring” model</h3>
<p>As I was discussing this over dinner, Jacob Abernethy proposed a “noisy scoring” model based on his experience as an area chair. Each paper typically gets three reviews, each giving a score on a 0-10 scale. The committee uses the average score<sup><a href="#footnotes">2</a></sup> as the main signal for paper quality. As I understand it, the basic committee process was that almost everything above 6.5 was accepted, almost everything below 6 was rejected, and the committee mainly debated the papers in between.</p>
<p>A basic simplified model of this would be as follows. Each paper has a “true”
score \(v\) drawn from some distribution (say, \({N(0,
\sigma_{between}^2)}\)), and the reviews for the paper are drawn
from \(N(v, \sigma_{within}^2)\). Then the NIPS experiment’s
result (number of papers in which the two committees disagree) is a function
of the ratio \(\sigma_{between}/\sigma_{within}\). We find that
the observation would be consistent with this model if
\(\sigma_{within}\) is between one and four times \(\sigma_{between}\):</p>
<p><img src="/assets/noisyscoring2.png" alt="Noisy scoring" /></p>
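<p>The noisy scoring model can also be explored by simulation. The following is a minimal sketch under several assumptions that go beyond the text: each committee gets three fresh reviews per paper, accepts exactly the top 22.5% by average score with no further debate, and the pool is enlarged to 2000 papers so that a single run is reasonably stable. The function name and defaults are illustrative.</p>

```python
import random


def noisy_scoring_disagreement(sigma_between, sigma_within, n_papers=2000,
                               n_reviews=3, accept_rate=0.225, seed=0):
    """Fraction of committee A's accepts that committee B rejects when both
    rank the same papers by the average of independent noisy reviews."""
    rng = random.Random(seed)
    n_accept = round(n_papers * accept_rate)

    def committee_accepts(true_scores):
        # Average n_reviews noisy reviews per paper, then take the top slice.
        averages = [
            sum(rng.gauss(v, sigma_within) for _ in range(n_reviews)) / n_reviews
            for v in true_scores
        ]
        ranked = sorted(range(len(averages)), key=averages.__getitem__,
                        reverse=True)
        return set(ranked[:n_accept])

    # "True" paper scores, shared by both committees.
    true_scores = [rng.gauss(0.0, sigma_between) for _ in range(n_papers)]
    accepted_a = committee_accepts(true_scores)
    accepted_b = committee_accepts(true_scores)
    return len(accepted_a - accepted_b) / n_accept


# Sweep the ratio sigma_within / sigma_between.
for ratio in (0.5, 1.0, 2.0, 4.0, 8.0):
    d = noisy_scoring_disagreement(1.0, ratio)
    print(f"sigma_within = {ratio} * sigma_between: disagreement {d:.2f}")
```

<p>Only the ratio of the two standard deviations matters here, since rescaling all scores leaves the rankings unchanged.</p>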
<p>Once the NIPS review data is released, we can check the empirical
\({\sigma_{within}}\) and \({\sigma_{between}}\) to see if this model is reasonable.</p>
<p>One nice thing about the noisy scoring model is that you don’t actually need to run the NIPS experiment to estimate the parameters. Every CS conference could measure the within-paper and between-paper variance in reviewer scores. This lets you measure the expected randomness in the results of the process, assuming the model holds.</p>
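<p>As a rough illustration of how a conference could estimate the two variances from its own review scores, here is a sketch using the standard method-of-moments estimates for a one-way random effects model. It assumes every paper has the same number of reviews (at least two) and ignores reviewer-specific effects; the function name is hypothetical.</p>

```python
from statistics import mean, variance


def variance_components(scores_by_paper):
    """Estimate (between-paper variance, within-paper variance).

    scores_by_paper: one list of review scores per paper, all of the same
    length k >= 2.
    """
    k = len(scores_by_paper[0])
    if k < 2 or any(len(s) != k for s in scores_by_paper):
        raise ValueError("need the same number (>= 2) of reviews per paper")
    # Within-paper variance: average sample variance of a paper's reviews.
    within = mean(variance(s) for s in scores_by_paper)
    # The variance of per-paper means is inflated by review noise (within / k),
    # so subtract that part; clamp at zero in case noise dominates.
    paper_means = [mean(s) for s in scores_by_paper]
    between = max(variance(paper_means) - within / k, 0.0)
    return between, within
```

<p>The ratio of the square roots of the two estimates gives \(\sigma_{between}/\sigma_{within}\), which is the quantity the model depends on.</p>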
<h2 id="conclusions">Conclusions</h2>
<p>Computer science conference acceptances seem to be more random than we had previously realized. This suggests that we should rethink the importance we give to them in terms of the job search, tenure process, etc.</p>
<p>I’ll close with a few final thoughts:</p>
<ul>
<li>Consistency is not the only goal. Double-blind reviewing probably decreases consistency by decreasing the bias towards established researchers, but this is a good thing and the TCS conferences should adopt the system.</li>
<li>Experiments are good! As scientists, we ought to do more experiments on our processes. The grad school admissions process seems like a good target for this, for example.</li>
<li>I’d like to give a <em>huge</em> shout-out to the NIPS organizers, Corinna Cortes and Neil Lawrence, for running this experiment. It wasn’t an easy task – not only did they review 10% more papers than necessary, they also had the overhead of finding and running two independent PCs. But the results are valuable for the whole computer science community.</li>
</ul>
<h4 id="footnotes">Footnotes</h4>
<ol>
<li>The committees did not know which of the ~900 papers they were reviewing were the 166 duplicated ones, so there can be some variation in how many papers to accept, but this is a minor effect.</li>
<li>They also use a “confidence-weighted” score, but let’s ignore that detail.</li>
</ol>
Mon, 15 Dec 2014 19:45:39 +0000
http://blog.mrtz.org/2014/12/15/the-nips-experiment.html
tcs
How big data is unfair
<p>Head on over to <a href="https://medium.com/@mrtz/how-big-data-is-unfair-9aa544d739de">Medium</a>
for a non-technical general audience piece I wrote on <em>why machine learning is not, by
default, fair or just in any meaningful way.</em></p>
<p>Since I first wrote this post there’s been some interesting follow up by
<a href="https://medium.com/@hannawallach/big-data-machine-learning-and-the-social-sciences-927a8e20460d">Hanna
Wallach</a>.
Also be sure to check out the web site of the <a href="http://www.fatml.org">NIPS workshop</a> on fairness,
accountability and transparency that Solon Barocas and I organized.</p>
Sat, 27 Sep 2014 01:02:37 +0000
http://blog.mrtz.org/2014/09/27/how-big-data-is-unfair.html