Is differential privacy practical?
Computer scientists like to equate practicality with computational efficiency. I plead guilty. If an algorithm just runs fast enough, surely it will be useful in practice. The situation is often much more complicated. Sure enough, the machine model matters, simplicity matters, constants matter, but that's not what I'm getting at. In many cases practicality depends a lot on the legal and social environment in which the algorithm is to be applied. In this post I will start to discuss the question: Is Differential Privacy practical, in the icky non-computational sense of the word? On another occasion I might talk about whether Differential Privacy is practical in our beloved computational sense of the word. That, too, is an important and exciting question.
Differential Privacy is a formal notion of what it means for an algorithm to be privacy-preserving. It was invented jointly by Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith in 2006. Since then it has turned out that much of statistical data analysis is still feasible even if we insist on differential privacy! This is a surprising insight that has been developed by a diverse community of researchers working on differential privacy. From a theoretical perspective, differential privacy is, in my opinion, already a success story. The definition (unlike many others) has held up to close scrutiny over the years (and there is a healthy ongoing discussion). There are many beautiful results relating differential privacy to other areas of computer science and math, including convex geometry, discrepancy theory and especially learning theory, just to name a few. These connections have led to much insight over the years and continue to be an exciting research area.
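For readers who haven't seen it, the definition in its pure form reads as follows (the approximate $(\varepsilon,\delta)$-variant relaxes this slightly). A randomized algorithm $M$ is $\varepsilon$-differentially private if for every pair of data sets $D$ and $D'$ that differ in the record of a single individual, and for every set $S$ of possible outputs,

$$\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S].$$

Informally: whatever the algorithm outputs, it would have been almost as likely to output the same thing had any one person's data been removed or changed.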
There is no question that there hasn't been nearly as much progress in applying differential privacy in non-academic settings. Sure enough, there is a lot of empirical work on differential privacy within academia, and I certainly don't mean to belittle its importance. I will focus, though, on attempts at applying differential privacy in a completely non-academic environment. I've come to regard this as one of the most important directions for differential privacy to push into at this point. I find it so important that I'm willing to drop pen and paper for it and get my hands dirty. As a theory guy, I have pretty ambivalent feelings about this, a point Salil Vadhan also made during a recent Simons workshop on Differential Privacy. On the one hand, there is currently an almost urgent need for privacy-preserving solutions in the world. On the other hand, many of the experts in the field have a theoretical inclination and would prefer to work on the science. Who is there to meet the demand for practical solutions? Who is there to educate the non-experts? Who is going to convince corporate lawyers that Differential Privacy is a good solution? Who is going to write the software? Should theoreticians (or more broadly academics) do any of this? If so, it feels like an uncertain departure from our day-to-day academic work. It most likely won't result in another conference publication. Yet, if the experts don't join this effort, I fear, much of the potential impact of the field could get lost. Inferior solutions will meet the demand as industry progresses one way or the other.
In this series of blog posts I will try to explain where the difficulty lies in applying differential privacy and how the community might make more progress. My narrative will be closely tied to a recent and ongoing project with the California Public Utilities Commission (CPUC). The CPUC is a regulatory body that oversees privately held electric, natural gas and water companies in California. The CPUC is currently working on a proposal for an energy data center that aims to facilitate access to household-level smart meter readings by third parties. I was asked by the Electronic Frontier Foundation (EFF) to assist as a privacy expert in the proceeding.
Energy data might not be the first thing that comes to mind when you think of privacy concerns. Yet, it is well understood that smart meter data can be highly revealing. There is a large body of research devoted to drawing inferences from smart meter readings. At high enough resolution (though within the limits of current smart meters), for example, an algorithm could infer which television show you are watching. Inferring which TV set you own is an easier task. Even at much lower resolution (15-minute interval data), larger appliances (e.g., medical appliances) can feasibly be identified. Various personal habits are visible. Do you wake up at night? How often do you use the bathroom? At the same time, energy data has high public value. Research on energy data can lead to a smarter grid, a better optimized energy system that is ultimately cheaper for the consumer. There is a lot of political pressure in California to facilitate such research. The energy data center project grew out of this desire.
I should say that even though I work on privacy-preserving algorithms, I'm solidly on the side of the data analyst. I want to enable research through statistical data analysis rather than prevent it. To me, differential privacy has the unambiguous goal of enabling more insight in environments that are plagued by privacy concerns. That's an important thing to keep in mind. Privacy researchers are all too often mistaken for some sort of morality police that asks you to turn down the music and stop the party at 9pm.
In keeping with my decision to produce short posts, I will end this post here. In my next post in this series, I'll introduce the parties involved in the CPUC proceeding, summarize what's happened so far and describe how it has informed my understanding of practicing differential privacy.
To follow future posts, subscribe to the RSS feed or follow me on Twitter.