What should a theory of big data say?
I checked out my new apartment in Berkeley a few days ago. My very senior landlord asked me to explain to him what it is that I do. So, I said: "I'm a computer scientist." He gave me an empty stare as if he had never heard of such a thing. I tried again: "I work on algorithms." Nothing. Not a sign of engagement. Fine, I thought to myself, here's one last attempt: "Well, the program I'm attending at Berkeley is on what they call Big Data." His eyes lit up. He smiled and replied without a moment of hesitation: "Big Data! Why didn't you just say that?" He went on to tell me all about what Big Data is going to do in the coming years, how important it is, how it is going to advance medical research and lead to new scientific discoveries. This is really only a case in point. The words "big data" have apparently become more evocative to the general public than either of the terms "algorithm" or "computer science".
At any rate, the program I'm attending, called "The Theoretical Foundations of Big Data", is an incredibly promising endeavor of the new Simons Institute for the Theory of Computing. As already featured by Suresh, the new institute is amazing. It has outdoor blackboards (or at least it will soon), plenty of collaboration space, and, most importantly, a really fun group of visitors. The fact that it's located in Berkeley sure doesn't hurt either. Suresh, Andrew and I will try to cover some of the exciting events, so check back regularly if you're interested.
I want to start out by asking the obvious question: what should the theoretical foundations of Big Data be about? A naïve answer is that Big Data is all about making stuff faster, so we should be talking about the theory of making stuff faster. There's nothing wrong with making stuff faster. It's important. No doubt. Nevertheless, I think a lot is lost if we take such a narrow interpretation of Big Data. Computer scientists, theorists and non-theorists alike, have been making stuff faster in various models of computation for ages. It's not that we've only come around to this now that there's a need to analyze larger data sets. The description suggested on the Simons website is certainly more nuanced, though it still focuses on scalability: "The challenge is to develop the theoretical principles needed to scale inference and learning algorithms to massive, even arbitrary, scale."
I think a fundamental challenge for anything like a "theory of Big Data" lies somewhere else. It has to do with our choice of precedence. The classical point of view in complexity theory is that the problem comes first and the input second. We first fix a problem that we would like to study, say 3-Coloring, TSP, or Clique. Then we discuss algorithms for the specific problem we chose. The input, and whatever properties it might have, is often secondary. This viewpoint has fundamentally shaped the research culture. Algorithms are often carefully tailored over the course of many decades to a particularly important problem such as 3-Coloring.
Perplexingly, the viewpoint perpetuated by "Big Data" is exactly the opposite. The input, the data set, is the protagonist. Data gets all the attention. The problem that needs to be solved is negotiable. You rarely hear a data analyst tell you exactly which problem they need to solve. Even if you poke them, they'll rarely give you a precise answer. Try it out! As theoreticians we often suspect the issue is that those guys simply haven't formalized their problems yet. That's partially true. The more fundamental reason, though, is that the problem genuinely depends on the data set. Let me make this point clearer with an example. Suppose a scientist collects a large set of data records about cancer patients. If my landlord is to be trusted, and for various reasons I hope he is, the goal now is scientific discovery. There are many ways to go about it. There are many ways to explore the data set. The analyst will use her domain knowledge and the specifics of the data set to make an initial guess. A first try might be linear regression (which is of course by itself well-defined). If that doesn't work, she might try logistic regression, and then any of thousands of possible methods. Of course, there is the danger that as more and more methods are evaluated against the data set, any observed pattern is increasingly likely to be a result of over-fitting. Regardless, the problem that the scientist will have solved at the end of the day is ill-defined. More importantly, even if there were a definition of the problem, it would inevitably depend on the data set. It is unlikely that a different medical study would want to solve precisely the same problem.
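To make the over-fitting danger concrete, here is a minimal sketch (not part of the example above) using only numpy and entirely synthetic data. The labels carry no signal whatsoever, and each candidate "method" is just a trivial threshold rule, yet the best of a thousand such methods still looks noticeably better than chance on the very data it was selected on.

```python
# Sketch: selecting the best of many methods on one fixed data set
# inflates apparent performance, even when there is nothing to find.
import numpy as np

rng = np.random.default_rng(0)

n_patients = 200   # records in the synthetic "medical study"
n_methods = 1000   # number of analysis methods the analyst tries

# Labels and features are independent noise: there is nothing to discover.
labels = rng.integers(0, 2, size=n_patients)          # shape (200,)
features = rng.normal(size=(n_patients, n_methods))   # shape (200, 1000)

# Each "method" is a trivial classifier: predict 1 if its feature is positive.
predictions = (features > 0).astype(int)              # shape (200, 1000)
accuracies = (predictions == labels[:, None]).mean(axis=0)

print(f"accuracy of one fixed method:      {accuracies[0]:.3f}")
print(f"best accuracy over {n_methods} methods: {accuracies.max():.3f}")
```

The first number hovers around 0.5, as it should; the second comes out well above it, simply because the maximum of many noisy estimates is biased upward. Nothing here is specific to threshold rules; the same effect shows up when an analyst cycles through regressions and other models against the same records.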
This situation should fill any theorist with discomfort. If the input determines the problem that we need to solve, what problem should we study? Or rather, from the point of view of the data analyst, what guidance does our theory provide in choosing one approach over another? I'm not suggesting that we should tailor more theory around what we currently perceive as practical; there's a good reason to be a step removed from practice. Still, there are two legitimate ways of deciding the precedence of problem versus data, and in theory we chose one over the other and moved on.
Similar questions came up at the Stanford workshop "Beyond Worst-Case Complexity", which I vividly remember as one of my favorite workshops of all time. It's true that Big Data is also at odds with worst-case analysis. As a starting point, theory folks could try to figure out which properties large data sets typically exhibit and how those properties might affect algorithm design. This research direction has already gained some momentum. Still, I feel this doesn't address the heart of the problem described above. Average-case complexity doesn't attempt to reverse the roles of problem and data. It only advocates thinking about properties of the input, for a specific problem, that are typically true and might make the problem easier.
Another perspective was offered in Prabhakar Raghavan's invited talk at STOC 2013. Interestingly, he described Big Data as bad news for algorithm design. His point was that sophisticated algorithms have been replaced by less clever algorithms run on larger data sets. My view is quite a bit more optimistic. I see Big Data as an opportunity for algorithm design to revisit some of its theoretical foundations.
Update (8/27): Suresh posted an interesting response to this post.