Research Vignette: Foundations of Data Science

by Ilias Diakonikolas (University of Southern California), Santosh Vempala (Georgia Institute of Technology), and David P. Woodruff (Carnegie Mellon University)

Algorithmic High-Dimensional Robust Statistics
Fitting a model to a collection of observations is one of the quintessential goals of statistics and machine learning. A major recent advance in theoretical machine learning is the development of efficient learning algorithms for various high-dimensional models, including Gaussian mixture models, independent component analysis, and topic models. The Achilles’ heel of these algorithms is the assumption that data is precisely generated from a model of the given type.

This assumption is crucial for the performance of these algorithms: even a very small fraction of outliers can completely compromise an algorithm’s behavior. For example, k adversarially placed points suffice to completely alter the output of k-dimensional principal component analysis (PCA), a primitive underlying many of these algorithms. However, the assumption is only approximately valid, as real data sets are typically exposed to some source of contamination. Moreover, the data corruption is often systematic, and random noise models do not accurately capture its nature. Hence, it is desirable that any estimator used in practice be stable in the presence of arbitrarily and adversarially corrupted data.
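To make the fragility of PCA mentioned above concrete, here is a minimal numerical sketch (not from the vignette; the data, dimensions, and magnitudes are illustrative assumptions): a single point placed far along an otherwise quiet coordinate hijacks the top principal component of an otherwise well-behaved sample.

    import numpy as np

    rng = np.random.default_rng(0)

    # 1,000 inliers in 10 dimensions whose variance is concentrated on coordinate 0.
    X = rng.normal(size=(1000, 10))
    X[:, 0] *= 5.0

    def top_principal_direction(data):
        # Top right singular vector of the centered data = top principal component.
        centered = data - data.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        return vt[0]

    print(np.round(np.abs(top_principal_direction(X)), 2))  # dominated by coordinate 0

    # A single adversarial point placed far along coordinate 9 flips the answer.
    outlier = np.zeros((1, 10))
    outlier[0, 9] = 1e4
    print(np.round(np.abs(top_principal_direction(np.vstack([X, outlier]))), 2))  # now dominated by coordinate 9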

Indeed, the problem of designing outlier-robust estimators is natural enough that it is studied by a classical body of work, namely robust statistics, whose prototypical question is the design of estimators that perform well in the presence of corrupted data. This area of statistics was initiated by the pioneering works of Tukey and Huber in the 1960s and addresses a conceptual gap in Fisher’s theory of exact parametric inference — since parametric models are typically only approximately valid, robust statistics is essential to complete the theory. From a practical perspective, the question of how to make good inferences from data sets in which pieces of information are corrupted has become a pressing challenge. Specifically, the need for robust statistics is motivated by data poisoning attacks, in the context of adversarial machine learning, as well as by automatic outlier removal for high-dimensional data sets from a variety of applications.

Classical work in robust statistics pinned down the fundamental information-theoretic aspects of high-dimensional robust estimation, establishing the existence of computable, information-theoretically optimal robust estimators for fundamental problems. In contrast, until very recently, the computational complexity of robust estimation was poorly understood. In particular, even for the basic problem of robustly estimating the mean of a high-dimensional data set, all known estimators achieving near-optimal robustness were computationally intractable. Moreover, the error of the known computationally efficient heuristics grows polynomially with the dimension. This state of affairs prompted the following natural question: can we reconcile robustness and computational efficiency in high-dimensional estimation?
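A small simulation illustrates this dimension dependence (an illustrative sketch, not from the vignette; the contamination model, the sample sizes, and the choice of the coordinate-wise median as the efficient heuristic are assumptions made here for concreteness): under a fixed fraction of corrupted samples, the estimation error of the coordinate-wise median grows roughly like the square root of the dimension.

    import numpy as np

    rng = np.random.default_rng(0)
    eps = 0.1  # fraction of adversarially corrupted samples

    def coordwise_median_error(d, n=2000):
        # Draw n samples from N(0, I_d); shift an eps fraction by +1 in every coordinate.
        sample = rng.normal(size=(n, d))
        sample[: int(eps * n)] += 1.0
        estimate = np.median(sample, axis=0)   # efficient heuristic, but not dimension-independent
        return np.linalg.norm(estimate)        # distance from the true mean (the origin)

    for d in (10, 100, 1000):
        print(d, round(coordwise_median_error(d), 2))  # error grows roughly like sqrt(d)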


Research Vignette: Lower Bounds in Computational Complexity

by Rahul Santhanam (University of Oxford)

Computational complexity theory studies the possibilities and limitations of algorithms. Over the past several decades, we have learned a lot about the possibilities of algorithms. We live today in an algorithmic world, where the ability to reliably and quickly process vast amounts of data is crucial. Along with this social and technological transformation, there have been significant theoretical advances, including the discovery of efficient algorithms for fundamental problems such as linear programming and primality testing.

We know far less about the limitations of algorithms. From an empirical point of view, certain important problems seem inherently hard to solve. These include the satisfiability problem in logic, the Traveling Salesman Problem in graph theory, the integer linear programming problem in optimization, the equilibrium computation problem in game theory, and the protein folding problem in computational biology. What all these problems have in common is that it is easy to verify if a given solution is correct, but it seems hard to compute solutions. This phenomenon is encapsulated in the celebrated NP vs. P question, which asks if all problems with solutions verifiable in polynomial time (NP) can also be solved in polynomial time (P). The NP vs. P problem is important both mathematically, as evidenced by its inclusion in the list of Millennium Prize Problems by the Clay Mathematics Institute, and scientifically, since natural problems that arise in a variety of scientific contexts are in NP but not known to be in P. To make progress on NP vs. P and related questions, we need to show complexity lower bounds (i.e., prove that a given computational problem cannot be solved efficiently).
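The distinction between verifying and finding solutions can be made concrete with a small sketch (the formula encoding and the brute-force solver below are illustrative assumptions, not taken from the vignette): checking a proposed truth assignment for a Boolean formula takes time linear in the formula size, whereas the only obvious way to find a satisfying assignment is to try exponentially many candidates.

    from itertools import product

    # A CNF formula as a list of clauses; each clause is a list of signed variable
    # indices (positive = the variable, negative = its negation).
    # This encodes (x1 OR NOT x2) AND (x2 OR x3).
    formula = [[1, -2], [2, 3]]

    def verify(formula, assignment):
        # Polynomial-time check: every clause contains a satisfied literal.
        return all(
            any(assignment[abs(lit)] == (lit > 0) for lit in clause)
            for clause in formula
        )

    def solve(formula, num_vars):
        # Brute-force search: tries up to 2^num_vars assignments in the worst case.
        for bits in product([False, True], repeat=num_vars):
            assignment = {i + 1: bits[i] for i in range(num_vars)}
            if verify(formula, assignment):
                return assignment
        return None

    print(verify(formula, {1: True, 2: False, 3: True}))  # fast verification
    print(solve(formula, 3))                               # exhaustive search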

Complexity lower bounds are interesting for many reasons. First, from a pragmatic point of view, they map the boundaries of what is possible and, hence, save us from wasting our efforts on finding efficient solutions to problems beyond those boundaries. Second, lower bounds can be exploited algorithmically to design secure cryptographic protocols and efficiently convert randomized algorithms into deterministic ones. Thus, intriguingly, proving new limits on the power of computation also opens up new possibilities for computation! Third, and perhaps most importantly, lower bounds provide a deeper understanding of the nature of computation. An efficient solution to an algorithmic problem tells us something new about that specific problem, while a lower bound often tells us something new about the computational model and hence about algorithms in general.

The Fall 2018 Simons Institute program on Lower Bounds in Computational Complexity brought together researchers working on different aspects of complexity lower bounds, in Boolean, algebraic, and interactive settings, with the goal of making progress on the major open problems concerning lower bounds. In some of these settings, such as interactive computation, good lower bounds are known for various problems. In others, such as general Boolean circuits, very little is known. In the remainder of this article, I will briefly survey what is known about Boolean circuit lower bounds and describe a phenomenon called hardness magnification that provides some new insights. This line of work was developed partly during the Lower Bounds program with my collaborators Igor Oliveira and Ján Pich.
