Research Vignette: Generalization and Interpolation

by Daniel Hsu

The practice of deep learning has attracted new attention to several basic questions in statistical machine learning. One such question is how fitting machine learning models to relatively small “training” data sets can lead to accurate predictions on new data. In machine learning jargon, this is the question of generalization.

The conventional wisdom in machine learning offers the following about generalization:

  1. A model that is too simple will underfit the true patterns in the training data, and thus, it will predict poorly on new data.
  2. A model that is too complicated will overfit spurious patterns in the training data; such a model will also predict poorly on new data.

Consequently, one should choose a model that balances these concerns of underfitting and overfitting [30]. A textbook corollary is that a model that exactly fits — i.e., interpolates — the training data “is overfit to the training data and will typically generalize poorly” [17].

Recent machine learning practice appears to eschew this conventional wisdom in a dramatic fashion. For instance, a common starting point for training a neural network is to find a model that exactly fits the training data [27]. (Typically, the model is subsequently fine-tuned using different criteria, but the starting point is already nontrivial.) While this may be difficult or impossible with small neural networks, it becomes substantially easier after making the network very large. It is remarkable that this practice is compatible with generalization, despite the concerns about overfitting.

This apparent contradiction of conventional wisdom is not new. In the last century, machine learning practitioners were already aware of the efficacy (and advantages) of using “oversized” neural networks in which the number of model parameters exceeds the number of training examples [9, 20]. Similar observations were made about large ensembles of decision trees: it was found that increasing the number of decision trees in an ensemble could improve accuracy, even beyond the point at which the ensemble exactly fit the training data [28]. These empirical observations motivated the development of a new theory that greatly sharpens the existing body of knowledge in statistical machine learning [1, 28], and this theory has recently been extended to deep neural networks [2, 15, 26].

Figure 1: Leo Breiman’s list of important questions regarding neural networks (reproduced from [9])

However, the new theory does not fully explain recent observations about interpolating neural networks. First, in more recent experiments, neural networks are trained to interpolate noisy data [34]. By noisy data, we mean data in which the prediction targets (labels) may be erroneous. In fact, experiments have been conducted in which noise is deliberately injected in the data by assigning random labels to subsets of the training data. Neural networks that interpolate these data were nevertheless found to have nontrivial accuracy on new data. Second, similar phenomena have been observed in the context of predicting real numerical targets. (The aforementioned theoretical and empirical studies mostly focused on classification targets.) When predicting real numbers, it is almost always a given that training data will have noisy labels, so the thought of interpolating training data seems even more preposterous. But again, in recent experimental studies, interpolating (or nearly interpolating) models were found to have nontrivially accurate predictions on new data [4, 7].
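
The experiments described above are easy to emulate in miniature. The sketch below is purely illustrative (the linear ground truth, the ReLU random-features model, and the noise level are assumptions, not taken from the studies cited): it fits a model with far more parameters than training points so that it exactly interpolates noisy labels, then measures its error on fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: a linear signal observed with label noise.
d, n_train, n_test = 10, 50, 2000
w_star = rng.normal(size=d)
X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))
y_tr = X_tr @ w_star + 0.5 * rng.normal(size=n_train)   # noisy training labels
y_te = X_te @ w_star                                     # clean targets for evaluation

# Overparameterized model: many more random ReLU features than training points.
D = 2000
W = rng.normal(size=(d, D)) / np.sqrt(d)
phi = lambda X: np.maximum(X @ W, 0.0)

# Minimum-norm solution that exactly fits (interpolates) the noisy training data.
theta = np.linalg.pinv(phi(X_tr)) @ y_tr

train_mse = np.mean((phi(X_tr) @ theta - y_tr) ** 2)     # numerically zero
test_mse = np.mean((phi(X_te) @ theta - y_te) ** 2)      # error on new data
print(f"train MSE = {train_mse:.2e}, test MSE = {test_mse:.3f}")
```

Because the number of features exceeds the number of training points, the pseudoinverse returns the minimum-norm interpolating solution, so the training error is (numerically) zero even though the labels are noisy; the question raised above is what happens to the error on new data.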

CONTINUE READING

Research Vignette: The Many Dimensions of High-Dimensional Expanders

by Prahladh Harsha

An emerging theme in several disciplines of science over the last few decades is the study of local-to-global phenomena. Let me elucidate with a few examples. In biology, one tries to understand the global properties of an organism by studying local interactions at a cellular level. The Internet graph is impossible to predict or control at a global level; what one can at best do is understand its behavior by making changes at a very local level. Moving to examples closer to theoretical computer science, the pioneering works in computation theory due to Turing and others define the computation of any global function as one that can be broken down into a sequence of local computations. The seminal work on NP-completeness due to Cook, Levin, and Karp demonstrates that any verification task can be in fact reduced to a conjunction of verifying very local objects.

Given these examples, an intriguing question that arises in multiple disciplines is to identify which objects support such local-to-global phenomena. This question as stated is an ill-formed one, but one can attempt to make it well-defined in a variety of contexts. Let me illustrate by giving three specific instances.

Mixing time in graphs: Which graphs have a global mixing property that can be inferred from the mixing properties of several related smaller graphs?

Local testing and decoding (in coding theory): Which error-correcting codes can be tested or decoded by looking at the codeword or received word only very locally (i.e., at very few locations)?

Topologically expanding graphs: Which graphs or hypergraphs have the property that their global topological properties can be inferred from their local topological behavior?

These questions, seemingly disparate and arising in very different contexts and communities (in particular, approximate sampling, coding theory, and topology), surprisingly all led to the investigation of a similar expanding object: the high-dimensional expander. The Simons Institute 2019 summer cluster on Error-Correcting Codes and High-Dimensional Expansion brought together researchers from these diverse communities to study high-dimensional expanders and their potential applications to mathematics and theoretical computer science.

What is a high-dimensional expander?

High-dimensional expanders (HDXs) are a high-dimensional analogue of expander graphs. An expander graph, loosely speaking, is an extremely well-connected graph. Analytically, this is best captured via the second-largest eigenvalue (in absolute value) of the normalized adjacency matrix of the graph. More precisely, a graph G is said to be a λ-expander (for λ ∈ [0,1]) if the second-largest eigenvalue (in absolute value) of the normalized adjacency matrix \(A_G\) of G is bounded above by λ. The smaller the λ, the better the expander. An equivalent definition is in terms of how well the random walk on the vertices induced by the graph mixes: G is a λ-expander if the spectral norm of the difference of the normalized adjacency matrix \(A_G\) and the normalized all-ones matrix \(\frac{1}{n} J\) (the normalized adjacency matrix of the complete graph) is bounded above by λ, i.e.,

\[\left\| A_G - \frac{1}{n} J \right\| \leq \lambda,\]

where n is the number of vertices of the graph G.
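
To make the definition concrete, the following sketch (assuming numpy and networkx are available; the random regular graph is just an illustrative input) estimates λ for a given graph as the spectral norm of \(A_G - \frac{1}{n}J\).

```python
import numpy as np
import networkx as nx

def expansion_lambda(G: nx.Graph) -> float:
    """Spectral norm of A_G - (1/n) J for the normalized adjacency matrix A_G."""
    A = nx.to_numpy_array(G)
    A_norm = A / A.sum(axis=1, keepdims=True)   # normalized (random-walk) adjacency matrix
    n = A.shape[0]
    J = np.ones((n, n)) / n                     # normalized all-ones matrix
    return np.linalg.norm(A_norm - J, ord=2)

# A random d-regular graph is, with high probability, a good expander.
G = nx.random_regular_graph(d=6, n=200, seed=1)
print(f"lambda ≈ {expansion_lambda(G):.3f}")
```

The smaller the reported value, the better the expander, matching the definition above.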

CONTINUE READING

Research Vignette: Geometry of Polynomials

by Shayan Oveis Gharan

Classical theorems in theoretical computer science typically rely heavily on combinatorial or probabilistic tools. Later on, linear algebraic and geometric tools proved to be fundamental elements in numerous algorithmic and hardness results. In this research vignette, I will explain a new tool known as the “polynomial paradigm,” which can be seen as a bridge between linear algebraic and probabilistic methods, and I will discuss some unexpected applications in computer science.

Given a polynomial \(p \in \mathbb{R}\lbrack z_{1},\ldots,z_{n}\rbrack\), we can think of it in three ways: (i) through its coefficients, (ii) through its zeros, or (iii) as a function that maps n-dimensional complex vectors to complex numbers. Having these three different representations, one can study what happens to one representation when we change another one, or what happens when we apply a mathematical operation such as differentiation, integration, or specialization. Mathematicians have studied these questions for many classes of polynomials over the years.

Over the last two decades, a new theme that we call the “polynomial paradigm” emerged. It was observed that if the roots of the (multivariate) polynomial \(p\) have a “nice” structure in n-dimensional complex space, then one can rigorously study the interaction of zeros, coefficients, and function values. In several pioneering works, Borcea, Brändén, Choe, Liggett, Oxley, Sokal, and Wagner studied the class of real stable polynomials, which can be seen as the right multivariate generalization of real-rooted polynomials. A polynomial \(p\) with real coefficients is real stable if it has no root in which every coordinate lies in the open upper half-plane, i.e., \(p(z_1,\ldots,z_n) \neq 0\) whenever \(\mathrm{Im}(z_i) > 0\) for all \(i\). They observed that this class is closed under many operations, such as specialization, differentiation, inversion, etc.
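
In one variable, real stability is just real-rootedness, and closure under differentiation is easy to check numerically. The sketch below (an illustrative random example, not from the article) differentiates a real-rooted polynomial and confirms that its roots remain (numerically) real.

```python
import numpy as np

rng = np.random.default_rng(0)
roots = rng.normal(size=8)            # a degree-8 polynomial with real roots
p = np.poly(roots)                    # coefficient representation
dp = np.polyder(p)                    # differentiate in the coefficient basis

print("max |Im root| of p :", np.max(np.abs(np.roots(p).imag)))
print("max |Im root| of p':", np.max(np.abs(np.roots(dp).imag)))   # ~0: still real-rooted
```

The analogous closure properties for multivariate real stable polynomials are what the applications discussed next rely on.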

Over the last decade, the polynomial paradigm proved to be a fruitful approach in many areas of science and mathematics. Several applications of this paradigm in theoretical computer science, combinatorics, and linear algebra were discovered, such as improved approximation algorithms for the traveling salesman problem, a purely algebraic proof of the van der Waerden conjecture, and a resolution of the notorious Kadison-Singer conjecture. The general recipe in almost all these applications is as follows: Given a discrete phenomenon, encode it in a complex multivariate polynomial with a nice zero-free region, such as a real stable polynomial. Then study this polynomial via the interplay of coefficients, zeros, and function values. Finally, translate your findings to the domain of the original problem.

In Spring 2019, Nikhil Srivastava and I organized a semester-long program on Geometry of Polynomials at the Simons Institute, where we invited researchers across several areas of science, mathematics, and computer science to explore this rapidly evolving area. A large group tried to better understand these polynomials and their properties, while many others tried to explore further applications. In the rest of this research vignette, I will explain the journey toward the discovery of a new class of multivariate polynomials that we call completely log-concave polynomials. This discovery and its applications motivated several researchers within the program to start collaborating to better understand this class of polynomials and its properties.

CONTINUE READING

Research Vignette: Foundations of Data Science

by Ilias Diakonikolas (University of Southern California), Santosh Vempala (Georgia Institute of Technology), and David P. Woodruff (Carnegie Mellon University)

Algorithmic High-Dimensional Robust Statistics
Fitting a model to a collection of observations is one of the quintessential goals of statistics and machine learning. A major recent advance in theoretical machine learning is the development of efficient learning algorithms for various high-dimensional models, including Gaussian mixture models, independent component analysis, and topic models. The Achilles’ heel of these algorithms is the assumption that data is precisely generated from a model of the given type.

This assumption is crucial for the performance of these algorithms: even a very small fraction of outliers can completely compromise the algorithm’s behavior. For example, k adversarially placed points can completely alter the k-dimensional principal component analysis, a commonly used primitive for many algorithms. However, this assumption is only approximately valid, as real data sets are typically exposed to some source of contamination. Moreover, the data corruption is often systematic, and random models do not accurately capture the nature of the corruption. Hence, it is desirable that any estimator that is to be used in practice is stable in the presence of arbitrarily and adversarially corrupted data.

Indeed, the problem of designing outlier-robust estimators is natural enough that it is studied by a classical body of work, namely robust statistics, whose prototypical question is the design of estimators that perform well in the presence of corrupted data. This area of statistics was initiated by the pioneering works of Tukey and Huber in the 1960s and addresses a conceptual gap in Fisher’s theory of exact parametric inference — since parametric models are typically only approximately valid, robust statistics is essential to complete the theory. From a practical perspective, the question of how to make good inferences from data sets in which pieces of information are corrupted has become a pressing challenge. Specifically, the need for robust statistics is motivated by data poisoning attacks, in the context of adversarial machine learning, as well as by automatic outlier removal for high-dimensional data sets from a variety of applications.

Classical work in robust statistics pinned down the fundamental information-theoretic aspects of high-dimensional robust estimation, establishing the existence of computable and information-theoretically optimal robust estimators for fundamental problems. In contrast, until very recently, the computational complexity aspects of robust estimators were poorly understood. In particular, even for the basic problem of robustly estimating the mean of a high-dimensional data set, all known robust estimators were hard to compute (i.e., computationally intractable). In addition, the accuracy of the known efficient heuristics degrades quickly as the dimension increases. This state of affairs prompted the following natural question: can we reconcile robustness and computational efficiency in high-dimensional estimation?
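
A tiny simulation conveys the tension. In the sketch below (all choices are illustrative assumptions: Gaussian inliers, a 5% fraction of identically placed outliers, and two simple estimators), both the empirical mean and the coordinate-wise median drift away from the true mean as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 2000, 0.05                              # sample size and corruption fraction

for d in (10, 100, 1000):
    X = rng.normal(size=(n, d))                  # inliers from N(0, I_d); true mean is 0
    k = int(eps * n)
    X[:k] = 20.0                                 # adversary replaces an eps-fraction of points
    err_mean = np.linalg.norm(X.mean(axis=0))
    err_median = np.linalg.norm(np.median(X, axis=0))
    print(f"d={d:5d}  mean error={err_mean:7.2f}  coordinate-wise median error={err_median:7.2f}")
```

The recent algorithmic work referenced above shows that, for models like this one, efficient estimators can achieve error depending only on the corruption fraction (up to logarithmic factors), rather than growing with the dimension.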

CONTINUE READING

Research Vignette: Lower Bounds in Computational Complexity

by Rahul Santhanam (University of Oxford)

Computational complexity theory studies the possibilities and limitations of algorithms. Over the past several decades, we have learned a lot about the possibilities of algorithms. We live today in an algorithmic world, where the ability to reliably and quickly process vast amounts of data is crucial. Along with this social and technological transformation, there have been significant theoretical advances, including the discovery of efficient algorithms for fundamental problems such as linear programming and primality.

We know far less about the limitations of algorithms. From an empirical point of view, certain important problems seem inherently hard to solve. These include the satisfiability problem in logic, the Traveling Salesman Problem in graph theory, the integer linear programming problem in optimization, the equilibrium computation problem in game theory, and the protein folding problem in computational biology. What all these problems have in common is that it is easy to verify if a given solution is correct, but it seems hard to compute solutions. This phenomenon is encapsulated in the celebrated NP vs. P question, which asks if all problems with solutions verifiable in polynomial time (NP) can also be solved in polynomial time (P). The NP vs. P problem is important both mathematically, as evidenced by its inclusion in the list of Millennium Prize Problems by the Clay Mathematics Institute, and scientifically, since natural problems that arise in a variety of scientific contexts are in NP but not known to be in P. To make progress on NP vs. P and related questions, we need to show complexity lower bounds (i.e., prove that a given computational problem cannot be solved efficiently).
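
The asymmetry behind NP vs. P is easy to see in miniature: checking a proposed solution is fast, even when finding one appears hard. A toy sketch (illustrative code, not from the article) for Boolean satisfiability:

```python
from typing import Dict, List

Clause = List[int]   # e.g., [1, -2, 3] encodes (x1 OR NOT x2 OR x3)

def verify_cnf(clauses: List[Clause], assignment: Dict[int, bool]) -> bool:
    """Checks a candidate assignment in time linear in the formula size."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

formula = [[1, -2], [-1, 3], [2, 3]]
print(verify_cnf(formula, {1: True, 2: True, 3: True}))   # True: a satisfying assignment
```

Verification takes one pass over the formula, whereas no known algorithm finds a satisfying assignment (or certifies that none exists) in time polynomial in the formula size.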

Complexity lower bounds are interesting for many reasons. First, from a pragmatic point of view, they map the boundaries of what is possible and, hence, save us from wasting our efforts on finding efficient solutions to problems beyond those boundaries. Second, lower bounds can be exploited algorithmically to design secure cryptographic protocols and efficiently convert randomized algorithms into deterministic ones. Thus, intriguingly, proving new limits on the power of computation also opens up new possibilities for computation! Third, and perhaps most importantly, lower bounds provide a deeper understanding of the nature of computation. An efficient solution to an algorithmic problem tells us something new about that specific problem, while a lower bound often tells us something new about the computational model and hence about algorithms in general.

The Fall 2018 Simons Institute program on Lower Bounds in Computational Complexity gathered together researchers working on different aspects of complexity lower bounds, in Boolean, algebraic, and interactive settings, with the goal of making progress on the major open problems concerning lower bounds. In some of these cases, such as the setting of interactive computation, good lower bounds are known for various problems. In other cases, such as the case of general Boolean circuits, very little is known. In the remainder of this article, I will briefly survey what is known about Boolean circuit lower bounds and describe a phenomenon called hardness magnification that provides some new insights. This line of work was developed partly during the course of the Lower Bounds program with my collaborators Igor Oliveira and Ján Pich.

CONTINUE READING

Research Vignette: Real-Time Decision Making in Energy (RTDM-E)

by Xinbo Geng (Texas A&M University), Swati Gupta (Massachusetts Institute of Technology), Tong Huang (Texas A&M University), and Le Xie (Texas A&M University)

The electric grid serves as one of the backbone infrastructure systems that support the well-being of billions of citizens in modern society. There has been a significant transformation of energy systems over the past decade, even though these changes have largely gone unnoticed due to affordable and reliable grid services for end-users in most developed nations. The transformation on the supply side mainly manifests itself through the decarbonization of the power supply, i.e. greater use of low-carbon energy sources. As an example, a recent National Bureau of Economic Research (NBER) study released in December 2018 suggested that US power-sector emissions have decreased by 45% since 2008. This dramatic change is fundamentally driven by two factors: (1) a sharp increase in the renewable-energy portfolio; and (2) the substantial replacement of coal-fired power plants with natural gas and other resources.

Due to the supply-side transformation, in order to maintain or improve the quality of electricity services to end-users, it is necessary to make use of real-time information that is becoming readily available thanks to the proliferation of information and communication technology (ICT) infrastructures.

Decision making in the electric grid enabled by real-time information occurs at multiple time scales, which makes the problem even more interesting. At the time scale of 15 minutes or longer, a central challenge lies in how best to allocate different resources to balance fluctuating demand at the lowest cost. The underlying mathematical problem is typically formulated as a nonlinear multi-stage optimization problem. Another big area of investment in real-time decision making (RTDM) is the proliferation of globally synchronized sensors called synchrophasors. Compared with conventional SCADA systems, synchrophasors offer a sampling rate roughly two orders of magnitude faster, along with synchronized time stamps for electrical variables. They therefore offer new opportunities to detect, locate, and classify anomalies that would not have been possible to capture by conventional means, as well as opportunities to avoid blackouts.
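
As a caricature of the slower-time-scale allocation problem, the sketch below solves a single-period economic dispatch as a linear program (the three generators, their costs and capacities, and the demand are made-up numbers; the operational problem described above is nonlinear and multi-stage).

```python
from scipy.optimize import linprog

costs = [20.0, 35.0, 90.0]        # $/MWh for three hypothetical resources
capacity = [400.0, 300.0, 200.0]  # MW upper limits
demand = 650.0                    # MW to be met in this period

result = linprog(
    c=costs,                                   # minimize total generation cost
    A_eq=[[1.0, 1.0, 1.0]], b_eq=[demand],     # supply must balance demand
    bounds=[(0.0, cap) for cap in capacity],   # generator capacity limits
    method="highs",
)
print("dispatch (MW):", result.x, " total cost ($):", result.fun)
```

The cheapest resources are dispatched first up to their capacity limits, which is the one-period shadow of the multi-stage problem the operators actually face.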

In what follows, we elaborate on the real-time decision-making challenges and opportunities at two different time scales and two different spatial scales. The first one deals with day-ahead decisions on phase balancing in the distribution grid. The second one studies the possibility of real-time localization of anomalies such as forced oscillation in large transmission systems.

A focal area of the Spring 2018 Simons Institute program on Real-Time Decision Making was formalizing and studying various issues in the modern energy grid pertaining to efficiency, reliability and security. Among the research themes, a good amount of effort was devoted to modernizing the optimization core of grid operations [4]. In addition, by tapping into expertise in the program in combinatorial algorithms, concrete collaboration occurred at the interface between combinatorial algorithms and grid optimization. For example, electricity is typically distributed radially to address security concerns, and this translates to selecting a spanning tree in a network with desirable properties. The project below on the distribution grid grew out of frequent conversations that the RTDM program facilitated for researchers with backgrounds in optimization and in power-systems engineering.

CONTINUE READING

Research Vignette: Optimization Against Adversarial Uncertainty

by James R. Lee, University of Washington

The rise of machine learning in recent decades has generated a renewed interest in online decision-making, where algorithms are equipped only with partial information about the world, and must take actions in the face of uncertainty about the future. There are various ways to analyze how much the presence of uncertainty degrades optimization. Two of the most general models (competitive analysis and multi-armed bandits) are “probability free” (aka “worst-case,” or “adversarial”), in the sense that one can measure the performance of algorithms without making distributional assumptions on the input.

Competitive analysis

One such model goes by the name of competitive analysis, and has been studied in theoretical CS since the 1970s. Imagine, for instance, that one maintains a binary search tree (BST) with key-value pairs at the nodes. At every point in time, a key is requested, and the corresponding value is looked up via binary search. Along the search path, the algorithm is allowed to modify the tree using standard tree rotation operations.

The cost of such an algorithm is the average number of binary search steps per key-lookup over a long sequence of requests. To minimize the cost, it is beneficial for the algorithm to dynamically rebalance the tree so that frequently-requested keys are near the root. Sleator and Tarjan had this model in mind when they invented the Splay Tree data structure. Their (still unproven) Dynamic Optimality Conjecture asserts that, for every sequence of key requests, the number of operations used by the Splay Tree is within a constant factor of the number of operations used by any binary search tree, even an offline algorithm that is allowed to see the entire sequence of key requests in advance. (This is a very strong requirement! We make no assumption on the structure of the request sequence, and have to compare our performance to an algorithm that sees the entire request sequence up front.)

In the language of competitive analysis, the conjecture states that the Splay Tree algorithm is “competitive.” In general, an \(\alpha\)-competitive online algorithm is one whose total cost on every input sequence is within an \(\alpha\) factor of the cost incurred by the optimal offline algorithm that sees the whole input sequence in advance.

Multi-armed bandits

Another compelling model arising in online learning is the multi-armed bandit problem. Here, an agent has a set \(\mathcal{A}\) of feasible actions. At time \(t=1,2,3,\ldots\), the agent plays an action \(a_t \in \mathcal{A}\), and only then is the cost of that action revealed. Imagine, for instance, choosing an advertisement \(a_t \in \mathcal{A}\) to present to a user, and then learning afterward the probability that it led to a sale. The goal of an algorithm in this model is to achieve a total cost (often called the loss) that is not too much worse than that of the best fixed action in hindsight. The gap between the two is called the regret.
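
A minimal simulation of this feedback model is given below (the Bernoulli costs and the simple epsilon-greedy strategy are illustrative assumptions, not from the article): in each round only the chosen action's cost is observed, and regret is measured against the best fixed action in hindsight.

```python
import numpy as np

rng = np.random.default_rng(0)
mean_costs = np.array([0.5, 0.4, 0.7])     # hypothetical expected cost of each action
T, n_actions = 5000, len(mean_costs)
counts = np.zeros(n_actions)
cost_sums = np.zeros(n_actions)
total_cost = 0.0

for t in range(T):
    if counts.min() == 0 or rng.random() < 0.1:      # explore
        a = int(rng.integers(n_actions))
    else:                                            # exploit current estimates
        a = int(np.argmin(cost_sums / counts))
    c = float(rng.random() < mean_costs[a])          # only this action's cost is revealed
    counts[a] += 1
    cost_sums[a] += c
    total_cost += c

regret = total_cost - T * mean_costs.min()           # vs. the best fixed action in hindsight
print(f"total cost = {total_cost:.0f}, regret ≈ {regret:.0f}")
```

More refined strategies (UCB, EXP3, and their relatives) achieve regret growing only sublinearly in T, even against adversarially chosen costs.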

CONTINUE READING

Research Vignette: Ramsey Graphs and the Error of Explicit 2-Source Extractors

by Amnon Ta-Shma, Tel Aviv University

About a century ago, Ramsey proved that any graph on N vertices contains a monochromatic subset of size ½ log(N), i.e., a set of vertices that forms either a clique or an independent set.1 In 1947, Erdős proved there exist graphs with no monochromatic set of size 2 log(N).2 Erdős’ proof is one of the first applications of the probabilistic method. It is a simple, straightforward counting argument, and like many other counting arguments, it shows almost any graph is good without giving any clue as to any specific such graph. Erdős offered $100 for finding an explicit K-Ramsey graph (a graph in which no set of size K is monochromatic) for K = O(log N). The best explicit construction until a few years ago achieved only log(K) = log^c(N) for some constant c < 1.3

All of this dramatically changed last year. Cohen,4 and independently Chattopadhyay and Zuckerman,5 constructed K-Ramsey graphs with log K = polyloglog(N). Remarkably, Chattopadhyay and Zuckerman explicitly construct the stronger object of a two-source extractor. A function E : [N] × [N] → {0,1} is a (K, ε) 2-source extractor if for every two cardinality-K subsets A, B ⊆ [N], the distribution E(A,B) (obtained by picking a ∈ A and b ∈ B uniformly and independently and outputting E(a,b)) is ε-close to uniform in variational distance. Roughly speaking, the Ramsey graph problem is to find a (K, ε) 2-source extractor with any error ε = ε(K) smaller than 1, even allowing error that is exponentially close to 1. In contrast, the CZ construction gives a 2-source extractor with constant error. A 2-source extractor is a basic object of fundamental importance, and the previous best 2-source construction was Bourgain’s extractor, requiring log(K) ≥ (1/2 − α) log(N) for some small constant α > 0. The CZ construction is an exponential improvement over that, requiring only log(K) = polyloglog(N).
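
The definition can be checked by brute force on a toy scale. The sketch below (illustrative code; the parameters and the uniformly random choice of E are assumptions, and only sampled pairs of subsets are examined rather than all of them) estimates the worst bias of a one-bit E(A,B) over random cardinality-K subsets.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, trials = 64, 16, 2000
E = rng.integers(0, 2, size=(N, N))             # a random two-coloring of [N] x [N]

worst = 0.0
for _ in range(trials):
    A = rng.choice(N, size=K, replace=False)
    B = rng.choice(N, size=K, replace=False)
    p_one = E[np.ix_(A, B)].mean()              # Pr[E(a, b) = 1] over uniform a in A, b in B
    worst = max(worst, abs(p_one - 0.5))        # variational distance from a uniform bit
print(f"worst sampled bias ≈ {worst:.3f}")
```

For a random E this bias is small, echoing Erdős' counting argument; the whole difficulty is to achieve the same guarantee with an explicit, efficiently computable E.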

How does the CZ construction work? Roughly speaking, the first step in the CZ construction is to encode a sample a from one source with a t-non-malleable extractor. I do not want to formally define non-malleability here, but this essentially means that the bits of the encoded string are “almost” t-wise independent, in the sense that except for a few “bad” bits, any t bits we look at are close to uniform. The next step in the CZ construction is to use the sample b from the second source to sample a substring of the encoding of a. The sampling is done using an extractor and uses the known relationship between extractors and samplers.6 Finally, a deterministic function, conceptually similar to the Majority function, is applied to the bits of the substring.

The CZ construction achieves log K = polyloglog(N), whereas the non-explicit argument of Erdős shows log K = loglog(N) + 1 is sufficient. The first bottleneck in the CZ construction, pointed out by Cohen and Schulman,7 is the use of extractors as samplers in the construction. In a recent work, Ben-Aroya, Doron, and I showed how to solve this problem using samplers with multiplicative error.8 Furthermore, Dodis et al. showed such samplers are related to low entropy-gap condensers, and Yevgeniy Dodis gave a series of talks on such condensers and their applications, mainly in cryptography, during the Spring 2017 Simons Institute program on Pseudorandomness.9 With this, the currently best explicit construction has log K = loglog(N) · polylogloglog(N). The extra polylogloglog(N) factor is because current explicit t-non-malleable extractor constructions, even for constant t, have a suboptimal dependence on ε.

Yet, in my opinion, the most pressing bottleneck is of a completely different nature. The CZ result efficiently constructs a two-source (K, ε) extractor for small values of K, but a large error ε. Specifically, the algorithm computing E(a,b) has running time poly(1/ε), so if we take explicitness to mean running time polynomial in the input length log(N), we can only hope for error that is 1/polylog(N). In contrast, a straightforward probabilistic method argument shows we can hope for an ε that is polynomially small in K. The currently best low-error constructions are Bourgain’s 2-source extractor, requiring log(K) ≥ (1/2 − α) log(N), and Raz’s extractor, which allows one source to be (almost) arbitrarily weak but requires the other source to have min-entropy rate above half (the entropy rate is the min-entropy divided by the length of the string). At the Simons Institute program on Pseudorandomness, we wondered whether the CZ approach, which allows both sources to be weak, can somehow be modified to give a low-error construction.

CONTINUE READING

Research Vignette: Promise and Limitations of Generative Adversarial Nets (GANs)

by Sanjeev Arora, Princeton University and Institute for Advanced Study

If we are asked to close our eyes and describe an imaginary beach scene, we can usually do so in great detail. Can a machine learn to do something analogous, namely, generate realistic and novel images different from those it has seen before? One expects that some day, machines will be creative and be able to generate even new essays, songs, etc., but for this article, let’s discuss only images. In machine learning, this goal of generating novel images has been formalized as follows. We hypothesize that realistic images are drawn from a probability distribution on the (vast) space of all possible images. Humans appear to be able to imagine novel samples from this vast distribution after having seen a reasonably small number of examples that arose in their personal experience. We would like machines to do the same: sample from the distribution of all realistic images. While this framing of the problem appears anticlimactic – reducing creativity/imagination to the more prosaic act of learning to generate samples from a vast distribution – it is nevertheless a powerful framework that already presents difficult computational hurdles. Many past approaches for solving this distribution learning problem used explicit statistical models, and tended to end in failure (or at best, tepid success) because real life data is just too complicated to capture using simple models.

This article is about Generative Adversarial Nets (GANs), a proposal by Goodfellow et al. in 20141 to solve this task by harnessing the power of large-scale deep learning (sometimes also called neural net training). Specifically, in the last 5-6 years, deep learning has become very successful at teaching machines to recognize familiar objects in images, in the sense that they can give labels to scenes such as street, tree, person, bicyclist, parked cars with precision approaching or exceeding that of humans. Perhaps this ability can lead naturally to a solution of the sampling problem?

CONTINUE READING

Research Vignette: Setting Posted Prices Under Uncertainty

by Amos Fiat, Tel Aviv University

Selfish Behaviour and Uncertainty
The author had the undeserved great fortune to be invited to two semesters at the Simons Institute for the Theory of Computing. The author had a truly wondrous time and is immensely grateful for this fantastic opportunity. The opportunity to hear and interact with the amazing people attending was priceless. Thus, when asked to write a short Research Vignette,1 the author felt that it would be churlish to object strenuously. One can but hope that this is not yet another example of an error in judgment.2

Aspects of uncertainty have been studied in many disciplines, including philosophy, psychology, physics, the life sciences, economics and computer science, among others. In this Vignette, we address some aspects and models of uncertainty in statistics, computer science, and economics.

Temporal uncertainty occurs where the future or certain aspects of the future are unknown. Decisions made today (e.g., agreeing to write this article) may have unforeseen consequences tomorrow (when the disastrous NYT critique is published). Optimal stopping theory,3 prophet inequality settings,4 secretary settings,5 and competitive analysis of online algorithms6 all deal with aspects of and models for temporal uncertainty. This vast body of work includes both Bayesian and worst-case models.

Another element of uncertainty arises when selfish agents interact. Clearly, it is useful to know how much a customer is willing to pay before trying to sell her something. Given that the techniques used by Torquemada are often condemned,7 one needs to be more subtle when considering the impact of private information.

One approach is to consider rational behavior and study resulting equilibria.8 Pricing equilibria (competitive equilibria) have been studied in many settings.9 Social choice theory10 and mechanism design11 seek to engineer algorithms so as to achieve certain desired goals. A wide variety of equilibrium notions have been defined, including dominant strategy, Bayes-Nash, and many others.

The computer science outlook is to quantify things – in this case, how good an outcome arises in equilibria when compared to optimal outcomes,12 and what mechanisms can be implemented in poly time.13

The setting of interest
Consider a model of online decision-making mechanisms in which selfish agents arrive sequentially, and the mechanism decides upon an outcome and payment for each arriving agent, where payments are used to align the incentives of the agent with that of the mechanism. There is temporal uncertainty because the preferences of the agents (their types) are unknown. The agent may have an opportunity to specify her type, but the decisions made by the mechanism might not be in the best interest of the agent. This may result in the agent strategically misrepresenting her preferences so as to achieve a better outcome for herself. A mechanism is truthful if it is always in the best interest of the agent to report her type truthfully.

For some time I have been obsessed with a specific class of truthful online mechanisms that take the form of dynamic posted prices.14 The assumption here is that the future is entirely unknown, and can be determined adversarially with no guarantees on future behavior. Dynamic pricing schemes are truthful online mechanisms that post prices for every possible outcome, before the next agent arrives. Then, the agent chooses the preferred outcome — minimizing the cost for the outcome plus the price tag associated with the outcome.

Online problems are either minimization problems (e.g., minimize the sum of costs) or maximization problems (e.g., maximize the number of agents served). Although a maximization objective can be recast as a minimization objective (and vice versa), the competitive ratio can be quite different in the two settings. Online algorithms have been devised with both maximization and minimization objectives. The technique of “classify and randomly select” often gives simple randomized algorithms for maximization objectives, which also naturally translate into truthful mechanisms. In contrast, minimization objectives (e.g., k-server and makespan) require entirely different techniques. Converting online algorithms into mechanisms without performance degradation opens up an entire new class of problems for which incentive-compatible mechanism design is applicable.

We note there are many other problems where posted prices play a major role, within widely differing models, information states, and various stochastic assumptions.15

A dynamic pricing scheme is inherently truthful, since prices are determined irrespective of the type of the next agent. Posted price mechanisms have many additional advantages over arbitrary truthful online mechanisms. In particular, such mechanisms are simple: agents need not trust or understand the logic underlying the truthful mechanism, agents are not required to reveal their type, and there is no need to verify that the agents follow the decision made by the truthful online mechanism.

A posted price mechanism is a truthful online algorithm, and as such, can perform no better than the best online algorithm. The main goal of this approach is to study the performance of dynamic posted price mechanisms (quantified by the competitive ratio measure) and compare them with the performance of the best online algorithm. One may think of this problem as analogous to one of the central questions in algorithmic mechanism design in offline settings: compare the performance of the best truthful mechanism (quantified by the approximation ratio measure) with the performance of the best non-truthful algorithm.

In a paper at EC 2017, Feldman, Roytman and I presented constant competitive dynamic pricing schemes for makespan minimization in job scheduling. Events represent jobs; the job type contains the job’s processing times on various machines. Agents seek to complete their job as soon as possible, and therefore prefer to be assigned to a machine whose load (including the new job) is minimized.16 One can consider so-called related machines (where machines have speeds, and jobs have some quantity of work to be done) and unrelated machines (where the time associated with a job arbitrarily depends on the machine). Previous online algorithms for the problem17 are not truthful — in that a job may misrepresent its size so as to get a preferential assignment to a machine.

An online truthful mechanism for this setting determines an allocation and payment for each arriving agent upon arrival. That is, upon the arrival of a job, based on the job’s processing times, the mechanism assigns the job to some machine and determines the payment the agent should make. The cost of an agent is the sum of the machine’s load (including her own processing time) and the payment. Each agent seeks to minimize her cost.

A dynamic posted price mechanism for this setting sets prices on each machine, before the next agent arrives (prices may change over time). The next agent to arrive seeks to minimize her cost, i.e., the load on the chosen machine (including her own load) plus the posted price on the machine. The agent breaks ties arbitrarily.

An example of dynamic pricing: Makespan minimization, related machines
We now motivate how dynamic pricing is useful, via a small example. We do so by comparing the schedule produced without any pricing to the schedule produced via a dynamic pricing scheme. Schedules obtained without pricing are equivalent to schedules produced by the greedy algorithm that assigns each job j to a machine i that minimizes ℓi(j − 1) + pj/si (the load on machine i prior to the arrival of job j, plus the additional time required for job j itself).

Figure 1: Related machines: the optimal makespan, the greedy algorithm, and dynamic pricing. The algorithm for setting dynamic prices is missing.


In our example (given as Figure 1) there are m = 3 machines, with speeds s1 = 1/2, s2 = (1 + ϵ)/2, and s3 = 1 + 2ϵ; and n = 3 jobs, with sizes p1 = (1 + ϵ)/2, p2 = 1/2, and p3 = 1 + 2ϵ. The left, middle, and right columns show the optimal assignment, the greedy assignment, and the assignment obtained by our dynamic pricing scheme, respectively. In the middle and right columns, the arrival order is from bottom to top.

Optimal makespan: The optimal makespan is L = 1, achieved by assigning job 1 to machine 2, job 2 to machine 1, and job 3 to machine 3.

Greedy: The greedy algorithm assigns job 1 to machine 3, since the machine i that minimizes p1/si is the fastest machine (initially, all loads are 0). Job 2 is also assigned to machine 3, since ℓ3(1) + p2/s3 = (1 + ϵ/2)/(1 + 2ϵ) < 1/(1 + ϵ) = p2/s2 < 1 = p2/s1 (for sufficiently small ϵ > 0). Lastly, job 3 is also assigned to machine 3, since ℓ3(2) + p3/s3 = (1 + ϵ/2)/(1 + 2ϵ) + 1 < 2(1 + 2ϵ)/(1 + ϵ) = p3/s2 < 2(1 + 2ϵ) = p3/s1. Hence, the greedy algorithm assigns all jobs to machine 3, resulting in a makespan of (2 + 5ϵ/2)/(1 + 2ϵ) ≈ 2.

Pricing scheme: Our dynamic pricing scheme sets prices before the arrival of each job, which are independent of the type of the incoming job. We omit the explanation as to how to set these prices.18 The important aspect is that these prices are set before knowing what the next job will be. The performance guarantee must hold irrespective of the future. Let cij be the cost of job j on machine i; this is the sum of the completion time and the additional posted price currently on the machine.

Job 1 chooses machine 1, since c11 = p1/s1 + π11 = 1 + ϵ < 1 + π21 = p1/s2 + π21 = c21 < p1/s3 + π31 = c31.

Prior to the arrival of job 2, the dynamic pricing scheme sets new prices (see Figure 1), and job 2 chooses machine 2, since c22 = p2/s2 + π22 < ℓ1(1) + p2/s1 = (1 + ϵ) + 1 = c12 < p2/s3 + π32 = c32.

Finally, prior to the arrival of job 3, the dynamic pricing scheme sets yet new prices, and job 3 chooses machine 3, since c33 = p3/s3 + π33 = 1 + π33, while c13 = ℓ1(2) + p3/s1 = 1 + ϵ + 2p3 ≈ 3 and c23 = ℓ2(2) + p3/s2 + π23 = 1/(1 + ϵ) + 2(1 + 2ϵ)/(1 + ϵ) + π23 ≈ 3.

Since machine 1 has the highest load, the schedule produced by our dynamic pricing scheme achieves a makespan of ℓ1(3) = 1 + ϵ. This example can be extended to show that greedy can be as bad as Ω(log m)-competitive, while in contrast our dynamic pricing scheme is O(1)-competitive.
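
The example can be replayed in a few lines of code. In the sketch below, ϵ = 0.01, and the posted prices are illustrative values chosen only so that the three comparisons above come out as stated; the actual prices come from the scheme of the EC 2017 paper (and Figure 1).

```python
eps = 0.01
speeds = [0.5, 0.5 * (1 + eps), 1 + 2 * eps]        # s1, s2, s3
sizes = [0.5 * (1 + eps), 0.5, 1 + 2 * eps]         # p1, p2, p3 (arrival order)

def schedule(prices_per_round):
    """Each arriving job picks the machine minimizing load-after-assignment plus posted price."""
    loads = [0.0] * len(speeds)
    for p, prices in zip(sizes, prices_per_round):
        costs = [loads[i] + p / speeds[i] + prices[i] for i in range(len(speeds))]
        i = min(range(len(speeds)), key=lambda m: costs[m])
        loads[i] += p / speeds[i]
    return max(loads)

greedy_prices = [[0.0, 0.0, 0.0]] * 3                                # no prices: greedy behavior
posted_prices = [[0.0, 0.5, 2.0], [0.0, 0.0, 2.0], [0.0, 0.0, 0.0]]  # hypothetical prices

print("greedy makespan  ≈", round(schedule(greedy_prices), 3))       # close to 2
print("pricing makespan ≈", round(schedule(posted_prices), 3))       # 1 + eps
```

With these prices the jobs land on machines 1, 2, and 3 respectively, reproducing the makespan of 1 + ϵ, while the unpriced greedy run piles everything onto machine 3.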

During a talk in the Fall 2017 Simons Institute program on Algorithms and Uncertainty, I proposed that one study dynamic pricing schemes for flow time; and indeed, one definite outcome of this semester is a paper by Im, Moseley, Pruhs and Stein that comes up with a constant competitive dynamic pricing scheme to minimize the maximum flow time (the maximum time an agent is in the system). This improves on the previous online algorithm for the problem in terms of the competitive ratio, and has the added advantage of being truthful and simple.

Online algorithms, online mechanisms, and dynamic pricing
One critical question is the following: where do the boundaries lie between online algorithms, online mechanisms (not dynamic pricing), and dynamic pricing? In our EC paper, we show that there exist problems – makespan minimization on unrelated machines, where the pair (job, machine) determines the processing time – where good online mechanisms exist but there are no good dynamic pricing schemes.

The connection between online mechanisms and dynamic pricing schemes is not entirely trivial.19 Dynamic pricing schemes are obviously truthful mechanisms, but converting an arbitrary online mechanism into a dynamic pricing scheme may not be possible.

We can show that online mechanisms with certain performance guarantees can be converted into dynamic pricing schemes with the same guarantee, provided they have the following three properties:

  a) The online mechanism must be prompt, i.e., the payment of each agent is determined immediately upon arrival.

  b) Ties in agent utilities do not arise, or can be resolved arbitrarily without harming the performance guarantees. This is not an issue in the mechanism-design setting, because the mechanism may decide how to break ties, but it is an issue with pricing schemes.

  c) The mechanism does not require “voluntary revelation” of agent types: in pricing schemes, the only information available to the dynamic pricing scheme is what the previous agents actually chose; pricing schemes are not direct-revelation schemes.

Afterword
It has been a pleasure discussing these and other problems with researchers visiting the Simons Institute, and at my home site, Tel Aviv University. I am grateful for discussions with and gracious instruction from Nikhil Bansal, Avrim Blum, Alon Eden, Michal Feldman, Anupam Gupta, Ilia Gorlick, Haim Kaplan, Anna Karlin, Dick Karp, Thomas Kesselheim, Bobby Kleinberg, Elias Koutsoupias, Stefano Leonardi, Christos Papadimitriou, Kirk Pruhs, Tim Roughgarden, and Matt Weinberg. In various combinations, we have tried to attack many other online problems via dynamic pricing. Many of these are wide open; some (minor) progress has been made on others. Many an idea arose in the numerous talks and discussions at the Simons Institute.

Notes

1Merriam-Webster online dictionary: 2a) a short descriptive literary sketch; 2b) a brief incident or scene (as in a play or movie).

2According to Aristotle (Poetics, 4th century BCE), an ideal tragedy is caused by an error in judgment by the protagonist, and not by unavoidable error.

3Wald (1945), Arrow, Blackwell and Girshick (1948); Snell (1952).

4Cayley (1875); Krengel, Sucheston and Garling (1977).

5Kepler (1613); Martin Gardner (1960); Gilbert and Mosteller (1966).

6Sleator and Tarjan (1985).

7Recent political developments in North America notwithstanding.

8von Neumann (1928).

9Cournot (1838); Walras (1874); Arrow and Debreu (1954).

10Condorcet (1785); Arrow (1951); Gibbard and Satterthwaite (1973,1975).

11Hurwicz, Maskin and Myerson (2007).

12Koutsoupias and Papadimitriou (1999).

13Nisan and Ronen (1999).

14Fiat, Mansour and Nadav (2008) — Packet routing; Cohen-Addad, Eden, Fiat and Jeż (2015) — Task systems, k-server, and metrical matching; Feldman, Fiat and Roytman (2017) — Makespan minimization; Im, Moseley, Pruhs and Stein (2017), the latter two discussed herein.

15E.g., Myerson’s virtual values, prophet inequalities, secretary problems, combinatorial markets – Feldman, Gavin and Lucier (2013, 2015 and 2016); Envy-free pricings – Guruswami, Hartline, Karlin, Kempe, Kenyon and McSherry (2005); and many others.

16In this interpretation, “load” is the time required by the server to deal with all current jobs in the server queue, and jobs are processed in a first-in, first-out manner, i.e., jobs enter a server queue.

17Awerbuch, Azar, Fiat, Plotkin and Waarts (1993).

18EC 2017 and on the arXiv.

19Many thanks to Moshe Babaioff, Liad Blumrosen, Yannai A. Gonczarowski and Noam Nisan for discussions helpful in clarifying this point.