Trends in Machine Learning Theory

Welcome to ALT Highlights, a series of blog posts spotlighting various happenings at the recent conference ALT 2021, including plenary talks, tutorials, trends in learning theory, and more! To reach a broad audience, the series is disseminated as guest posts on different blogs in machine learning and theoretical computer science. This initiative is organized by the Learning Theory Alliance and is overseen by Gautam Kamath. All posts in ALT Highlights are indexed on the official Learning Theory Alliance blog.

This is the sixth and final post in the series, on trends in machine learning theory, written by Margalit GlasgowMichal Moshkovitz, and Cyrus Rashtchian.

Throughout the last few decades, we have witnessed unprecedented growth of machine learning. Originally a topic formalized by a small group of computer scientists, machine learning now impacts many areas: the physical sciences, medicine, commerce, finance, urban planning, and more. The rapid growth of machine learning can be partially attributed to the availability of large amounts of data and the development of powerful computing devices. Another important factor is that machine learning has foundations in many other fields, such as theoretical computer science, algorithms, applied mathematics, statistics, and optimization. 

If machine learning is already mathematically rooted in many existing research areas, why do we need a field solely dedicated to learning theory? According to Daniel Hsu, “Learning theory serves (at least) two purposes: to help make sense of machine learning, and also to explore the capabilities and limitations of learning algorithms.” Besides finding innovative applications for existing tools, learning theorists also provide answers to long-standing problems and ask new fundamental questions. 

Modern learning theory goes beyond classical statistical and computer science paradigms by: 

  • developing insights about specific computational models (e.g., neural networks) 
  • analyzing popular learning algorithms (e.g., stochastic gradient descent)
  • taking into account data distributions (e.g., margin bounds or manifold assumptions)
  • adding auxiliary goals (e.g., robustness or privacy), and 
  • rethinking how algorithms interact with and access data (e.g., online or reinforcement learning).

By digging deep into the basic questions, researchers generate new concepts and models that change the way we solve problems and help us understand emerging phenomena.

This article provides a brief overview of three key areas in machine learning theory: new learning paradigms, trustworthy machine learning, and reinforcement learning. We describe the main thrust of each of these areas, as well as point to a few papers from ALT 2021 (the 32nd International Conference on Algorithmic Learning Theory) that touch each of these topics. To share a broader view, we also asked experts in the areas to comment on the field and on their recent papers. Needless to say, this article only scratches the surface. At the end, we point to places to learn more about learning theory.

New Machine Learning Paradigms
The traditional learning theory framework, probably approximately correct (PAC) learning, defines what it means to learn a ground-truth classifier from a candidate class of possible classifiers. Alongside PAC learning is Vapnik-Chervonenkis (VC) theory, which characterizes the number of samples needed and sufficient to learn a classifier from a given class. The generalization analysis from VC theory is restricted to guarantees that hold independently of the data distribution — that is, even for worst-case distributions. Additionally, the VC/PAC learning paradigm suggests that whenever learning is possible, it can be accomplished by choosing the classifier that minimizes loss on the training data, called the empirical risk minimizer (ERM). 

This classical framework unfortunately fails to explain the empirical success of machine learning (ML). “The distribution-free setting, while it comes with the elegant VC theory, turned out to be unsatisfactory,” says Csaba Szepesvári. “Due to the oversimplified setting, the theory could not contribute meaningfully to understanding all kinds of learning methods such as learning with trees, boosting, neural networks, SVMs, or using any other nonparametric methods.” Researchers posit that stronger guarantees should be possible if we leverage natural assumptions about the data distribution, though identifying the right “natural assumptions” is a challenging task. Similarly, understanding which of many possible ERM solutions a learning algorithm chooses may yield better generalization results than those yielded by VC theory. 

Methods that provide distribution-specific guarantees aren’t new to learning theory. A canonical example is known as a margin bound, where the test error of a classifier is analyzed in terms of the margin that separates the different prediction categories. In one of the ALT best papers, Steve Hanneke and Aryeh Kontorovich prove generalization guarantees in terms of the size of the margin for two popular classification algorithms: support vector machines (SVMs) and the perceptron algorithm. The authors answer a core open question, showing that SVMs achieve the optimal margin bound!

Further work at ALT uses an assumption that data lies on a low-dimensional manifold to prove guarantees for generative models. Generative models synthesize original samples, such as images or text, that resemble training data, but without copying the data directly. While so-called generative adversarial networks work well in practice, few guarantees exist because it is challenging to statistically formulate the requirement of generating original samples. Nicolas Schreuder, Victor-Emmanuel Brunel, and Arnak S. Dalalyan consider a new framework in their paper, in which they guarantee originality by outputting a continuous distribution, which ensures that it is very unlikely to output training examples. If the training data is generated from a low-dimensional manifold, they show that it is possible to learn a good generator, which outputs a smooth transformation of a random point.

In another paper, Christopher Tosh, Akshay Krishnamurthy, and Daniel Hsu use distributional assumptions to show when unsupervised methods can use unlabeled data to learn useful representations of data. A linear function trained on these representations and some labeled data can then be used for downstream prediction of the labels. The key idea is to look at when data has multiview redundancy (MVR), which arises, for instance, when data is augmented: Under MVR, each data point can be viewed as a pair (X, Z), and the label Y can be predicted almost as well from X or Z as from the full pair. For instance, each pair might be two halves of an article, or two rotations of an image. The authors show how a theoretical approach called landmark embedding can produce a representation that enables low-error linear classification. Additionally, they analyze when the representations are learned implicitly while training a model to predict whether two views X and Z correspond to the same example, which is close to what is done in practice. 

Another new paradigm considers how the specific training algorithm affects which of many candidate ERM solutions are chosen. This is called the implicit bias of a training algorithm: If there are multiple equally good solutions, then why does an algorithm choose one over the other? This is particularly relevant when studying neural networks, which are typically overparameterized and can be trained to find many solutions with zero empirical risk. In one paper, Ziwei Ji and Matus Telgarsky characterize the implicit bias of using gradient descent to train a linear classifier with a general loss function. They show that the solution relates to the optimizer of a particular smoothed margin function. While this paper does not yield generalization guarantees, this type of implicit bias analysis can sometimes lead to generalization guarantees via margin bounds.

The goal of this area is to go beyond traditional learning theory paradigms by leveraging distributional and algorithmic properties. But a major open question is designing a more general mathematical theory that exploits such properties. Shay Moran offers a standard for new theory: “I hope that in the next 10 years we will develop more realistic models of learning, but I will insist that they still be mathematically clean.”

Trustworthy ML
Machine learning has inspired many new areas and technologies: personalized health care, drug discovery, advertising, résumé screening, credit loans, and more. However, these critical and user-centered applications require a higher standard of testing and verification because mistakes may deeply affect many people. Addressing these challenges has inspired a new field of research centered on making machine learning more trustworthy and reliable, which is the motivation for many of the ALT papers this year as well. An expert in the area, Kamalika Chaudhuri, says, “For my field, which is trustworthy ML, the theoretical goal and challenge remains modeling and frameworks.” She elaborates: “Coming up with new conceptual frameworks for learning has always been one of the core challenges in learning theory since its early days, and it is doubly important now.” Researchers have been exploring this direction in many areas, including privacy, data deletion, robustness, fairness, interpretability, and causality. 

Several ALT 2021 papers cover questions in privacy. The general goal is to understand how to modify existing learning methods to take into account privacy constraints. One of the papers, by Di Wang, Huanyu Zhang, Marco Gaboardi, and Jinhui Xu, considers generalized linear models in a differential privacy model. A central motivation is to understand the role of public, unlabeled data in improving the learnability of these problems in a private setting. Summarizing another direction, Gautam Kamath comments on his paper with co-authors Ishaq Aden-Ali and Hassan Ashtiani, “This paper focuses on a very simple but surprisingly challenging question: Can we learn a general multivariate Gaussian under the constraint of differential privacy? Prior works focused on restricted settings — for example, with bounded parameters or known covariance. We gave the first finite sample complexity bound for this problem, which evidence suggests is near optimal. The next question is to design a computationally efficient algorithm for this problem.” Another paper on privacy, by Mark Cesar and Ryan Rogers, studies the composition of various privacy mechanisms in the context of real-world data analytics pipelines.

Another aspect of respecting user privacy is allowing people to choose to stop sharing their data. Concretely, this means removing their data from data sets and ensuring that existing and future models do not make use of their data in any way. One name for this process is machine unlearning, and the main challenge is removing the data efficiently without retraining all models from scratch. One paper, by Seth Neel, Aaron Roth, and Saeed Sharifi-Malvajerdi, addresses this challenge. They propose ways to strategically update the model by using modified gradient descent methods. They also analyze this approach, and prove new upper and lower bounds for updating models after data deletion with their new optimization algorithm. 

Robust methods for machine learning and statistics aim to provide rigorous guarantees in the presence of outliers or adversarially modified data points. The field of robustness has been steadily growing as researchers uncover more and more models where deviations in the data can lead to unexpected and dramatic changes in model behavior. In ALT 2021, a paper by Jie Shen and Chicheng Zhang covers learning half-spaces nearly optimally even in the presence of malicious noise.

As a final and thought-provoking direction in trustworthy ML, Omer Reingold points out that we need to better understand “the meaning of individual probabilities/risk scores,” which are common ways that ML systems summarize or justify decisions. Ideally, the output of a model should be something that people can interpret directly and use to potentially modify their future actions. He elaborates that it is important to think about “the individual quantities (which imply important decisions) that ML is trying to approximate” and to answer “what does fitting the parameters of a model on the entire population imply for individuals and subcommunities?” This question brings to the forefront the fact that ML systems affect both individuals and groups of people, which is an important consideration when formulating rigorous definitions of fairness (e.g., see this book or this one).

Reinforcement Learning
Reinforcement learning (RL) is a framework for interactive learning where an agent interacts with an environment, and the agent’s actions govern the rewards it receives from the environment. Part of the motivation for studying RL is that relevant problems are everywhere. Sometimes the agents are autonomous vehicles. Other times, they are programs playing games like chess or go. People interact more and more with ML models, and hence, living our lives is actually being a part of a multiagent game where we, humans, and the ML models are the agents. “The most exciting direction in learning theory of recent years,” says Elad Hazan, “is adding rigorous theoretical guarantees to reinforcement learning.” 

The RL environment is typically modeled as a Markov decision process (MDP): a set of states, actions, and transition probabilities that determine the next state and reward given the agent’s current state and action. The agent uses a policy to choose its action from each state with the goal of maximizing its cumulative reward over time. A central challenge is balancing exploration (learning about the environment) and exploitation (spending time choosing actions in states where they can collect high rewards).

In the most basic setting, multi-armed bandits, there is a single state, and each action (or “arm”) leads to a stochastic reward. “Here, the theory is quite mature, though interesting problems remain in connection to the limits of how structure can be exploited,” Csaba Szepesvári says. In two works at ALT, by Marc Jourdan, Mojmír Mutný, Johannes Kirschner, and Andreas Krause and by Thibaut Cuvelier, Richard Combes, and Eric Gourdin, the authors show that efficient exploration is possible in a combinatorial semi-bandit setting. Here, the agent can choose an allowed set of arms in each step, and it receives a distinct reward for each chosen arm. While the number of action choices for the agent is larger, this more detailed feedback makes the problem tractable.

Beyond the stateless bandit setting, ML theorists are still figuring out how fast agents can learn how to play optimally in MDPs with finite state spaces and action spaces. Recent progress on this front has given lower bounds on the sample complexity required for agents to learn the best policy. In one ALT paper, the authors give a unified view of lower bounds for three distinct, but related, problems in RL. The hard MDP instances they construct to show lower bounds are based on hard instances for multi-armed bandit problems.

In the more challenging setting where the state space is infinite, a central question is whether the agent can learn from exploring a finite number of states, and generalize to perform well on unknown areas of the environment. For certain MDPs, generalizing is impossible, but some assumptions on the structure of the MDP may enable generalization. “While algorithm independent problem formulations existed and have been studied in the finite case, a quite recent development is to extend these to the case of ‘large’ environments where the use of function approximation techniques becomes crucial for achieving nontrivial results,” explains Csaba Szepesvári. 

Function approximation has to do with the optimal action-value function, which captures the long-term reward of playing a certain action from a given state. This function can sometimes be approximated by some simple class of functions. One of the strongest such assumptions is linear realizability, where the optimal action-value function is a linear function of some representation of the action and state. In one of the papers receiving a best paper award in ALT, Gellert Weisz, Philip Amortila, and Csaba Szepesvári show that even under this strong assumption of linear realizability, the agent needs a number of samples exponential in the length of the episode or the dimension of the representation in order to generalize. Looking forward, the goal is to follow the lead of these papers and better understand the landscape of sample complexity: When can we learn models with a polynomial number of samples, and when is an exponential number necessary?

Nearly every offline learning problem can be studied in an interactive setting, where inputs arrive in an online fashion and need to be processed immediately, which is common in many real-world settings. Models for interactive machine learning provide a framework for studying problems and algorithms in this more challenging setting. Beyond the MDP setting, interactive learning spans online learning (e.g., no-substitution clustering), nonstochastic control theory (e.g., robust controllers for dynamical systems), online convex optimization, and many more domains. 

We hope this provides a fairly broad view on some of the topics that people are researching right now in learning theory. Of course, there are many more areas that we don’t have space to describe: theory of deep neural networks, quantum algorithms for machine learning problems, human-centered considerations, learning with strategic agents and multiplayer games, convex/nonconvex optimization, federated and distributed learning algorithms, and many more. In general, as Gautam Kamath observes, “A lot of important questions in learning theory arise through interplay between the theoretical and applied machine learning communities.” To have a greater impact, it is important to collaborate with people doing empirical research, and to learn from the front lines about the most interesting phenomena to explain, or the challenges that do not seem surmountable by combining existing tools.

To learn more and to get more involved, we have listed a variety of resources (blogs, workshops, videos, etc.) that can help you get started in this area. As a final motivation for writing this article, we remark that people in the area are keenly aware that we need more young talent to help uncover truth and contribute groundbreaking ideas. As Gautam Kamath puts it, “There are far more interesting questions in learning theory than there are researchers to solve them.”

Places to learn more

Blogs: UCSD ML blog, Off the Convex Path, Windows On Theory, I’m a bandit, Francis Bach’s blog, Differential Privacy blog, Distill, The Gradient


Podcasts: TWIML, Gradient Dissent, Joy of x, The Robot Brains, Talking Machines, TalkRL, Underrated ML

Videos: Simons Institute, IAS deep learning workshop, One World ML, Trustworthy ML, Foundations of Data Science, RL Theory Virtual Seminars, iMSi: The Multifaceted Complexity of Machine Learning, Control Meets Learning 

Acknowledgments: We thank Kamalika Chaudhuri, Elad Hazan, Daniel Hsu, Gautam Kamath, Shay Moran, Omer Reingold, Csaba Szepesvári, and Claire Vernade for helpful comments and thoughtful quotes. We thank Kush Bhatia, Lee Cohen, Neha Gupta, Nika Haghtalab, Max Hopkins, Gautam Kamath, Gaurav Mahajan, and Uri Sherman for helpful feedback on initial drafts.

Margalit Glasgow is a PhD student in Stanford’s Computer Science Department, advised by Mary Wootters. Her research focuses on theoretical machine learning and random matrices.

Michal Moshkovitz is a postdoc at the Qualcomm Institute at UC San Diego. She received her PhD and MSc in computational neuroscience from the Hebrew University of Jerusalem, and her MSc in computer science from Tel Aviv University. Her research focuses on the foundations of AI, exploring how different constraints affect learning. She has worked on bounded-memory learning, explainable machine learning, and online decision-making in unsupervised learning. 

Cyrus Rashtchian is a postdoc at UC San Diego in computer science and engineering, and he received his PhD from the University of Washington. His research focuses on trustworthy machine learning, algorithms for big data, statistical reconstruction, and DNA data storage.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.