by Sean Meyn (University of Florida) and Gergely Neu (Pompeu Fabra University)
One theoretical foundation of reinforcement learning is optimal control, usually rooted in the Markovian variety known as Markov decision processes (MDPs). The MDP model consists of a state process, an action (or input) process, and a one-step reward that is a function of state and action. The goal is to obtain a policy (a function from states to actions) that is optimal in some predefined sense. Chris Watkins introduced the Q-function in the 1980s as part of a methodology for reinforcement learning. Given its importance for over three decades, it is not surprising that the question of the true meaning of Q was a hot topic for discussion during the Simons Institute’s Fall 2020 program on Theory of Reinforcement Learning.
This short note focuses on interactions at the start of the program, and research directions inspired in part by these interactions. To start with, who is Q? Was this code for one of Watkins’ friends at Cambridge? The question was posed early on, which led to an online investigation. The mystery was shattered through a response from Chris: we now know that the letter Q stands for quality, not Quinlyn or Quincy. To discuss further questions and potential answers requires some technicalities.
The discounted-cost optimality criterion is a favorite metric for performance in computer science and operations research, and is the setting of the original Q-function formulation. The definition requires a state process \(\{X_k : k\ge 0\}\) and an action (or input) process \(\{A_k : k\ge 0\}\), evolving on respective spaces (which are assumed discrete in this note). There is a controlled transition matrix \(P\) that describes the dynamics: \(X_{k+1}\) is distributed according to \(P(\cdot \mid x,a)\) when \(X_k=x\) and \(A_k=a\), for any action sequence that is adapted to the state sequence.
With \(\gamma\) denoting the discount factor, the Q-function is the solution to a nonlinear fixed-point equation \(T^*Q = Q\) in which \(T^*\) is the Bellman operator: \[\left(T^*Q\right)(x,a) = r(x,a) + \gamma\, \mathbb{E}_{X'\sim P(\cdot \mid x,a)}\left[\max_{a'} Q(X',a')\right]\] This must hold for each state-action pair \((x,a)\), with the maximum over all possible actions. This is a version of the dynamic programming (DP) equation that has been with us for about seven decades.
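To make the operator concrete, here is a minimal numerical sketch on a made-up two-state, two-action MDP (all transition probabilities, rewards, and the discount factor below are purely illustrative). Since \(T^*\) is a \(\gamma\)-contraction in the sup norm, iterating it from any starting point converges to the Q-function:

```python
import numpy as np

# Hypothetical two-state, two-action MDP (all numbers are illustrative).
# P[x, a, x'] = probability of landing in x' from state x under action a.
nS, nA, gamma = 2, 2, 0.9
P = np.zeros((nS, nA, nS))
P[0, 0, 0] = 1.0  # in state 0, action 0 stays put
P[0, 1, 1] = 1.0  # in state 0, action 1 moves to state 1
P[1, 0, 1] = 1.0  # in state 1, action 0 stays put
P[1, 1, 0] = 1.0  # in state 1, action 1 moves to state 0
r = np.array([[0.0, 1.0],   # r(x, a)
              [2.0, 0.0]])

def bellman(Q):
    """(T*Q)(x,a) = r(x,a) + gamma * E_{X' ~ P(.|x,a)}[max_a' Q(X',a')]."""
    # P @ v contracts the last axis: (P @ v)[x, a] = sum_x' P[x, a, x'] v[x']
    return r + gamma * (P @ Q.max(axis=1))

# Value iteration: repeated application of the contraction T* converges to Q.
Q = np.zeros((nS, nA))
for _ in range(500):
    Q = bellman(Q)
```

At convergence `Q` satisfies the fixed-point equation `Q == bellman(Q)` to numerical precision; for this toy MDP the optimal policy is to move to state 1 and stay there.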
The magic of Q-learning, which is based on this DP equation, is that the maximum appears within an expectation. This makes possible the application of Monte Carlo methods to obtain an approximate solution based solely on observations of the actual system to be controlled, or through simulations.
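A minimal sketch of how this plays out in Watkins’ tabular Q-learning: because the maximum sits inside the expectation, each observed transition \((x, a, r, x')\) yields an unbiased sample of \((T^*Q)(x,a)\), and a stochastic-approximation update can be driven by simulation alone. The two-state MDP below is made up for illustration:

```python
import numpy as np

# Made-up deterministic two-state, two-action MDP (illustrative numbers only).
nS, nA, gamma, alpha = 2, 2, 0.9, 0.1
next_state = np.array([[0, 1],   # next_state[x, a]
                       [1, 0]])
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])

rng = np.random.default_rng(0)
Q = np.zeros((nS, nA))
x = 0
for _ in range(20000):
    a = rng.integers(nA)                  # explore uniformly at random
    x_next = next_state[x, a]
    # Watkins' update: r + gamma * max_a' Q(x', a') is an unbiased sample
    # of (T*Q)(x, a), so no model of P is needed -- only the transition.
    Q[x, a] += alpha * (r[x, a] + gamma * Q[x_next].max() - Q[x, a])
    x = x_next
```

With persistent exploration, the iterates approach the solution of the DP equation for this toy model.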
One core idea of modern reinforcement learning (RL) is to find approximate solutions of the DP equation within a function class (e.g., neural networks, as popularized by the deep Q-learning approach of Mnih et al., 2015). While success stories are well-known, useful theory is scarce: we don’t know if a solution exists to an approximate DP equation except in very special settings, and we don’t know if a good approximation will lead to good performance for the resulting policy. We don’t even know if the recursive algorithms that define Q-learning will be stable — estimates may diverge to infinity.
There are many ways to read these negative results, and indeed many articles have been written around this subject. Our own reading is probably among the most radical: without understanding the issues around the existence of solutions to these DP equation approximations or their interpretation, we should search for alternative approximations of dynamic programming suitable for application in RL.
These concerns were raised at Sean Meyn’s boot camp lecture, where he called on listeners to revisit an alternate foundation of optimal control: the linear programming (LP) approach introduced by Manne (1960) and further developed by Denardo (1970) and d’Epenoux (1963). The message was greeted with enthusiasm from some attendees, including Gergely Neu, who responded, “You have blown my mind!” He had been working on his own formulation of this idea, which became logistic Q-learning (more on this below).
The least-squares approach that is central to most traditional RL algorithms (such as DQN) is replaced by an LP, so that we move from \(L_2\) to \(L_\infty\) approximation. There is ample motivation for this point of view:

Lyapunov theory and the theory of inverse dynamic programming provide firm motivation: a good approximation of the DP equation in an \(L_\infty\) sense implies strict bounds on the closed-loop performance of the resulting policy. You can learn more about this theory in standard RL texts and in Meyn’s new RL monograph, to appear this year (Meyn, 2022).

The LP approach reframes the DP approximation as a numerical optimization problem rather than a root-finding problem. Existence of a solution is guaranteed under minimal assumptions, even when working with a restricted set of candidate solutions.
Despite these advantages, LP-based approaches were left out of the mainstream RL toolkit until recently. One reason may be a glaring gap in the framework: the classical LPs are phrased in terms of state-value functions and not Q-functions! This means it was not obvious how to bring in Monte Carlo methods for algorithm design.
In 2020, we independently discovered the following LP that can be used to cast Q-learning as a constrained optimization problem — the so-called DP-LP: \[\begin{aligned} \mbox{minimize } \qquad & \mathbb{E}_{X\sim\nu_0}\left[\max_{a} Q(X,a)\right] \\ \mbox{subject to } \qquad & Q(x,a) \ge r(x,a) + \gamma\, \mathbb{E}_{X'\sim P(\cdot \mid x,a)}\left[\max_{a'} Q(X',a')\right]\end{aligned}\] Without any constraints on the function class, the optimizer is the solution to the DP equation. The constraints amount to the DP inequality \(Q \ge T^* Q\), which tells us that only negative deviations should be penalized! This is to be contrasted with the commonly used squared Bellman error criterion, which penalizes deviations of both signs. In seeking an approximation within a function class \(\mathcal{F}\), it is not at all difficult to ensure that a feasible solution exists, even when there is no theory to ensure that an approximate Bellman optimality operator has a fixed point.
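In the tabular case, the max terms can be handled by the standard device of auxiliary variables \(V(x)\) standing in for \(\max_a Q(x,a)\), which turns the DP-LP into an ordinary linear program. A minimal sketch with `scipy.optimize.linprog` on a made-up two-state MDP follows; the tiny extra weight `eps` on \(Q\) in the objective is our own addition (not part of the DP-LP above), used only to pin \(Q\) to its lower bound \(r + \gamma P V\) at the optimum:

```python
import numpy as np
from scipy.optimize import linprog

# Made-up two-state, two-action MDP (illustrative numbers only).
nS, nA, gamma = 2, 2, 0.9
P = np.zeros((nS, nA, nS))
P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 1] = P[1, 1, 0] = 1.0
r = np.array([[0.0, 1.0], [2.0, 0.0]])
nu0 = np.array([0.5, 0.5])        # initial-state distribution

# Decision variables: Q (flattened, nS*nA entries) followed by V (nS entries),
# where V(x) plays the role of max_a Q(x,a).
nQ = nS * nA
eps = 1e-4                        # tiny weight pinning Q = r + gamma * P V
c = np.concatenate([eps * np.ones(nQ), nu0])

A_ub, b_ub = [], []
for x in range(nS):
    for a in range(nA):
        # V(x) >= Q(x,a)   <=>   Q(x,a) - V(x) <= 0
        row = np.zeros(nQ + nS)
        row[x * nA + a], row[nQ + x] = 1.0, -1.0
        A_ub.append(row); b_ub.append(0.0)
        # Q(x,a) >= r(x,a) + gamma * sum_x' P(x'|x,a) V(x')
        row = np.zeros(nQ + nS)
        row[x * nA + a] = -1.0
        row[nQ:] = gamma * P[x, a]
        A_ub.append(row); b_ub.append(-r[x, a])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * (nQ + nS))
Q_lp = res.x[:nQ].reshape(nS, nA)
```

At the optimum, `res.x[nQ:]` is the optimal value function and `Q_lp` solves the DP equation for this toy model.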
There are a variety of possible ways of turning this insight into practical algorithms. One path (taken by Bas-Serrano et al., 2021) is to replace the hard constraint in the LP with a smooth upper bound on the expected constraint violations, resulting in the following unconstrained minimization problem: \[\mbox{minimize} \quad \mathbb{E}_{X\sim\nu_0}\left[\max_a Q(X,a)\right] + \frac{1}{\eta} \log \mathbb{E}_{(X,A)\sim\mu}\left[e^{\eta\left((T^*Q)(X,A) - Q(X,A)\right)}\right].\] Bas-Serrano et al. (2021) dubbed the above objective function the “logistic Bellman error” and have shown that it enjoys several favorable properties that make it an ideal objective for stochastic optimization. Interactions at the Simons Institute led to the development of another related algorithmic framework called “convex Q-learning,” in which the hard constraints in the LP are replaced with a different continuous approximation (Lu et al., 2021).
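An empirical sketch of this objective (our own illustrative implementation, not the authors’ code): given a batch of sampled transitions, the inner expectation over \(X'\) is replaced by the observed next state, and the log-mean-exp is computed in a numerically stable way:

```python
import numpy as np

def logistic_bellman_error(Q, x, a, r, x_next, nu0_states, gamma=0.9, eta=1.0):
    """Empirical logistic Bellman error for a tabular Q of shape (nS, nA).

    (x, a, r, x_next): arrays of sampled transitions drawn from mu; the
    expectation over X' is replaced by the single observed next state.
    nu0_states: states sampled from the initial distribution nu_0.
    """
    # Scaled empirical Bellman error eta * (T*Q - Q) at each sampled pair
    delta = eta * (r + gamma * Q[x_next].max(axis=1) - Q[x, a])
    # Stable log-mean-exp of the exponentiated constraint violations
    m = delta.max()
    soft_violation = (m + np.log(np.mean(np.exp(delta - m)))) / eta
    return Q[nu0_states].max(axis=1).mean() + soft_violation
```

As \(\eta \to \infty\) the second term approaches the largest Bellman error in the batch, recovering a hard version of the LP constraints; as \(\eta \to 0\) it behaves like an average violation.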
It is straightforward to derive practical algorithms from both approaches by replacing the expectations with sample averages. This approach led to good empirical performance, along with the beginnings of theory for convergence and performance bounds.
The LP approach had been simmering in the background over the prior decade: de Farias and Van Roy (2003) proposed an LP approach for approximate dynamic programming, and Mehta and Meyn (2009) contains foundations for convex Qlearning, including a variant of the DPLP for deterministic systems in continuous time.
The new techniques bring challenges that must be overcome before these LP-based approaches can take on the world. One is that these optimization problems can be very poorly conditioned. Another is that the approximation may greatly underestimate the true Q-function when using a poor choice of function class for approximation. We remain optimistic: just as the RL community has successfully developed numerical recipes for addressing the much more severe issue of nonexistence of fixed points of the approximate Bellman operators, we believe that similar know-how can be developed for LP-based methods if theorists and practitioners join forces. Now that we understand where Q comes from, let us decide together where it will go next!
Acknowledgment. The authors owe a great debt to the Simons Institute for providing inspiration and interactions throughout the fall program. In particular, our joint work (Lu et al., 2021) was in large part an outcome of discussions at the meeting. The LP approach is one theme of the new textbook (Meyn, 2022), and the impact of the program is seen throughout the book.
References

A. S. Manne. Linear programming and sequential decisions. Management Science 6(3), pages 259–267, 1960.

E. V. Denardo. On linear programming in a Markov decision problem. Management Science 16(5), pages 281–288, 1970.

F. d’Epenoux. A probabilistic production and inventory problem. Management Science 10(1), pages 98–108, 1963.

D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6), pages 850–865, 2003.

J. Bas-Serrano, S. Curi, A. Krause, and G. Neu. Logistic Q-learning. In International Conference on Artificial Intelligence and Statistics, pages 3610–3618, 2021.

P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In Conference on Decision and Control, pages 3598–3605, Dec. 2009.

F. Lu, P. G. Mehta, S. P. Meyn, and G. Neu. Convex Q-learning. In American Control Conference, pages 4749–4756. IEEE, 2021.

S. Meyn. Control Systems and Reinforcement Learning. Cambridge University Press (to appear), Cambridge, 2022. Draft manuscript available at https://meyn.ece.ufl.edu/2021/08/01/control-systems-and-reinforcement-learning/

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature 518(7540), pages 529–533, 2015.