# Where Do Q-Functions Come From?

by Sean Meyn (University of Florida) and Gergely Neu (Pompeu Fabra University)

One theoretical foundation of reinforcement learning is optimal control, usually rooted in the Markovian variety known as Markov decision processes (MDPs). The MDP model consists of a state process, an action (or input) process, and a one-step reward that is a function of state and action. The goal is to obtain a policy (function from states to actions) that is optimal in some predefined sense. Chris Watkins introduced the Q-function in the 1980s as part of a methodology for reinforcement learning. Given its importance for over three decades, it is not surprising that the question of the true meaning of Q was a hot topic for discussion during the Simons Institute’s Fall 2020 program on Theory of Reinforcement Learning.

This short note focuses on interactions at the start of the program, and research directions inspired in part by these interactions. To start with, who is Q? Was this code for one of Watkins’ friends at Cambridge? The question was posed early on, which led to an online investigation. The mystery was shattered through a response from Chris: we now know that the letter Q stands for quality, not Quinlyn or Quincy. To discuss further questions and potential answers requires some technicalities.

The discounted-cost optimality criterion is a favorite metric for performance in computer science and operations research, and is the setting of the original Q-function formulation. The definition requires a state process $$\{X_k : k\ge 0\}$$ and an action (or input) process $$\{A_k : k\ge 0\}$$, evolving on respective spaces (which are assumed discrete in this note). There is a controlled transition matrix $$P$$ that describes dynamics: $$X_{k+1}$$ is distributed according to $$P(\cdot|x,a)$$ when $$X_k=x$$ and $$A_k=a$$, for any action sequence that is adapted to the state sequence.

With $$\gamma$$ denoting the discount factor, the Q-function is the solution to a nonlinear fixed-point equation $$T^*Q = Q$$ in which $$T^*$$ is the Bellman operator: $\left(T^*Q\right)(x,a) = r(x,a) + \gamma \mathbb{E}_{X’\sim P(\cdot|x,a)}\left[\max_{a’} Q(X’,a’)\right]$ This must hold for each state-action pair $$(x,a)$$, with the maximum over all possible actions. This is a version of the dynamic programming (DP) equation that has been with us for about seven decades.

The magic of Q-learning, which is based on this DP equation, is that the maximum appears within an expectation. This makes possible the application of Monte Carlo methods to obtain an approximate solution based solely on observations of the actual system to be controlled, or through simulations.

One core idea of modern reinforcement learning (RL) is to find approximate solutions of the DP equation within a function class (e.g., neural networks, as popularized by the deep Q-learning approach of Mnih et al., 2015). While success stories are well-known, useful theory is scarce: we don’t know if a solution exists to an approximate DP equation except in very special settings, and we don’t know if a good approximation will lead to good performance for the resulting policy. We don’t even know if the recursive algorithms that define Q-learning will be stable — estimates may diverge to infinity.

There are many ways to read these negative results, and indeed many articles have been written around this subject. Our own reading is probably among the most radical: without understanding the issues around the existence of solutions to these DP equation approximations or their interpretation, we should search for alternative approximations of dynamic programming suitable for application in RL.

These concerns were raised at Sean Meyn’s boot camp lecture, where he called on listeners to revisit an alternate foundation of optimal control: the linear programming (LP) approach introduced by Manne (1960) and further developed by Denardo (1970) and d’Epenoux (1963). The message was greeted with enthusiasm from some attendees, including Gergely Neu, who responded, “You have blown my mind!” He had been working on his own formulation of this idea, which became logistic Q-learning (more on this below).

The least-squares approach that is central to most traditional RL algorithms (such as DQN) is replaced by an LP, so that we move from $$L_2$$ to $$L_\infty$$ approximation. There is ample motivation for this point of view:

• Lyapunov theory and the theory of inverse dynamic programming provide firm motivation: a good approximation of the DP equation in an $$L_\infty$$ sense implies strict bounds on closed-loop performance of the resulting policy. You can learn more about this theory in standard RL texts, and Meyn’s new RL monograph to appear this year.

• The LP approach reframes the DP approximation as a numerical optimization problem rather than a root-finding problem. Existence of a solution is guaranteed under minimal assumptions, even when working with a restricted set of candidate solutions.

Despite these advantages, LP-based approaches were left out of the mainstream RL tool kit until recently. One reason may be a glaring gap in the framework: the classical LPs are phrased in terms of state-value functions and not Q-functions! This means it was not obvious how to bring in Monte Carlo methods for algorithm design.

In 2020, we independently discovered the following LP that can be used to cast Q-learning as a constrained optimization problem — the so-called DPLP: \begin{aligned} \mbox{minimize } \qquad & \mathbb{E}_{X\sim\nu_0}\left[\max_{a} Q(X,a)\right] \\ \mbox{subject to } \qquad & Q(x,a) \ge r(x,a) + \gamma \mathbb{E}_{X’\sim P(\cdot|x,a)}\left[\max_{a’} Q(X’,a’)\right]\end{aligned} Without any constraints on the function class, the optimizer is the solution to the DP equation. The constraints amount to the DP inequality $$Q \ge T^* Q$$, which tells us only negative deviations should be penalized! This is to be contrasted with the commonly used squared Bellman error criterion that penalizes deviations of both signs. In seeking an approximation within a function class $$\mathcal{F}$$, it is not at all difficult to ensure a feasible solution exists, even when there is no theory to ensure that an approximate Bellman optimality operator has a fixed point.

There are a variety of possible ways of turning this insight into practical algorithms. One path (taken by Bas-Serrano et al., 2021) is to replace the hard constraint in the LP with a smooth upper bound on the expected constraint violations, resulting in the following unconstrained minimization problem: $% \mbox{minimize } \mathbb{E}_{X\sim\nu_0}[\max_a Q(X,a)] + \frac 1\eta \log % \mathbb{E}_{(X,A)\sim\mu}[e^{\eta(r(X,A) + \gamma \mathbb{E}_{X’\sim P(\cdot|X,A)}[\max_{a’} Q(X’,a’)] – Q(X,A))}]. \mbox{minimize} \;\;\;\;\;\;\,\, \mathbb{E}_{X\sim\nu_0}\left[\max_a Q(X,a)\right] + \frac 1\eta \log \mathbb{E}_{(X,A)\sim\mu}\left[e^{\eta((T^*Q)(X,A) – Q(X,A))}\right].$ Bas-Serrano et al. (2021) dubbed the above objective function the “logistic Bellman error” and have shown that it enjoys several favorable properties that make it an ideal objective for stochastic optimization. Interactions at the Simons Institute led to the development of another related algorithmic framework called “convex Q-learning,” in which the hard constraints in the LP are replaced with a different continuous approximation (Lu et al., 2021).

It is straightforward to derive practical algorithms from both approaches by replacing the expectations with sample averages. This approach led to good empirical performance, along with the beginnings of theory for convergence and performance bounds.

The LP approach had been simmering in the background over the prior decade: de Farias and Van Roy (2003) proposed an LP approach for approximate dynamic programming, and Mehta and Meyn (2009) contains foundations for convex Q-learning, including a variant of the DPLP for deterministic systems in continuous time.

The new techniques bring challenges that must be overcome before these LP-based approaches can take on the world. One is that these optimization problems can be very poorly conditioned. Another is that the approximation may greatly underestimate the true Q-function when using a poor choice of function class for approximation. We remain optimistic: just like the RL community has successfully developed numerical recipes for addressing the much more severe issue of nonexistence of fixed points to the approximate Bellman operators, we believe that similar know-how can be developed for LP-based methods if theorists and practitioners join forces. Now that we understand where Q comes from, let us decide together where it will go next!

Acknowledgment. The authors owe a great debt to the Simons Institute for providing inspiration and interactions throughout the fall program. In particular, our joint work (Lu et al., 2021) was in large part an outcome of discussions at the meeting. The LP approach is one theme of the new textbook (Meyn, 2022), and the impact of the program is seen throughout the book.

• A. S. Manne. Linear programming and sequential decisions. Management Science 6(3), pages 259–267, 1960.

• E. V. Denardo. On linear programming in a Markov decision problem. Management Science 16(5), pages 281-288, 1970.

• F. d’Epenoux. A probabilistic production and inventory problem. Management Science 10(1), pages 98-108, 1963.

• D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6), pages 850–865, 2003.

• J. Bas-Serrano, S. Curi, A. Krause, and G. Neu. Logistic Q-learning. In International Conference on Artificial Intelligence and Statistics, pages 3610-3618, 2021.

• P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In Conference on Decision and Control, pages 3598–3605, Dec. 2009.

• F. Lu, P. G. Mehta, S. P. Meyn, and G. Neu. Convex Q-learning. In American Control Conference, pages 4749–4756. IEEE, 2021.

• S. Meyn. Control Systems and Reinforcement Learning. Cambridge University Press (to appear), Cambridge, 2022. Draft manuscript available at https://meyn.ece.ufl.edu/2021/08/01/control-systems-and-reinforcement-learning/

• V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature 518 (7540), pages 529-533, 2015.

This site uses Akismet to reduce spam. Learn how your comment data is processed.