by Sean Meyn (University of Florida) and Gergely Neu (Pompeu Fabra University)

One theoretical foundation of reinforcement learning is optimal control, usually rooted in the Markovian variety known as Markov decision processes (MDPs). The MDP model consists of a state process, an action (or input) process, and a one-step reward that is a function of state and action. The goal is to obtain a policy (function from states to actions) that is optimal in some predefined sense. Chris Watkins introduced the Q-function in the 1980s as part of a methodology for reinforcement learning. Given its importance for over three decades, it is not surprising that the question of the true meaning of *Q* was a hot topic for discussion during the Simons Institute’s Fall 2020 program on Theory of Reinforcement Learning.

This short note focuses on interactions at the start of the program, and research directions inspired in part by these interactions. To start with, *who is Q*? Was this code for one of Watkins’ friends at Cambridge? The question was posed early on, which led to an online investigation. The mystery was solved by a response from Chris: we now know that the letter *Q* stands for *quality*, not Quinlyn or Quincy. Discussing further questions and potential answers requires some technical background.

The discounted-cost optimality criterion is a favorite metric for performance in computer science and operations research, and is the setting of the original Q-function formulation. The definition requires a state process \(\{X_k : k\ge 0\}\) and an action (or input) process \(\{A_k : k\ge 0\}\), evolving on respective spaces (which are assumed discrete in this note). There is a controlled transition matrix \(P\) that describes dynamics: \(X_{k+1}\) is distributed according to \(P(\cdot|x,a)\) when \(X_k=x\) and \(A_k=a\), for any action sequence that is adapted to the state sequence.

With \(\gamma\) denoting the discount factor, the Q-function is the solution to a nonlinear fixed-point equation \(T^*Q = Q\) in which \(T^*\) is the Bellman operator: \[\left(T^*Q\right)(x,a) = r(x,a) + \gamma \mathbb{E}_{X'\sim P(\cdot|x,a)}\left[\max_{a'} Q(X',a')\right]\] This must hold for each state-action pair \((x,a)\), with the maximum over all possible actions. This is a version of the dynamic programming (DP) equation that has been with us for about seven decades.
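To make the fixed-point equation concrete, here is a minimal Python sketch that applies \(T^*\) repeatedly on a toy MDP until the iterates stop changing. The two-state, two-action model, its transition probabilities, its rewards, and the function names are all invented for illustration and are not taken from the program or from any particular paper. Repeated application is just one simple way to approximate the fixed point when \(\gamma < 1\) and the model is fully known.

```python
# Hypothetical toy MDP: 2 states, 2 actions. All numbers are made up.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]

# P[x][a][y] = probability of moving to state y from state x under action a.
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]
# r[x][a] = one-step reward for taking action a in state x.
r = [
    [1.0, 0.0],
    [0.0, 2.0],
]


def bellman_operator(Q):
    """Apply T* once: (T*Q)(x,a) = r(x,a) + gamma * E[max_a' Q(X',a')]."""
    v = [max(Q[y]) for y in STATES]  # max over actions at each next state
    return [
        [r[x][a] + GAMMA * sum(P[x][a][y] * v[y] for y in STATES)
         for a in ACTIONS]
        for x in STATES
    ]


def solve_fixed_point(tol=1e-10, max_iter=10_000):
    """Iterate Q <- T*Q from Q = 0 until successive iterates agree to tol."""
    Q = [[0.0 for _ in ACTIONS] for _ in STATES]
    for _ in range(max_iter):
        Q_next = bellman_operator(Q)
        gap = max(abs(Q_next[x][a] - Q[x][a])
                  for x in STATES for a in ACTIONS)
        Q = Q_next
        if gap < tol:
            break
    return Q


Q_star = solve_fixed_point()
# A greedy policy picks, in each state, an action maximizing Q_star.
greedy_policy = [max(ACTIONS, key=lambda a: Q_star[x][a]) for x in STATES]
```

In this toy model, action 1 in state 1 earns reward 2 forever, so the corresponding entry of the fixed point is \(2/(1-\gamma) = 20\).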

The magic of Q-learning, which is based on this DP equation, is that the maximum appears within an expectation. This makes possible the application of Monte Carlo methods to obtain an approximate solution based solely on observations of the actual system to be controlled, or through simulations.
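Because the maximum sits inside the expectation, a single observed transition gives an unbiased sample of \((T^*Q)(x,a)\). The sketch below illustrates this with the standard tabular Q-learning recursion on a hypothetical two-state, two-action simulator. The toy model, the constant step size, the uniformly random exploration, and all names are assumptions made for this example; the learner only sees sampled rewards and next states, never the transition matrix.

```python
import random

# Hypothetical toy MDP used only as a simulator. All numbers are made up.
GAMMA = 0.9
ALPHA = 0.05  # constant step size (an assumption; other schedules are common)
STATES = [0, 1]
ACTIONS = [0, 1]
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]
r = [
    [1.0, 0.0],
    [0.0, 2.0],
]


def step(x, a, rng):
    """Simulator: return (reward, next_state) for state x and action a."""
    x_next = rng.choices(STATES, weights=P[x][a])[0]
    return r[x][a], x_next


def q_learning(num_steps=50_000, seed=0):
    """Tabular Q-learning driven by a single simulated trajectory."""
    rng = random.Random(seed)
    Q = [[0.0 for _ in ACTIONS] for _ in STATES]
    x = 0
    for _ in range(num_steps):
        a = rng.choice(ACTIONS)  # uniformly random exploration
        reward, x_next = step(x, a, rng)
        # One-sample estimate of (T*Q)(x,a): the max is taken inside
        # the expectation, so no model of P is needed.
        target = reward + GAMMA * max(Q[x_next])
        Q[x][a] += ALPHA * (target - Q[x][a])
        x = x_next
    return Q


Q_hat = q_learning()
```

With a constant step size the estimates keep fluctuating around the solution rather than converging exactly; this tabular setting is the benign case, and it does not address the stability questions that arise with function approximation.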

One core idea of modern reinforcement learning (RL) is to find approximate solutions of the DP equation within a function class (e.g., neural networks, as popularized by the deep Q-learning approach of Mnih et al., 2015). While success stories are well known, useful theory is scarce: we don’t know if a solution exists to an approximate DP equation except in very special settings, and we don’t know if a good approximation will lead to good performance for the resulting policy. We don’t even know if the recursive algorithms that define Q-learning will be stable: estimates may diverge to infinity.

There are many ways to read these negative results, and indeed many articles have been written around this subject. Our own reading is probably among the most radical: without understanding the issues around the existence of solutions to these DP equation approximations or their interpretation, we should search for alternative approximations of dynamic programming suitable for application in RL.

These concerns were raised at Sean Meyn’s boot camp lecture, where he called on listeners to revisit an alternate foundation of optimal control: the linear programming (LP) approach introduced by Manne (1960) and further developed by Denardo (1970) and d’Epenoux (1963). The message was greeted with enthusiasm from some attendees, including Gergely Neu, who responded, “You have blown my mind!” He had been working on his own formulation of this idea, which became logistic Q-learning (more on this below).
