Warren B Powell

Professor Emeritus, Princeton University

Chief Analytics Officer, Optimal Dynamics

Recently, the study of sequential decisions has been grouped under a growing umbrella of research called “reinforcement learning.” The problem today is that when you say you are going to solve a problem with “reinforcement learning” (or you want to take a course in “reinforcement learning”), not even specialists in the field know what this is. (Side note: a student in my grad course on this topic described “reinforcement learning” as “stochastic optimization without the math”!)

When reinforcement learning first emerged as a solution approach in the 1980s (the term dates back to work in psychology in the early 1900s), it was a well-defined method for a well-defined class of problems (which I review below). Today, it has broadened to a much wider class of methods, and a much wider class of problems, which the RL community seems to have resisted trying to define. In this webpage, I am going to offer a suggestion.

I will begin with a little history, and an observation of the dramatic growth of the field that has taken place in the last 10 years. I then give a description of reinforcement learning offered by Professor Ben van Roy (Stanford), who associates “reinforcement learning” with a community. Next I offer an alternative strategy, which is to describe the field in terms of a problem class (sequential decision problems) and a set of methods (four classes of policies, described below).

Comments on this webpage are appreciated. Please enter them here.

## A little history

The first use of value function approximations to help overcome the “curse of dimensionality” in dynamic programming occurred in 1959 with a paper coauthored by Richard Bellman (the father of dynamic programming), but after this paper this line of research died in operations research until the late 1980s.

Then, beginning with Paul Werbos’s 1974 dissertation, the optimal control community began using neural networks to approximate a value function (called the “cost-to-go” function in the controls literature), an approach widely used in robotic control problems (a popular topic in reinforcement learning). This line of research has remained very active since then. The optimal control community also introduced novelties such as “linear decision rules” (which are optimal in an important problem class called linear quadratic regulation), and lookahead policies under the name of “model predictive control.” All of this work pre-dated the emergence of reinforcement learning.

In the 1990s, reinforcement learning emerged as a method for approximately solving sequential decision problems using the framework of “Markov decision processes.” The idea is to estimate the value Q(s,a) of being in a state “s” and taking an action “a” using two basic equations:

\hat{q}^n(s^n,a^n) = r(s^n,a^n) + \max_{a'} \bar{Q}^{n-1}(s^{n+1},a')

\bar{Q}^n(s^n,a^n) = (1-\alpha) \bar{Q}^{n-1}(s^n,a^n) + \alpha \hat{q}^n(s^n,a^n)

where r(s^n,a^n) is the reward from being in a state s^n and taking an action a^n (chosen according to some rule), and where we have to simulate our way to the next state s^{n+1} given that we are in state s^n and take action a^n. The parameter \alpha is a smoothing factor (also called a stepsize or learning rate) that may decline over time. These updating equations became known as Q-learning. They are simple and elegant, and minimize the complexities of modeling sequential decision problems.
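These two updating equations can be sketched in a few lines of code. The toy environment below (a five-state line with a reward for reaching the right end), the random state-and-action sampling rule, and the constant stepsize are all illustrative assumptions, not part of Q-learning itself:

```python
import random

random.seed(0)

# Toy environment (an illustrative assumption): states 0..4 on a line;
# actions move left (-1) or right (+1); landing on state 4 pays reward 1.
N_STATES, ACTIONS = 5, (-1, 1)

def step(s, a):
    s_next = max(0, min(N_STATES - 1, s + a))
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return reward, s_next

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha = 0.1                                # stepsize (learning rate)

for n in range(10000):
    s = random.randrange(N_STATES - 1)     # sample a state to update
    a = random.choice(ACTIONS)             # choose a^n by some rule (here: random)
    r, s_next = step(s, a)
    # q_hat^n(s^n,a^n) = r(s^n,a^n) + max_a' Qbar^{n-1}(s^{n+1},a')
    q_hat = r + max(Q[(s_next, ap)] for ap in ACTIONS)
    # Qbar^n(s^n,a^n) = (1-alpha) Qbar^{n-1}(s^n,a^n) + alpha q_hat^n(s^n,a^n)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * q_hat
```

After enough iterations, the estimate Q[(3, 1)] (state 3, move right) approaches the reward of 1 earned by reaching state 4, illustrating how the smoothing step propagates rewards backward through the states.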

Over time, researchers found that Q-learning did not always work (in fact, it often does not work, a statement that applies to virtually every method for solving sequential decision problems). The second edition of Sutton and Barto’s classic book *Reinforcement Learning* (which appeared in 2018) now includes parameterized policies, upper confidence bounding, Q-learning and a class of lookahead policies known as Monte Carlo tree search. Below I provide a brief sketch of four classes of policies, and these methods span all four classes. I claim these four classes are universal, which means *any* method proposed to solve a sequential decision problem will fall in these four (meta) classes.

## A growing field

Over the past decade, the set of activities that fall under “reinforcement learning” has simply exploded. This used to be a pocket community inside computer science doing what others called “approximate (or adaptive) dynamic programming.” Not any more.

According to Google Scholar, there were over 40,000 books and papers containing “reinforcement learning” in year 2020 alone. A quick search with Google Scholar confirms that you can find “reinforcement learning” in journals in computer science (machine learning), statistics, electrical engineering (huge number), operations research, industrial engineering, chemical engineering, civil engineering, materials science, biology, chemistry, physics, economics, finance, social sciences, education, psychology, history, philosophy, politics, linguistics, literature (!!),… (I gave up looking).

The graph to the right shows the growth over the years 2010, 2015 and 2020 in the citations of major terms associated with sequential decisions (sometimes under uncertainty). “Reinforcement learning” stands out not just for its popularity, but for its dramatic growth, with a 234 percent increase from 2015 to 2020 (“stochastic optimization” is also growing quickly, perhaps helping to emphasize the importance of handling uncertainty).

Not surprisingly, the scope of problems has expanded far beyond those familiar to the original core community. Even articles that attempt to communicate the breadth of applications (an example is here) do not come close to capturing all of the problems being addressed under the umbrella of “reinforcement learning.” And not surprisingly, this breadth of problems is pushing the boundaries of methodology. Just as the methods of reinforcement learning grew from the original focus on Q-learning (basically, a form of approximate dynamic programming) to the much wider range of methods in the second edition of Sutton and Barto’s *Reinforcement Learning*, the field, as we speak, is growing into new application domains, sparking continuing growth in the library of methods for solving these problems.

It is for this reason that I think it helps to identify the problem class for this field. Every problem I have ever seen associated with reinforcement learning (which is a lot, but not all) can be described as a “sequential decision problem” which is modeled using the modeling framework I describe below (keep in mind this is just a sketch – click here for a more detailed presentation). Every method I have ever seen which referred to “reinforcement learning” (which is also a lot, but not all) can be described within the four classes of policies that I sketch below (click here for a more detailed summary).

## Reinforcement learning as defined by a community I

At a “reinforcement learning” workshop in 2018 (organized by people in optimal control), Ben van Roy (a renowned RL researcher at Stanford, and early pioneer of the field) described reinforcement learning as:

- A problem class consisting of an agent acting on an environment receiving a reward.
- A community that identifies its work as “reinforcement learning.”
- The set of methods developed by the community, which it self-identifies as “reinforcement learning,” applied to the problem class.

This description effectively describes “reinforcement learning” as any method used by the reinforcement learning community for solving problems that can be described as an agent acting on the environment receiving a reward. The community used to be a fairly well defined set of researchers in computer science, but as the visibility of the field grew (through successes solving games such as chess and Go), so did the breadth of the communities identifying themselves with reinforcement learning.

Ben’s characterization of reinforcement learning effectively says that it is whatever the “RL community” is doing to solve a problem that involves “an agent acting on an environment receiving a reward.” I like the style of defining “reinforcement learning” by how it is used in the community as opposed to some abstract notion (this is how Webster defines words). But we have to be aware that as I outlined above, the scope of problems that fall under this general description is huge and growing quickly.

Further complicating the situation is that other communities working in this space (dynamic programming, optimal control, stochastic optimization, simulation optimization, stochastic programming, …) are going through the same growth in terms of the scope of problems and variety of methods. Today, the overlap between these communities is significant and growing, so much so that even these other communities are starting to describe their work as “reinforcement learning.” The snowball effect cannot be ignored.

Below, I am going to make a clear distinction between a problem class, and methods for solving the problem class. I will start by defining a problem class called “sequential decision problems” that are solved by finding the best policy that can be computed. I then provide an explicit organization of the different types of policies that make it possible for people to be more specific about the solution approach they are using. After this, I revisit the problem of defining “reinforcement learning” at the end. I need to emphasize that I am working to define what the field is doing, and how the terms are being used, rather than suggesting any change. The goal here is to provide a succinct umbrella that describes what is actually happening.

## Sequential decision problems

Sequential decision problems describe any setting where we alternate between making decisions and making observations, where the observations affect the performance of the decisions. *Any* sequential decision problem can be written as the sequence

*state, decision, information, state, decision, information, …*

or using notation (*S* being state, *x* being decision, and *W* being new information):

S^0, x^0, W^1, S^1, x^1, W^2, \ldots, S^N

We are indexing “time” using an iteration counter *n*, but we could use time *t*, in which case we would write the state S_t, decision x_t, and exogenous information W_{t+1}.

Each time we make a decision we receive a contribution C(S^n,x^n). We make decisions x^n with a function (or policy) X^\pi(S^n). For these problems, the decisions x^n could be binary, discrete or continuous, scalar or vector-valued.

The goal is to find the best policy where we might optimize the objective function

\max_\pi \mathbb{E} \left\{\sum_{n=0}^N C(S^n,X^\pi(S^n))|S^0\right\}

There are variations of this objective function; here we are maximizing the cumulative reward, but we might also maximize the final reward after a series of learning iterations. We can also use risk measures instead of expectations.

There are five components of any sequential decision problem:

- The state variable S^n, which contains whatever we know (or believe) after *n* iterations.
- The decision x^n, which is made using a (yet to be designed) policy X^\pi(S^n).
- The exogenous information W^{n+1} (not present in all problems), which is the information that arrives after we choose x^n.
- The transition function, which describes how the state S^n evolves to S^{n+1} given the decision x^n and exogenous information W^{n+1}. I like to write this as S^{n+1} = S^M(S^n, x^n, W^{n+1}).
- The objective function.

Recognizing that objective functions can come in different styles, I claim that these five components describe *any* sequential decision problem. Note that this modeling framework is actually well established in the optimal control community, but it nicely describes the sequential decision problems found in reinforcement learning.
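The five components can be illustrated with a generic simulation loop. Everything problem-specific below (an inventory setting, an order-up-to rule, uniformly distributed demand, the prices) is an illustrative assumption used only to show how the pieces fit together:

```python
import random

random.seed(0)

def policy(state, theta=5):
    """X^pi(S^n): an illustrative order-up-to-theta rule (theta is tunable)."""
    return max(0, theta - state)

def transition(state, decision, demand):
    """The transition function: the next state given S^n, x^n and W^{n+1}."""
    return max(0, state + decision - demand)

def contribution(state, decision, demand, price=10, cost=4):
    """C(S^n, x^n): revenue from satisfied demand minus ordering cost."""
    return price * min(state + decision, demand) - cost * decision

S, total = 0, 0.0                      # initial state S^0
for n in range(100):
    x = policy(S)                      # decision x^n from the policy
    W = random.randint(0, 8)           # exogenous information W^{n+1}
    total += contribution(S, x, W)     # accumulate the objective
    S = transition(S, x, W)            # move to state S^{n+1}
```

The loop makes the state-decision-information sequence explicit: swapping in a different policy function changes the behavior without touching the model, which is exactly the separation between model and policy that the framework emphasizes.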

A key feature of this canonical framework is searching over policies (which are forms of functions). This is comparable to the challenge faced in machine learning where the search is for the best statistical model (which is also a function). However, the class of policies for making decisions is much broader than the set of functions used in machine learning (click here for a video that makes this case more clearly). I describe these next.

**Designing policies**

The core challenge of sequential decision problems is the design of policies (modeling uncertainty is also a nice challenge). There are two broad classes for designing policies:

- **The policy search class** – These are policies that have to be tuned over time to work well in terms of the performance metric (maximizing rewards, minimizing costs, …). This tuning is usually done offline in a simulator, but may be done in the field.
- **The lookahead class** – These policies work by finding the action now, given the state we are in, that maximizes the one-period reward plus an approximation of the downstream value of the state that the action takes us to (which may be random).

Each of these classes is divided into two additional subclasses. The policy search class includes:

- **Policy function approximations (PFAs)** – This is the simplest class, and includes analytical functions that map state to action. These include lookup tables (if it is cold, wear a coat; buy-low, sell-high policies in finance), parametric models (bid for ad-clicks = a - b \times (sales)), and neural networks.
- **Cost function approximations (CFAs)** – These are parametrically modified optimization problems. For example, let \mu_x be the reduction in blood sugar from drug x, and let \bar{\mu}_x be our estimate of the reduction for a new patient. We could just choose the drug that maximizes \bar{\mu}_x, but we have a lot of uncertainty. Let \bar{\sigma}_x be the standard deviation of this estimate. A good policy for choosing a drug to test would be given by x = \arg\max_x \left(\bar{\mu}_x + \theta \bar{\sigma}_x\right), where \theta is a tunable parameter. Increasing \theta encourages trying drugs where there is more uncertainty. Parametric CFAs are widely used in practice, since they can be any deterministic optimization problem where tunable parameters have been introduced to handle uncertainty. For example, using a navigation system to estimate the travel time to the destination, but then adding a buffer for traffic delays, is a form of CFA. Airlines solve CFAs in the form of large integer programs with tunable parameters for schedule slack.
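The drug-selection CFA described above can be sketched in a few lines. The estimates below are assumed data for illustration only:

```python
# Illustrative (assumed) estimates: mean blood-sugar reduction per drug,
# and the standard deviation of each estimate.
mu    = {"A": 3.0, "B": 2.5, "C": 1.0}
sigma = {"A": 0.2, "B": 1.5, "C": 0.3}

def cfa_policy(mu, sigma, theta):
    """CFA: x = argmax_x ( mu_x + theta * sigma_x ), theta a tunable parameter."""
    return max(mu, key=lambda x: mu[x] + theta * sigma[x])

best_now = cfa_policy(mu, sigma, theta=0.0)   # pure exploitation: picks "A"
best_ucb = cfa_policy(mu, sigma, theta=2.0)   # uncertainty bonus: picks "B"
```

With \theta = 0 the policy exploits the current best estimate; raising \theta shifts the choice toward the drug we are most uncertain about, which is the exploration behavior the tunable parameter is meant to control.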

The lookahead class can also be divided into two subclasses:

- **Policies based on value function approximations (VFAs)** – If we are in a state (this might capture how much inventory we have on hand, the state of a game, the location and velocity of a robot), and take an action (e.g. ordering more inventory, making a move, applying forces) that takes us to a new state, we might estimate the value of being in this new state. This estimate (we can rarely do this exactly) is known as a value function approximation. VFAs are comparable to the *Q*-factors in *Q*-learning, and can be approximated using any of the tools from machine learning.
- **Direct lookaheads (DLAs)** – The easiest example of a DLA is a navigation system that decides whether you should turn or go straight by finding the best path (or at least an approximation) to the destination. This path can be updated periodically during the trip. Another example of a DLA is when you plan several moves into the future when playing a game to determine what move you make now.
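A VFA-based policy can be sketched as: choose the action maximizing the one-period reward plus the approximate value of the downstream state. The reward, transition, and the (assumed, already-fitted) value approximation V_bar below are all toy examples:

```python
# An assumed, already-fitted value function approximation over four states.
V_bar = {0: 0.0, 1: 2.0, 2: 3.5, 3: 4.0}

def reward(s, x):
    """Toy one-period contribution: a per-unit cost of acting."""
    return -1.0 * x

def next_state(s, x):
    """Toy deterministic transition, capped at the highest state."""
    return min(3, s + x)

def vfa_policy(s, actions=(0, 1, 2)):
    """Pick the action maximizing reward(s,x) + V_bar(next state)."""
    return max(actions, key=lambda x: reward(s, x) + V_bar[next_state(s, x)])
```

From state 0 the policy pays the action cost to reach the high-value state (vfa_policy(0) returns 2), while from state 3 there is nothing left to gain and it does nothing (vfa_policy(3) returns 0); all of the intelligence lives in how well V_bar was estimated.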

Here is my major claim: **These four classes of policies (PFAs, CFAs, VFAs and DLAs) cover every possible method for making decisions over time that has been used by any of the research communities working on sequential decisions, along with any method used in practice.** In other words, I claim that these four classes are *universal*, which means they include whatever you might be using now (without knowing it).

This does not mean I have solved everyone’s problems… these are meta-classes. Picking any of the four (meta) classes means you still have work to do choosing the precise policy within the meta-class, and you may even design hybrids. It is refining these classes for specific application domains (and even specific problem instances) that represents the bulk of the research in this field. Click here for my book chapter on designing policies.

## Reinforcement learning as defined by a community II

I am going to revisit Ben’s three characteristics of reinforcement learning, beginning with one fairly major tweak. I have two issues with his characterization of “an agent acting on the environment receiving a reward:”

- First, we have to define what it means to “act on the environment.” Is observing the environment an instance of “acting on the environment”? If someone pulls out a pair of binoculars to observe what types of birds are in an area, is that “acting on an environment”?
- Second, what if our “agent” has only one action to choose from, so she is always taking the same action? I am quite sure that no one would view that as a reinforcement learning problem. Implicit in the idea of “acting on the environment” is the idea that there is a choice of actions. In other words, the agent needs to make a decision.

So, my “tweak” to Ben’s characterization of reinforcement learning would go as follows:

- An agent making a decision that produces a contribution (or cost, or some performance metric).
- The set of methods developed by the community, which it self-identifies as “reinforcement learning,” applied to the problem class.
- A community that identifies its work as “reinforcement learning.”

Now, replace “reinforcement learning” with the name of any field working in the broad area of sequential decision problems. This could be dynamic programming, stochastic optimization, optimal control (in discrete time), stochastic search (derivative-free or derivative-based), simulation-optimization, or active learning (also known as the multiarmed bandit problem).

Or, we could draw all these communities together by using:

- Problem class: any sequential decision problem, which might be with or without the exogenous information process (characterized by describing the application setting).
- Methods: any of the four classes of policies (and possibly a hybrid).

With these two dimensions, we can describe the work we are doing by a) describing the application context, and then b) describing the class of policy (or policies, if developing a hybrid) being used. This is much more informative than simply saying we are using “reinforcement learning.”

## For additional background material

I provide an overview of the field I am calling “sequential decision analytics” here. There is a 40-minute video overview of the field here.

I am working on a new book, *Reinforcement Learning and Stochastic Optimization: A unified framework for sequential decisions*. This is a graduate-level text, but designed for a broad audience (that is, anyone with a good course in probability and statistics). It is designed for people who work in applied domains, as well as those with a more analytic bent.

I also have an online book written for an undergraduate course that uses a teach-by-example style. The book uses a range of applications to illustrate modeling, as well as all four classes of policies. The book is called *Sequential Decision Analytics and Modeling*.

There is a video tutorial on my framework from a talk I gave at Northwestern University in February, 2020:

- Part I – This describes the modeling framework, and illustrates all four classes of policies on pure learning problems (also known as bandit problems).
- Part II – This describes the four classes of policies in the context of a much broader class of dynamic problems (I call these “state-dependent problems”).