Deep learning has advanced to the point where we’re seeing computers do things that would have been considered science fiction just a few years ago. Areas such as language translation, image captioning, picture generation, and facial recognition display major advances on a regular basis.
But certain artificial intelligence problems might not mesh well with deep learning’s traditional training algorithms, and these areas might require new ways of thinking.
Neural networks learn by taking tiny steps in the direction of an adequate solution (Figure 1). This means that the terrain neural networks navigate, defined by the loss function, needs to be relatively smooth.
But many real-life situations don’t provide anything close to the continuous loss function that neural networks require (Figure 2).
For instance, natural language processing (NLP) poses many challenges that can’t be solved through traditional machine learning gradient descent.
Let’s say we want an AI system to rewrite text into a more elegant form, and that a hypothetical “language effectiveness score” measures how clear, concise, and polished a sentence is.
Figure 3 shows two sentences that convey the same intent, but one is much better written than the other. We’d like our NLP writing system to be able to convert text 1 into text 2, thereby improving our language score from 0.9 to 1.4.
It would be difficult to train a neural network to accomplish this because there is no clear path from one text to the other: The loss function is undefined in most regions because there are few, if any, valid sentences between the scores of 0.9 and 1.4.
What if we could let our system experiment with language and explore new possibilities that perhaps even humans haven’t thought of? This is how intelligent agents like robots are trained in reinforcement learning (RL), and the most successful learning techniques for robots can be used in areas like NLP.
Deep Reinforcement Learning
Deep learning uses large, multilayered neural networks to analyze complex patterns observed in the world. Reinforcement learning is a type of machine learning in which an intelligent system learns to accomplish tasks by exploring its world and receiving feedback from its environment. Deep reinforcement learning (deep RL) combines deep neural networks with robot-style adventure, and we’re just beginning to see what a potent combination this can be.
Fundamentals of Reinforcement Learning
The star of the show in RL is the agent, which observes itself in a state and takes actions in its environment. Every time the agent takes an action or declines to take an action, the environment supplies the agent with a reward along with a new state. This process repeats indefinitely or until the agent reaches a terminal state.
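The loop just described can be sketched in a few lines. The toy corridor environment and the random placeholder policy below are invented for illustration, not from any RL library:

```python
import random

class CorridorEnv:
    """Toy environment: the agent starts at position 0 and must reach 3."""

    def reset(self):
        self.pos = 0
        return self.pos                       # the initial state

    def step(self, action):                   # action is -1 (left) or +1 (right)
        self.pos = max(0, self.pos + action)  # the environment updates its state
        done = self.pos == 3                  # terminal state reached?
        reward = 1.0 if done else -0.1        # reward signal back to the agent
        return self.pos, reward, done

random.seed(0)
env = CorridorEnv()
state, done = env.reset(), False
while not done:                               # the agent/environment loop
    action = random.choice([-1, 1])           # placeholder "policy"
    state, reward, done = env.step(action)
```

A real agent would replace `random.choice` with a learned policy, but the state–action–reward cycle stays the same.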
This describes model-free RL because the agent is interacting directly with its environment (Figure 4). In model-based RL, there is a separate entity (the model) that receives the reward signals and updated states from the environment (Figure 5). The model tries to learn as much as it can about its world so that it can convey accurate feedback to the agent, which is now talking to the model rather than to the environment.
Sometimes the model is known in advance, such as through the laws of nature or rules in a game. Other times, model-based RL can be difficult to implement because it’s not easy to model a complicated, unpredictable world. But if it is feasible to employ such a model, it can ease RL training because often a model can provide more comprehensible information to the agent than the agent would receive directly from its surroundings.
How does an agent decide which actions to take? This is determined by the agent’s policy. There are several ways to train a policy, each of which has strengths and drawbacks. The approaches described below rely on a fitted model that can be of any type: a linear regression, a decision tree, a neural network, etc.
In this age of deep learning, neural networks seem to be the strongest contenders. It is unlikely that an RL agent will be able to train on every possible situation it could encounter. Deep neural networks are very good at generalizing from experience to situations they haven’t seen before, and this is one reason they are so often used in reinforcement learning. We’ll assume that the methods herein are deep RL techniques in that they are all based on deep neural networks.
One way an agent can make decisions is to estimate the expected value of every state it could end up in. The agent would then choose the action that places it in the state with the highest predicted value.
RL values are often discounted to weight nearer rewards more than distant ones. This is because we generally prefer rewards sooner rather than later. Also, this discounting drives the most uncertain rewards (the ones that occur far in the future) close to zero, which dampens the chaotic variance of rewards that can impede reinforcement learning.
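Discounting can be sketched in a few lines; with a discount factor of 0.99, a reward 100 steps in the future contributes barely a third of its face value:

```python
def discounted_return(rewards, gamma=0.99):
    """Weight each reward by gamma**t, so nearer rewards count more."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

near = discounted_return([1.0])                  # an immediate reward keeps full value
far = discounted_return([0.0] * 100 + [1.0])     # a distant reward shrinks toward zero
```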
One complication of this value-based approach is that it needs to work with a dynamics model, which gives the probability of transitioning from one state to another. For instance, on a crowded warehouse floor, a robot might see a high value in moving to a specific spot, but the transition model might say there’s a chance the robot won’t get there. Instead, the robot might choose a closer destination that is lower valued but is a more reliable endpoint.
Similarly, an RL language translator might see a high value in writing a specific word, but its transition model might say that the word can’t be placed in that position without breaking rules of grammar.
Q-learning accomplishes something like value learning, but it doesn’t require a transition model. Instead, it learns to evaluate the quality of actions based only on the state/action combinations that it has available to it. These states and actions are already visible to the agent, so the agent only needs to learn how to score its actions across different states.
Also, Q-learning can store past episodes and reuse them to enhance its learning. This is similar to training multiple epochs in supervised learning.
A potential disadvantage of Q-learning is that it’s not solving for the target we ultimately care about. Q-learning minimizes an error term based on the Bellman equation, and it can be hard to tell how different this is from our ideal training error, which would be based on discounted rewards.
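The Bellman-based update at the heart of Q-learning can be sketched in tabular form (deep Q-learning replaces the table with a neural network; the two-action setting here is hypothetical):

```python
from collections import defaultdict

ACTIONS = (0, 1)

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Move Q(s, a) a small step toward the Bellman target."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)  # greedy value of next state
    target = reward + gamma * best_next                   # Bellman target
    Q[(state, action)] += alpha * (target - Q[(state, action)])

Q = defaultdict(float)
q_update(Q, state=0, action=1, reward=1.0, next_state=1)  # Q(0, 1) moves toward 1.0
```

Note that the error being minimized is the gap between `Q[(state, action)]` and the Bellman `target`, not the discounted rewards themselves.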
One very direct approach to decision making that avoids some of the complications of value-based methods is to focus directly on the policy. This eliminates indirect calculations like value functions and Q-functions, and it establishes an uninterrupted connection between the agent’s choices and its environment.
Policy gradients adjust a neural network’s weights in the direction of positive rewards and away from negative rewards. A policy gradient method needs only the observed reward values as input, not a differentiable reward function, so it can handle discontinuous reward functions or even reward systems whose inner workings are unknowable. Each iteration of policy gradient learning does something like this:
- Have an RL agent attempt its task numerous times. (These are called trajectories in RL as opposed to samples in deep learning.)
- For each trajectory, cache the reward that was received along with the neural network gradient that would increase the probability of taking that action.
- Create a weighted average of all the gradients from step 2 based on the reward that each trajectory received. This weighted average uses the sign and magnitude of the rewards. For instance, positive rewards use gradients with their original signs because this is the direction that increases the chance that those actions will be chosen in the future. In contrast, trajectories with negative rewards will have their gradients multiplied by a negative number to decrease the likelihood of those actions.
- Perform a stochastic gradient update of the neural network weights based on the gradient resulting from step 3.
This is a type of REINFORCE algorithm, and it increases the probability of positive-reward paths while decreasing the odds of negative-reward trajectories. This method doesn’t alter the paths, so an RL agent needs to experience many trajectories to evaluate them.
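The four steps above can be sketched on a toy one-step task, a two-armed bandit, where each “trajectory” is a single action. The reward values, batch size, and learning rate below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                         # logits of a softmax policy over 2 actions
true_reward = np.array([0.0, 1.0])          # action 1 secretly pays better

for _ in range(200):
    probs = np.exp(theta) / np.exp(theta).sum()
    grads, rewards = [], []
    for _ in range(16):                     # step 1: collect trajectories
        a = rng.choice(2, p=probs)
        r = true_reward[a] + rng.normal(0, 0.1)
        grads.append(np.eye(2)[a] - probs)  # step 2: gradient of log prob of action
        rewards.append(r)
    # step 3: reward-weighted average of the cached gradients
    g = np.mean([r * gr for r, gr in zip(rewards, grads)], axis=0)
    theta += 0.5 * g                        # step 4: gradient ascent update

probs = np.exp(theta) / np.exp(theta).sum()  # policy now strongly prefers action 1
```

The expression `np.eye(2)[a] - probs` is the gradient of the log-probability of the chosen action for a softmax policy, which is what step 2 asks us to cache.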
This approach is useful because it handles situations where the reward function can’t be approximated by a neural network. Traditional supervised learning, on the other hand, requires a loss function that is relatively well-conditioned along any path the optimization might take.
We can still use the automatic differentiation features of deep learning libraries like TensorFlow and PyTorch. But in conventional deep learning, we’re taking gradients with respect to the outcome as reflected by the loss function. Here, we’re taking gradients with respect to the probability of action and then combining those gradients based on the observed rewards.
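In standard notation (the symbols here follow common RL references rather than anything defined in this article), that combination of gradients is the score-function estimator:

```latex
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N}
  \left( \sum_{t} \nabla_\theta \log \pi_\theta\!\left(a_{i,t} \mid s_{i,t}\right) \right) R_i
```

where \(\pi_\theta\) is the policy network, \(a_{i,t}\) and \(s_{i,t}\) are the actions and states along trajectory \(i\), and \(R_i\) is the reward that trajectory received.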
Rewards and Advantages
The REINFORCE technique can create opportunities for problems that don’t fit nicely into deep learning’s loss-based gradient descent algorithms. This benefit comes at a cost, however. The REINFORCE approach is often extremely noisy because reward signals can fluctuate wildly while truly informative feedback can be hard to come by.
Therefore, a great deal of thought goes into how we can transform the rewards to achieve a more stable learning process. The reward term in Figure 6 is often described as the advantage, denoted by Â, and it consists of rewards that have been refashioned to reflect how much extra reward the trajectory received over a baseline expectation. The following sections describe some ways the rewards can be modified to make this learning algorithm more robust.
If the reward signals are already in a reasonable range, we can just use the raw rewards to construct our gradient update. But often we need to scale the rewards, such as dividing by their standard deviation, to try to make the learning process smoother.
In addition to rescaling the rewards, some people suggest also subtracting out their mean. This would make it so that above-average rewards contribute their gradients in the positive direction while rewards below the mean supply negative gradients.
Other people say that doing this will switch some positive rewards to negative (and vice versa) and that this is detrimental to training the agent’s decision-making policy. For example, if all the rewards are positive, it might be best to keep them that way and allow the neural network to learn more from the higher rewards than the lower ones.
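Both variants, scale-only and scale-plus-center, can be sketched as follows (an illustrative helper, not a function from any library):

```python
import numpy as np

def standardize(rewards, center=True):
    """Rescale rewards by their standard deviation; optionally
    subtract the mean as well (the debated step)."""
    r = np.asarray(rewards, dtype=float)
    if center:
        r = r - r.mean()                 # below-average rewards turn negative
    return r / (r.std() + 1e-8)          # epsilon guards against zero spread

centered = standardize([1.0, 2.0, 3.0])                    # signs can flip
scaled_only = standardize([1.0, 2.0, 3.0], center=False)   # all stay positive
```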
We can increase the clarity of RL training by paying attention to the temporal structure of the rewards. This means we often want to assign credit to an action based on the rewards received after that action rather than on the total reward of the entire trajectory of actions.
This is an example of how RL training works better when there is accurate credit assignment in matching actions with rewards.
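This “reward-to-go” form of credit assignment can be sketched as:

```python
def rewards_to_go(rewards, gamma=1.0):
    """Credit each step with only the (discounted) rewards that follow it."""
    out, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):   # accumulate from the end backward
        running = rewards[t] + gamma * running
        out[t] = running
    return out

# Only the steps before a reward receive credit for it:
rtg = rewards_to_go([1.0, 2.0, 3.0])
```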
We can smooth out the observed rewards by creating an advantage function, which predicts how each reward differs from the baseline expectation for that situation. The advantage function can take a variety of forms. For instance, the simple average of the observed rewards can be a baseline, and more sophisticated approaches use a value function based on the Bellman equation.
If we do use a value function as the baseline, this policy gradient approach becomes an actor-critic method. An actor-critic RL system uses observed rewards to train a separate neural network (the critic) that then provides reward signals to the agent (the actor). The critic improves its modeling by observing how the actor’s decisions cause the environment to dispense rewards.
An advantage of the actor-critic method is that the critic can supply the actor with generalized estimates of rewards rather than having the agent get whipsawed by lots of random variation in the actual rewards.
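As a sketch of what the critic contributes, a simple one-step advantage compares the reward actually received (plus the critic’s estimate of the next state) against what the critic predicted for the current state. Real systems often use more elaborate estimators; the function below is illustrative:

```python
def td_advantage(reward, value_s, value_next, gamma=0.99, done=False):
    """One-step advantage: how much better the outcome was than the
    critic's prediction for the current state."""
    bootstrap = 0.0 if done else gamma * value_next  # no future value at episode end
    return reward + bootstrap - value_s

# The critic predicted 0.5; the episode ended with reward 1.0: a pleasant surprise.
surprise = td_advantage(1.0, value_s=0.5, value_next=0.0, done=True)
```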
Policy gradient techniques are appealing for deep learning problems where the loss function is discontinuous, undefined in certain areas, or completely unknown. Yet these methods have some drawbacks. For example, once the observed trajectory data is used to update the policy neural network, that data is discarded. This is because that information came from actions dictated by the previous policy, not the neural network that exists after the update. This is called on-policy learning (a policy trained only on data from its own current actions), and it can be inefficient in its use of sample data.
Another disadvantage of policy gradient methods is that it can be difficult to choose a step size for the policy update. If the learning rate is too small, the neural network might not learn enough from positive outcomes, which can be rare. If the step size is too large, a bad update can create a policy that recommends unproductive actions, and the data collected from then on will be useless.
Recent advancements try to retain the strengths of policy gradients while addressing some of the weaknesses.
Trust Region Policy Optimization
If we are only able to use on-policy data once, is there any way we can squeeze more value out of that information? One option is to turn the simple policy gradient update into an optimization. An optimizer has more freedom to adjust the neural network weights to achieve the best solution given the available information. But there’s a problem: when an optimizer sees positive rewards, it will try to push the neural network in that direction infinitely (and the reverse for negative rewards).
Trust region policy optimization (TRPO) changes the update of the policy neural network to an optimization that is constrained to make sure the new policy doesn’t deviate too far from the old one. The idea is that in a region close to the current policy, we’ll trust the optimizer to make any changes it deems necessary to arrive at the best possible policy. But outside of this trust region, we want to restrict the optimizer from making significant changes.
The creators of TRPO use Kullback–Leibler (KL) divergence as a way of constraining this optimization. But the KL measure and other techniques required to make this method practical add a lot of complexity to what was a simple policy gradient update.
Proximal Policy Optimization
A successor to TRPO is proximal policy optimization (PPO), which also uses optimization to extract as much utility as possible out of limited data. In contrast to the constrained optimization of TRPO, PPO uses two trivial computer operations — the minimum and clip functions — to make sure the new, optimized policy isn’t too radically different from the previous one (Figure 7).
These simple modifications to the objective function effectively constrain the optimization by ensuring the optimizer sees no advantage in moving away from a region where we’re almost guaranteed to improve the policy.
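The clipped-surrogate idea can be sketched for a single action. Here `ratio` is the new policy’s probability of the action divided by the old policy’s, and `eps=0.2` follows the PPO paper’s default:

```python
def ppo_objective(ratio, advantage, eps=0.2):
    """min(ratio * A, clip(ratio) * A): the optimizer gains nothing by
    pushing the probability ratio outside [1 - eps, 1 + eps]."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

# For a positive advantage, ratios beyond 1.2 earn no extra objective:
inside = ppo_objective(1.2, advantage=1.0)
outside = ppo_objective(1.5, advantage=1.0)   # same value as inside
```

The outer `min` also makes the objective pessimistic for negative advantages, so the optimizer never benefits from straying far from the old policy in either direction.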
The result is that we’ve added some sophistication to our training of the agent’s policy, but we’ve avoided the complicated optimization procedures of TRPO.
Innovations like PPO can make the strengths of reinforcement learning, like its robustness and exploratory nature, more applicable to problems that aren’t suitable for standard supervised learning.
At the heart of RL lies the idea of exploration, which permits an RL agent to perform actions that might not give the best immediate reward, but which could create possibilities for higher rewards in the future.
A basic exploration policy is called epsilon-greedy (ε-greedy). This is when an agent mostly takes actions that have the highest expected value (acting greedily), but occasionally (with a probability of ε) explores random actions.
This ε-greedy exploration can be tapered so that the probability of a random action decreases through time as the agent comes to experience most of the useful states.
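A sketch of ε-greedy selection with a tapered ε (the decay rate and the 0.05 floor below are arbitrary illustrative choices):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon explore randomly; otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])    # exploit

random.seed(0)
epsilon = 1.0
for step in range(100):
    action = epsilon_greedy([0.1, 0.9], epsilon)
    epsilon = max(0.05, epsilon * 0.95)       # taper exploration over time
```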
Another philosophy of RL exploration is optimism in the face of uncertainty, which encourages an agent to perform actions that have more unpredictability in their estimated value. This exploration policy will diminish as the agent develops more certainty about which actions result in high rewards.
RL research is advancing at a staggering pace. Here are some key trends to watch.
One of the most successful neural network architectures of all time is the LSTM with attention. In the past, machine learning language translation was often done by first using a neural network to encode a sentence into a collection of numbers. Then, a second neural network would decode those numbers into a sentence in the target language.
But it’s unrealistic to expect that a vector of, say, 1,024 numbers can contain all the linguistic information of any sentence it could come across. The LSTM with attention is a recurrent neural network that has the ability to focus on a few words at a time as it works its way through a text (Figure 8).
RL has a similar concept called hierarchical reinforcement learning, which divides a complex task into specialized sub-policies that are organized by a master policy. This could enable RL systems to accomplish goals that are too complicated to achieve with a single policy.
Learning from Hindsight
One of the major challenges in training an RL agent is that the agent might rarely be successful at a task if it is just exploring on its own through random behavior. One way to speed up learning is hindsight experience replay, which retroactively changes an agent’s goal to match what it actually did.
For instance, if we tell an agent to create dialogue in the style of William Shakespeare, but it instead writes something from J. K. Rowling, we can turn this failure into a valuable training sample by changing the original instructions and evaluating the behavior as a positive trajectory.
This is one way to handle the sparse rewards of reinforcement learning. This approach takes success wherever it can find it and trains the RL agent to learn from those examples.
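A schematic of the relabeling trick (the data layout below is invented for illustration; real hindsight experience replay operates on goal-conditioned state transitions):

```python
def hindsight_relabel(trajectory, achieved_goal):
    """Rewrite a failed episode as a success for the goal it actually reached."""
    relabeled = [(state, action, achieved_goal)
                 for state, action, _old_goal in trajectory]
    return relabeled, 1.0                    # positive reward under the new goal

# Asked for Shakespeare, produced Rowling: relabel and keep the sample.
episode = [("prompt", "generate_text", "shakespeare")]
new_episode, reward = hindsight_relabel(episode, "rowling")
```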
Learning Through Imitation
Agents learn through rewards, which give feedback to an intelligent RL system as to which actions are best in a specific setting. But there can be circumstances where it is nearly impossible to construct a coherent reward function. One example is the hypothetical language evaluation score described earlier: Language is an art, and it would be difficult to create one function that could score all possible variations in writing style.
An intriguing option is inverse reinforcement learning, in which an RL neural network learns what actions are desirable by watching expert examples. In this way, the RL system creates a sophisticated reward function that it can then use to evaluate its own behavior when it begins acting on its own.
Examples of this might be someone moving a robot’s arms to show it how to stack dishes, or giving an RL text system examples of high-quality writing.
One way to train by example could be to incorporate generative adversarial networks (GANs) into reinforcement learning. A GAN consists of two neural networks: a generator, which creates realistic information, and a discriminator, which tries to distinguish between the generator’s fake data and a genuine data set.
In inverse RL, the real data consists of expert behavior, the generator is the RL agent that tries to act like the expert, and the discriminator is a learned reward function that judges whether the agent’s actions are indistinguishable from the expert’s. Incorporating GANs into reinforcement learning could be a way to provide RL agents with a tremendous amount of intelligence before they begin their own exploration.
Learning by Simulation
It can take hundreds, thousands, or millions of tries for an agent to create a successful sequence of actions. One way to quicken this process is to let the agent explore a simulated environment. An example of this is Unity’s efforts to apply their 3D gaming engine to autonomous vehicle simulation.
Simulation programs allow an RL actor to explore new possibilities in a safe environment. Moreover, these simulations don’t have to work on the wall-clock time that human beings experience. Instead, they can run in accelerated simulator time that can even be bolstered by parallel computing. A simulator can therefore provide an RL agent with countless experimental trajectories in a short amount of time, greatly facilitating an agent’s attempts to learn difficult tasks.
In some situations, these simulators could prove to be just as valuable as the math behind reinforcement learning, and we can expect to see the development of a large industry devoted to creating realistic learning environments.
One of the things that make deep RL so appealing is that we don’t have to have a continuous derivative between the inputs and the loss function. For some difficult problems, we might want to go further than the policy gradient methods and train an RL agent in a way where there is no gradient calculation whatsoever.
Derivative-free optimization and evolutionary methods allow an RL agent to learn purely through trial and error, without any mathematical evaluation of its policy neural network. For instance, in one method, an RL agent’s policy parameters are drawn randomly from a multivariate distribution, and the top-performing outcomes are used to adjust the properties of this distribution. Repeating this process many times can lead to a distribution of neural network weights that gives an effective policy.
A disadvantage of this evolutionary approach is that it might require many more sample trajectories than gradient-based methods for the agent to learn a successful policy. These evolutionary techniques might therefore be best suited for environments that have very realistic and efficient simulations.
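One such method, the cross-entropy method, can be sketched as follows. The fitness function here is a stand-in for trajectory rewards, and the optimum at (3, −1) is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(w):
    """Stand-in for 'total reward of a policy with parameters w'."""
    return -np.sum((w - np.array([3.0, -1.0])) ** 2)

mean, std = np.zeros(2), np.full(2, 2.0)    # search distribution over parameters
for _ in range(30):
    samples = rng.normal(mean, std, size=(50, 2))     # draw candidate policies
    scores = np.array([fitness(w) for w in samples])
    elite = samples[np.argsort(scores)[-10:]]         # keep the top performers
    elite_mean, elite_std = elite.mean(axis=0), elite.std(axis=0)
    mean, std = elite_mean, elite_std + 1e-3          # refit the distribution

# The distribution's mean drifts toward the high-reward region near (3, -1).
```

No gradient of the policy is ever computed; the distribution simply tightens around whatever parameters score well.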
Artificial intelligence encompasses a constellation of fields such as machine learning, deep learning, robotics, natural language processing, and many other specializations. Each area works to solve challenging problems using its own techniques and terminology, sometimes isolated from the rest of the activity in the AI universe.
Finding ways to merge developing technologies across disciplines could pave the way for even faster AI development than what we’re already seeing. One promising avenue is applying robot-like exploration to tasks that don’t quite fit into the standard deep learning training algorithms. In cases like these, reinforcement learning discovery methods could enable deep learning systems to acquire skills that would be difficult to train with the techniques that are useful in other areas like object recognition.
Reinforcement learning has many qualities that can help us find solutions to complex challenges. Robots learn how to stand up, walk, run, and move boxes. If they can do that on their own, certainly we can use robot-style learning to accomplish great things in areas like human language.
References

Bellman equation
Ronald J. Williams, “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning,” Machine Learning, 1992
Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour, “Policy Gradient Methods for Reinforcement Learning with Function Approximation,” 1999
John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel, “Trust Region Policy Optimization,” February 19, 2015
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov, “Proximal Policy Optimization Algorithms,” July 20, 2017
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” September 1, 2014
Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba, “Hindsight Experience Replay,” July 5, 2017
Nir Baram, Oron Anschel, Itai Caspi, and Shie Mannor, “End-to-End Differentiable Adversarial Imitation Learning,” Proceedings of the 34th International Conference on Machine Learning, PMLR 70:390–399, August 6–11, 2017
Jose De Oliveira and Rick Duong, November 14, 2018
Reinforcement Learning Course
Deep RL Bootcamp, University of California Berkeley
© Matthew Hergott