Blog: The Almighty Policy Gradient in Reinforcement Learning
A simple, step-by-step explanation of the concept of policy gradients and how they fit into reinforcement learning. Maybe too simple.
In this post, I explain how we derive the fundamental policy gradient theorem. Why should you know something about the policy gradient? Knowing how the policy gradient is defined gives you the tools to understand 90% of the reinforcement learning algorithms and papers out there. Papers that do stuff like this:
So if you are interested in reinforcement learning, I highly recommend that you engage with the subject of the policy gradient, because it opens millions of doors. Here is the mantra:
If I know where the policy gradient comes from, I can understand most of reinforcement learning.
It is important to differentiate between policy gradient algorithms (which use the policy gradient) and the policy gradient itself, which we can define through simple math. More often than not, I see people confuse these two concepts, so I felt the need to state it plainly at the beginning:
The policy gradient is the basis for policy gradient reinforcement learning algorithms
Now, there are also other kinds of reinforcement learning algorithms that have nothing to do with the policy gradient. But policy gradient algorithms are cool because we get an actor (policy) directly from our optimization.
To build a house, you need to start with the foundation.
What is our goal in reinforcement learning? The goal is to maximize the return J of our policy π. Naturally, our policy is parametrized by some parameters, which we will denote θ (theta) here. So, to write it down plainly, the goal is this:
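In symbols, this objective is the standard way of writing it (with J(θ) denoting the return of the policy with parameters θ):

```latex
\theta^{\ast} = \arg\max_{\theta} \, J(\theta)
```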
We omit writing π since we stated that π is parametrized by θ; this is just a writing convention. The equation states that we want the parameters that maximize the return. Duh! Nothing serious yet. Note that this formulation is universal in reinforcement learning: we can pack all sorts of things into the parameters that we want to optimize. These parameters could also refer to the parameters of a Q-value function, in which case we would be looking at value-based methods, but in this article we concentrate on the policy gradient. Now that we know what the return of the policy is, we can start talking about it in a more meaningful way.
First, it is important to note that in reinforcement learning we deal with Markov Decision Processes or Partially Observable Markov Decision Processes, from now on denoted MDPs and POMDPs. MDPs and POMDPs are a nice framework for formalizing dynamical systems. Basically, we can define each transition in our system as follows:
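A common way to write a single transition, with s for states and a for actions, is:

```latex
s_{t+1} \sim p\left(s_{t+1} \mid s_t, a_t\right)
```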
Put plainly, this is the probability of the next state given the current state and action. It also tells us that MDPs are stochastic: we might not be certain which state we are going to end up in at the next step. Notice that there is only a notion of one-step history here; the new state only cares about the previous state and the action. This is also called the Markov property or the Markov assumption, and it holds for MDPs. In POMDPs, the underlying state is not directly observable (the agent only receives observations of it), which makes them more difficult but also more universal. So, let's say it loud and clear:
The Markov assumption states that the next state depends only on the current state and action.
Now that we understand where MDPs are coming from and have a framework for looking at environments (dynamical systems), we shift back to defining the policy gradient. By now we know what the return of the policy is and how to write transitions in the environment. Basically, we can expect to have data from many different situations with different starting positions (trajectories). What is the return of a trajectory τ (tau)? Well, it is the sum of all the rewards in the trajectory, very simple:
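For a trajectory of length T, this reads:

```latex
R(\tau) = \sum_{t=0}^{T} r(s_t, a_t)
```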
This r function is the step reward that you would normally receive from your environment. I have used a sum here, so we assume that the environment is discrete in time steps (not in states and actions, though). Sometimes we add a discount factor between 0 and 1 to this equation to make an infinite sum finite, but that is not strictly necessary with finite trajectories. It is important to know that you may or may not know what this reward function looks like analytically; this is the beauty of reinforcement learning.
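With such a discount factor γ ∈ [0, 1), the discounted return would read:

```latex
R(\tau) = \sum_{t=0}^{\infty} \gamma^{t} \, r(s_t, a_t)
```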
Nice, now we know what the return of a trajectory is. But we also know that our environment is stochastic! Therefore our trajectory becomes a random variable (regardless of whether the policy itself is stochastic or not). Hence, our return becomes a random variable too! So what do we maximize when we maximize a random variable? We maximize the expectation of this random variable, or in our case, the policy's expected return:
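Writing ρ_θ for the trajectory distribution induced by the policy and the environment, the objective becomes:

```latex
J(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}\left[ R(\tau) \right] = \int p(\tau \mid \theta) \, R(\tau) \, d\tau
```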
So the expectation is just a weighted sum over all the possible realizations of the random variable, where the weights are the probabilities of each realization happening. This makes perfect sense: you want to give more weight to what is more probable, so the returns of more probable trajectories carry more weight in our expectation (perfect sense!). For those of you having problems with the calculus, do not get scared of the integral; it is basically a sum over the expression inside it, taken over all values of a continuous variable (an infinite number of values). The continuous variable is our trajectory in this case.
Notice another thing: the trajectory probability is a conditional probability, conditioned on the policy parameters. This is because the actions depend on the policy, and the next state depends on the current state and action; hence the trajectory depends on the parameters and the environment dynamics. Voilà! Perfect sense. This is how we can write out this conditional:
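Factoring the trajectory into its initial state, policy, and transition terms gives:

```latex
p(\tau \mid \theta) = p(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)
```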
We can only write this conditional probability as the above because of the Markov assumption. That is why it distills to the nice product above that is easy to work with. So far so good… “Autonomous” robots are waiting.
Finally the juicy part!
Ok, now we understand the conditional probability of the trajectory, what is a trajectory, what is the trajectory return, what is the policy’s return, what is the expected policy return… Now we can actually start talking about the optimization part and kick the math up a notch.
So, we want to maximize our return, and we want to do it by gradient ascent (or, equivalently, gradient descent on the negated return). We need the gradient of the policy's return with respect to the policy's parameters θ.
In this situation, the gradient can be moved inside the integral, so we come to the following equation:
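That is:

```latex
\nabla_\theta J(\theta) = \nabla_\theta \int p(\tau \mid \theta) \, R(\tau) \, d\tau = \int \nabla_\theta \, p(\tau \mid \theta) \, R(\tau) \, d\tau
```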
Alright, so the return depends on the resulting trajectory but not on the parameters directly. We only need to worry about the conditional probability term in the integral. Let’s write it out based on what we have written so far:
Taking the gradient of a product is really ugly; we don't want that, we want something nicer: a sum. What can we do? Some of you may have heard of the log-derivative trick. The insight is the following (and yes, it is really that simple) and is based solely on how you take the gradient of a logarithm:
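The trick is just the chain rule applied to the logarithm, rearranged:

```latex
\nabla_\theta \log p(\tau \mid \theta) = \frac{\nabla_\theta \, p(\tau \mid \theta)}{p(\tau \mid \theta)}
\;\;\Longrightarrow\;\;
\nabla_\theta \, p(\tau \mid \theta) = p(\tau \mid \theta) \, \nabla_\theta \log p(\tau \mid \theta)
```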
Going back to our policy gradient. Now we can write it as the following:
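Substituting the log-derivative trick into the integral gives:

```latex
\nabla_\theta J(\theta) = \int p(\tau \mid \theta) \, \nabla_\theta \log p(\tau \mid \theta) \, R(\tau) \, d\tau
```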
We expand the term in the logarithm; note that the result is that we can write the product as a sum of logarithms (the typical product rule for logarithms):
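Applied to the trajectory probability from before, the logarithm splits into:

```latex
\log p(\tau \mid \theta) = \log p(s_0) + \sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{T} \log p(s_{t+1} \mid s_t, a_t)
```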
Based on standard calculus, the gradient operator slips into the sum, the terms that do not depend on the parameters vanish, and only the policy log-likelihood remains:
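Since the initial-state and transition terms do not depend on θ, we are left with:

```latex
\nabla_\theta \log p(\tau \mid \theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
```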
The result is an expectation over trajectories. Now we can write the glorious end result, or otherwise called the policy gradient theorem:
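Putting everything together, the policy gradient theorem reads:

```latex
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}\left[ \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) R(\tau) \right]
```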
The Greek letter rho (ρ) under the expectation denotes the trajectory distribution resulting from our policy and the environment dynamics, with respect to which we calculate our expectation. This distribution can cause a lot of headaches, since in order to get good estimates of the expectation we need samples that represent the distribution really well. This is exactly where off-policy policy gradient algorithms fall short: they use data from different behavioral policies, which induce a different distribution and result in a bad estimate of the policy gradient. Coincidentally, I have written an article about this: The False Promise of Off-Policy Reinforcement Learning Algorithms. Read it if you are interested in more detail.
We cannot compute this expectation directly; it would require us to compute all possible trajectories for the given policy. What we can do is estimate it with Monte Carlo sampling. This is just a fancy way of saying that we sample some trajectories with the given policy and take the average as an estimate of the expectation. We can write it as follows, with a slight abuse of notation (M indicating the number of trajectories that we use for the estimate):
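With superscripts indexing the sampled trajectories, the Monte Carlo estimate is:

```latex
\nabla_\theta J(\theta) \approx \frac{1}{M} \sum_{m=1}^{M} \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta\!\left(a_t^{(m)} \mid s_t^{(m)}\right) \right) R\!\left(\tau^{(m)}\right)
```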
And that is basically all there is to it. This estimate naturally gets better the more samples we have and, of course, the truer the samples stay to the distribution induced by the current policy that we are optimizing. Put in other words: is the distribution induced by our current policy similar to the distribution that the samples are drawn from? To alleviate this problem, researchers have employed clever tricks such as importance sampling, surrogate loss functions, and so on, which I won't cover here; perhaps another time.
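To make the estimator concrete, here is a minimal sketch in NumPy. Everything here is a toy of my own construction, not from the derivation itself: a hypothetical one-state environment with two actions where action 1 gives reward 1, and a softmax policy over two parameters. The point is only to show the Monte Carlo estimate (average over M trajectories of the trajectory return times the summed score function):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over the parameter vector.
    z = np.exp(x - x.max())
    return z / z.sum()

def sample_trajectory(theta, horizon=5):
    """Toy environment: one state, two actions; action 1 yields reward 1."""
    probs = softmax(theta)
    actions = rng.choice(2, size=horizon, p=probs)
    rewards = (actions == 1).astype(float)
    return actions, rewards

def grad_log_pi(theta, a):
    # Gradient of log softmax(theta)[a]: one-hot(a) minus the probabilities.
    g = -softmax(theta)
    g[a] += 1.0
    return g

def policy_gradient_estimate(theta, M=1000):
    """Monte Carlo estimate: (1/M) * sum_m R(tau_m) * sum_t grad log pi(a_t)."""
    grad = np.zeros_like(theta)
    for _ in range(M):
        actions, rewards = sample_trajectory(theta)
        R = rewards.sum()                                  # trajectory return
        score = sum(grad_log_pi(theta, a) for a in actions)  # summed score function
        grad += R * score
    return grad / M

theta = np.zeros(2)  # uniform policy to start
g = policy_gradient_estimate(theta)
print(g)  # the estimate pushes probability mass toward the rewarding action 1
```

Stepping `theta += learning_rate * g` in a loop would give you the most basic on-policy policy gradient method (essentially REINFORCE without a baseline); real implementations add variance-reduction tricks on top of exactly this estimator.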
Now that we have finally come to the end of the post, looking back at it… I believe that, if you understood all of the concepts, you now have a (very) solid foundation for understanding policy gradient algorithms, and you can attempt to digest a few reinforcement learning papers as well! As a next read, perhaps you want to know why off-policy reinforcement learning doesn't work well:
Hope it was interesting and intuitive! Good job if you understood everything; if not, not a biggie, most people don't get it at first. Just work through it again. Now that you have the tools, the reinforcement learning universe is open… Have fun!