### Blog: Key Learnings on Implementing PPO for Unity Crawler Environment

### About the Environment

The Unity Crawler environment runs 12 identical agents in parallel, so that more experience can be collected in the same amount of running time.

The agent is a simulated creature with 4 arms, each with 2 joints. At each step, the environment outputs a continuous 129-dimensional state vector. The agent evaluates the state and then outputs an action vector of 20 continuous values. The agent is rewarded for moving in the goal direction set by the environment. Specifically, it receives a reward of +0.03 times its body velocity in the goal direction plus +0.01 times the alignment of its body direction with the goal direction.
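As a small illustration of the reward described above, here is a sketch of the per-step arithmetic. The function name and the two input names are my own placeholders, not identifiers from the Unity environment, and the environment computes the reward internally.

```python
def crawler_step_reward(velocity_toward_goal: float, facing_alignment: float) -> float:
    """Per-step reward as described in the text (hypothetical helper).

    velocity_toward_goal: body velocity measured along the goal direction.
    facing_alignment: how well the body direction lines up with the goal direction.
    """
    return 0.03 * velocity_toward_goal + 0.01 * facing_alignment


# Example: moving at 10 units along the goal direction while fully aligned.
print(crawler_step_reward(10.0, 1.0))
```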

### PPO

The Proximal Policy Optimisation (PPO) algorithm was chosen for this task. Here is a summary of the key characteristics of PPO:

**PPO is an on-policy (online) learning method**. The whole idea of PPO rests on the assumption that the currently running policy is not too different from the policy that was running when the trajectory data was collected, so previously collected trajectory data can be ‘reused’ for the model update. To achieve this, the “memory” has to be modified so that only the *most recent* experiences are stored. In particular, the memory is kept small and is purged once a learning step has happened. Note that this differs from the usual DDPG memory handling, where the memory (replay buffer) is usually much larger and DDPG has a higher tolerance for older experiences.
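The memory handling described above can be sketched roughly as follows. This is a minimal illustration, not the code used for this project: the class name `RolloutMemory` and its fields are assumptions made for the example.

```python
class RolloutMemory:
    """Short-lived storage for the most recent experiences only (hypothetical sketch)."""

    def __init__(self):
        self.transitions = []

    def add(self, state, action, log_prob, reward, value, done):
        # Store everything the PPO update needs for one step.
        self.transitions.append((state, action, log_prob, reward, value, done))

    def __len__(self):
        return len(self.transitions)

    def drain(self):
        # Hand the collected batch to the learner and purge the memory,
        # so the next update never sees data from an older policy.
        batch = self.transitions
        self.transitions = []
        return batch


memory = RolloutMemory()
for step in range(3):
    memory.add(state=[0.0], action=[0.1], log_prob=-1.0, reward=0.5, value=0.2, done=False)
batch = memory.drain()
print(len(batch), len(memory))
```

Unlike a DDPG replay buffer, nothing survives a learning step here: the buffer is emptied every time it is read.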

**PPO works on the probability of outputting an action given a state, so it does not fit directly with environments that require continuous actions.** Usually, for an environment which requires continuous actions, a deterministic approach is used where the model directly outputs the recommended actions. However, since PPO needs the probability of outputting an action (given a state) to update the surrogate function, we need the model to *output both the probability and the action itself*. To work around this, the network first outputs a continuous action. That action is then used as the mean of a probability distribution, and a ‘resampled action’ is drawn from that distribution. The distribution gives us the probability of producing a particular action, which is used in the PPO calculation, and at the same time it yields a continuous action.
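Here is a minimal sketch of the ‘resampling’ idea, assuming the distribution is a Gaussian with an independent standard deviation per action dimension (the text above does not say which distribution was used). The function names are hypothetical, and a real implementation would normally do this with tensors in a deep learning framework rather than plain Python lists.

```python
import math
import random


def sample_action(mean, std, rng):
    """Draw a 'resampled action' around the network output (used as the mean)."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


def log_prob(action, mean, std):
    """Log-probability of the action under the diagonal Gaussian, summed over dimensions."""
    total = 0.0
    for a, m, s in zip(action, mean, std):
        total += -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
    return total


rng = random.Random(0)
mean = [0.0] * 20   # stand-in for the network output (20 action dimensions)
std = [0.5] * 20    # assumed fixed spread; in practice this is often learned
action = sample_action(mean, std, rng)
print(len(action), log_prob(action, mean, std))
```

The log-probability is stored together with the action, so that the probability ratio between the new and old policy can be computed later during the update.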

**The surrogate function is clipped** so that the model only updates within a certain ‘safe’ range of the probability ratio. This keeps a single update from pushing the policy so far that it falls off a ‘gradient cliff’, into a region where the gradient is very close to zero and the model has a hard time updating itself back.
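The clipping can be written down for a single sample as below. This is a sketch of the clipped objective from the PPO paper; the function name is a placeholder, and `epsilon=0.2` is the paper's commonly quoted value, not necessarily the one used here.

```python
import math


def clipped_surrogate(new_log_prob, old_log_prob, advantage, epsilon=0.2):
    """Clipped surrogate objective for one sample (to be maximised)."""
    # Probability ratio between the current policy and the policy that collected the data.
    ratio = math.exp(new_log_prob - old_log_prob)
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes any incentive to move the ratio outside the safe range.
    return min(ratio * advantage, clipped_ratio * advantage)


# A ratio of 1.5 with a positive advantage is capped at 1 + epsilon.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))
```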

**An entropy term is added for exploration**. Since we are dealing with probabilities, we cannot directly add values to the actions to create noise and encourage exploration, as we used to do with DDPG. To allow the model to explore more, an *entropy bonus* is added to the loss function, which basically changes the probability distribution and hence allows a higher chance of sampling exploratory actions. Here is a very good blog post summarizing how the entropy term can help.
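To make the entropy bonus concrete, here is a sketch that again assumes a Gaussian policy with independent dimensions. The weight `beta` and the function names are assumptions for the example, not values taken from this project.

```python
import math


def gaussian_entropy(std):
    """Entropy of a diagonal Gaussian: it grows as the distribution gets wider."""
    return sum(0.5 * math.log(2 * math.pi * math.e * s * s) for s in std)


def policy_loss(surrogate, entropy, beta=0.01):
    """Loss to minimise: negative of (surrogate objective + weighted entropy bonus)."""
    return -(surrogate + beta * entropy)


narrow = gaussian_entropy([0.1] * 20)
wide = gaussian_entropy([1.0] * 20)
# For the same surrogate value, the wider (more exploratory) policy gets the lower loss.
print(policy_loss(1.0, narrow), policy_loss(1.0, wide))
```

Because the bonus rewards a wider distribution, the optimiser is discouraged from collapsing the policy onto a single action too early.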

**Adjusting the Time Horizon (T_MAX) parameter**. In the PPO paper, it is mentioned that the algorithm only collects trajectory data up to a certain time-step T_MAX. This T_MAX is usually much smaller than the actual number of steps needed to complete an episode BUT is long enough to include all the important situations facing the agent. For example, in this case a T_MAX of 512 was chosen, while most of the time the agent requires around 1000 steps to complete an episode. The idea here is that in some environments, particularly continuing environments where new data keeps coming in (a stock price estimation environment, for example), there may be no definitive end to an episode, and an episode could run forever. The T_MAX cut-off is there to handle situations like this.

**The advantage function is used to evaluate the attractiveness of taking a particular action given a state**. It is needed for 2 reasons. First, we need a way to distinguish a ‘good’ action from a ‘bad’ one. Second, we need an estimator to approximate the value of a state when T_MAX is reached. Here the advantage function is the NET value of taking an action: the sum of the discounted actual rewards of the trajectory after the action is taken, minus the estimated value of the state in which the action was taken (time-step 0 of that sum). If the trajectory goes beyond T_MAX, the value of the last state at time-step T_MAX is estimated and used in place of the rewards that were not collected. Note that the role of the critic network here is to make sure these value estimates are as close to the actual values as possible.
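The advantage calculation described above can be sketched as follows. This is the plain discounted-return version with a bootstrapped value at the cut-off, written as an illustration. It is not Generalized Advantage Estimation, and the function name, argument names, and `gamma` value are assumptions for the example.

```python
def advantages_with_bootstrap(rewards, values, last_value, dones, gamma=0.99):
    """Discounted return from each step minus the critic's value of that step's state.

    rewards, values, dones: per-step lists for one agent, up to T_MAX steps.
    last_value: critic's estimate for the state reached at the cut-off,
                standing in for the rewards that were never collected.
    """
    advantages = [0.0] * len(rewards)
    running_return = last_value
    for t in reversed(range(len(rewards))):
        if dones[t]:
            # The episode really ended here, so nothing follows this step.
            running_return = 0.0
        running_return = rewards[t] + gamma * running_return
        advantages[t] = running_return - values[t]
    return advantages


# Two steps, cut off before the episode ends; the critic values the next state at 10.
print(advantages_with_bootstrap([1.0, 1.0], [0.0, 0.0], 10.0, [False, False], gamma=0.5))
```

A positive entry means the action led to a better outcome than the critic expected from that state, and a negative entry means it did worse.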

### Hyperparameters summary:

### The Training Result:

After soooooo many rounds of trial and error, I am glad that I was finally able to reach an average score of over 2000 (per episode) across all 12 agents at around episode 453.

Questions? Please leave a comment or send me an email at samuelpun@gmail.com