Blog: Contemporary Classification of Machine Learning Techniques (Part 3)
In this third installment of the series titled Learning Machine Learning, we will be discussing the final class of machine learning techniques: reinforcement learning. If you would like to read my take on supervised / unsupervised learning, you can read it here and here.
Contemporary Classification of Machine Learning Algorithms
As mentioned, today’s machine learning algorithms can be divided into three classes: supervised learning, unsupervised learning and reinforcement learning. This classification reflects the way each family of algorithms learns its model. In particular:
- Supervised learning: learns the model by example
- Unsupervised learning: learns the model from data
- Reinforcement learning: learns the model by experience
Reinforcement Learning: Learning by Experience
During our earlier discussions, we mentioned that supervised learning works with labelled data whereas unsupervised learning works with unlabelled data.
You can consider reinforcement learning (RL) to be different because you do not start with any data. Rather, you begin with an understanding of the problem you are trying to solve. For instance, if you are looking to solve a navigational problem of getting from Point A to Point B using reinforcement learning, you will have to determine the following:
- The agent
- The reward the agent obtains by getting to Point B. This may involve having various intermediary reward milestones between Point A and Point B.
- The environment (state) in which Points A and B reside, and how the agent observes that environment.
- The ways an agent can act within the environment to get from Point A to Point B (e.g. move up, down, left or right).
In this navigational example, the agent resides within an environment and its objective is to get from Point A to Point B. The agent achieves this objective by taking actions. These actions may result in rewards that motivate the agent when it makes the correct choices. They may also change the state of the environment. Finally, this new environment state is observed by the agent, which then decides the next best action to take. This is summarized by the image below.
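This observe-act-reward loop can be sketched in plain Python. The toy one-dimensional environment below is purely illustrative and not from any library:

```python
# A toy sketch of the agent-environment loop described above.
# The environment and the agent's "policy" are hypothetical stand-ins.

class LineEnvironment:
    """Point A is position 0; Point B is position 3 on a line."""
    def __init__(self):
        self.position = 0  # start at Point A

    def step(self, action):
        # action: +1 (move right) or -1 (move left)
        self.position = max(0, self.position + action)
        reached_goal = self.position == 3       # Point B
        reward = 1 if reached_goal else 0       # reward only at Point B
        return self.position, reward, reached_goal

env = LineEnvironment()
state, done, total_reward = 0, False, 0
while not done:
    action = 1                                  # a trivial "policy": always move right
    state, reward, done = env.step(action)      # act on the environment, observe new state
    total_reward += reward                      # agent accumulates rewards
print(total_reward)  # prints 1
```

The agent here is trivial, but the loop structure (observe, act, receive reward, repeat) is the same one every RL algorithm follows.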
Machine Learning with no Data
Let us use this routing example as a simple illustration of how RL might work. We start with no data, but the agent generates multiple routing possibilities for getting from Point A to Point B through trial and error. Assume that the actions we can take are to move up, down, left or right:
The reward is given when Point B is reached. Assume also in this situation, the agent can only take four steps before the game is over. The agent then does a series of actions which might look like:
What then should be the result of row 3? Well it really depends. If we assume that the agent is in a maze with only three horizontal squares and the state of the environment before Action 1 is always the following:
Then it becomes clear that what matters most is that the agent moves to the right twice in order to reach Point B (whether the agent moves up or down becomes irrelevant). The actions taken for ID 3 lead the agent to Point B.
Suppose instead we have an alternative maze environment such as this:
Then what matters most is that the agent travels down once and right twice in order to get to Point B. The actions taken for ID 3 lead the agent to Point B.
In another alternative environment, suppose we now have areas in the maze that the agent is unable to travel to (darkened areas). In this case, ID 0 and ID 2 navigate the agent from Point A to Point B, but ID 3 does not. (The agent would have to first move to the right before moving down.)
In RL problems, the agent does not know the environment in advance. The challenge in RL is therefore how to learn the next best action to take in order to maximize rewards. This is done by generating data through an iterative process of trial and error, in which the agent observes the environment and takes the next best action.
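As a sketch of this trial-and-error data generation, here is a toy 2x3 maze (the maze layout, four-step limit and reward are assumptions for illustration) where random action sequences are scored by whether they reach Point B:

```python
import random

# Toy illustration of generating routing data by trial and error in a
# 2x3 maze: Point A at (0, 0), Point B at (0, 2), four moves per episode.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
ROWS, COLS = 2, 3

def run_episode(moves):
    row, col = 0, 0                               # start at Point A
    for move in moves:
        dr, dc = ACTIONS[move]
        row = min(max(row + dr, 0), ROWS - 1)     # walls keep the agent inside
        col = min(max(col + dc, 0), COLS - 1)
        if (row, col) == (0, 2):                  # reached Point B
            return 1                              # reward
    return 0                                      # no reward: game over after 4 steps

random.seed(0)
episodes = [[random.choice(list(ACTIONS)) for _ in range(4)] for _ in range(20)]
rewards = [run_episode(ep) for ep in episodes]
# Each (episode, reward) pair is a row of generated training data.
```

Every episode, successful or not, becomes a data point the agent can learn from.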
There are two main methods for learning from this generated data: Q-Learning (value-based learning) and Policy Gradients. To explain these two methods, I will be solving an introductory game from OpenAI Gym known as CartPole-v0. My approach will be to explain these methods from a programming perspective, and hopefully from there we gain the intuition for both methods.
A pole is attached to a cart. When a game episode starts, the pole is upright. At each timepoint, the cart is required to move either to the left or to the right, and moving the cart affects the pole. The goal is to keep the pole within a certain angle for an average of more than 195 timepoints over 100 consecutive episodes. The intuition is that if the RL algorithm is able to balance the pole for that long, it has figured out how to properly play the game.
Reward: The agent gains a reward of +1 every time the pole is kept upright.
Actions: The cart can only move to the left or right.
Observations: The agent observes the environment through four values: the cart position, the cart velocity, the pole angle, and the pole velocity at its tip.
Further details can be found here.
The intuition for Q-Learning is the following: given certain observations, what is the expected reward for each action the agent can take? Hence Q-Learning is akin to building a huge lookup table where, at every single observational point, the table tells you the expected reward for taking each of the possible actions. Let’s break this down into five details:
Detail 1: Main loop
At its core, the main loop for Q-learning goes simply like this:
- Pick the next best action to take
- Take the action in the environment
- Observe / discretize the new state in the environment
- Update the Q-tables
- Repeat until done
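The five steps above can be sketched as follows. The full CartPole gist is linked further below, so this is a hedged, self-contained stand-in that runs Q-learning on a toy five-state corridor instead (all hyperparameters here are illustrative assumptions):

```python
import random

# A minimal Q-learning main loop on a toy five-state corridor (a stand-in
# for CartPole to keep the sketch self-contained): the agent starts in
# state 0 and receives a reward of 1 on reaching state 4.
N_STATES, N_ACTIONS = 5, 2            # actions: 0 = move left, 1 = move right
q_table = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1
random.seed(0)

def greedy(state):
    # best known action, breaking ties randomly
    best = max(q_table[state])
    return random.choice([a for a in range(N_ACTIONS) if q_table[state][a] == best])

for episode in range(300):
    state = 0
    for step in range(50):                        # cap the episode length
        # 1. pick the next best action to take (epsilon-greedy)
        action = random.randrange(N_ACTIONS) if random.random() < epsilon else greedy(state)
        # 2. take the action in the environment
        next_state = min(state + 1, 4) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == 4 else 0.0
        # 3. observe the new state and 4. update the Q-table
        target = reward + gamma * max(q_table[next_state])
        q_table[state][action] += alpha * (target - q_table[state][action])
        state = next_state
        if state == 4:                            # 5. repeat until done
            break
# After training, the greedy action in every non-goal state is "right".
```

The discretization step is omitted here because the corridor states are already discrete; CartPole needs the extra step covered in Detail 3.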
Detail 2: Pick the next best action to take
There is a concept in RL known as exploitation versus exploration. This means for every timestep, the agent can either carry out an action where it currently deems to be the best move (exploitation), or the agent can carry out a new move (exploration).
A technique for doing this is the epsilon-greedy method. Epsilon-greedy generates a random number and checks whether it is smaller than a variable (epsilon). If it is, a random action is taken (exploration). Otherwise, the best known action for the observation is taken (exploitation).
One final point: at the very beginning, the Q-tables are largely empty, hence you would want the agent to carry out more exploratory actions. However, you would want the agent to become more and more exploitative as the Q-tables fill up, which you achieve by decreasing epsilon over time.
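A minimal sketch of epsilon-greedy selection with a decaying epsilon (the decay schedule here is an illustrative assumption):

```python
import random

# Epsilon-greedy action selection: explore with probability epsilon,
# otherwise exploit the best known action for this observation.
def epsilon_greedy(q_row, epsilon):
    if random.random() < epsilon:
        return random.randrange(len(q_row))                  # explore: random action
    return max(range(len(q_row)), key=lambda a: q_row[a])    # exploit: best known action

# Decay epsilon over episodes so early episodes explore and later ones exploit.
def decayed_epsilon(episode, start=1.0, end=0.01, decay=0.99):
    return max(end, start * decay ** episode)
```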
Detail 3: Discretize the observational space
As we saw earlier, the observations returned from the environment are continuous values. However, for the creation of Q-tables, it is easier to convert these numbers to a discrete form. This can be achieved by binning each continuous observation into a fixed number of buckets.
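One possible implementation of such a binning function with numpy (the bucket counts and bounds are assumptions; the linked gist may differ):

```python
import numpy as np

# Map each continuous observation value to a bucket index, so the tuple of
# indices can be used as a key into the Q-table.
def discretize(observation, lower_bounds, upper_bounds, n_buckets):
    indices = []
    for value, low, high, n in zip(observation, lower_bounds, upper_bounds, n_buckets):
        edges = np.linspace(low, high, n + 1)[1:-1]   # interior bucket edges
        indices.append(int(np.digitize(value, edges)))
    return tuple(indices)
```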
This elegant way of discretizing is credited to this article.
Detail 4: Update the Q-tables
Finally we update the Q-table using the standard Q-learning update:
Q(state, action) ← (1 − α) · Q(state, action) + α · (reward + γ · max Q(next state, all actions))
The learned value (the term multiplied by α) is the estimated future reward after taking the new action. It consists of two parts: the current reward, as well as a discounted future reward. The discounted future reward is an estimate of the optimal future value after taking the action, multiplied by a discount factor γ (to account for it being a future reward).
The alpha (α) variable controls the rate at which the existing values of the Q-table are updated.
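The update formula can be sketched as a small function (the variable names are illustrative):

```python
# One Q-learning update: blend the old Q-value with the learned value,
# where alpha controls how fast the table moves toward new estimates.
def update_q(q_value, reward, best_next_q, alpha, gamma):
    learned_value = reward + gamma * best_next_q      # current + discounted future reward
    return (1 - alpha) * q_value + alpha * learned_value
```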
Detail 5: Update alpha
The final step updates the learning rate alpha. Similar to epsilon, in the beginning you may want the updates to be more aggressive. However as time progresses, you probably want to scale back on the updates to make your table more stable.
The gist for solving CartPole with Q-Learning can be found here:
CartPole-v0: Policy Gradients
The intuition for policy gradients is the following: given certain observations, what is the best policy for the agent to act on? Policy gradients are slightly more complex, but it is worth noting that many recent achievements in RL are attributed to policy gradients.
Programmatically, you can imagine that there is a main loop which figures out what the policies should be. In this case, we will be using a neural network and a main loop to find these policies. Hence we will describe the neural network and the four main details within this loop:
- Neural Network
- Forward Pass (calculate the output actions from the neural network)
- Calculate Rewards (calculate the training data to be fed into the neural network)
- Loss Function
- Backward Pass (update the weights of the neural network)
Detail 1: Neural Network
We show the neural network we built. It consists of two hidden layers of 32 nodes each.
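A hedged numpy sketch of such a network (the gist linked below likely uses a deep learning framework; the activations and weight initialization here are assumptions, while the layer sizes follow the text):

```python
import numpy as np

# Policy network sketch: 4 inputs (the CartPole observation), two hidden
# layers of 32 nodes, and 2 outputs (one probability per action).
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (4, 32))
W2 = rng.normal(0, 0.1, (32, 32))
W3 = rng.normal(0, 0.1, (32, 2))

def forward(observation):
    h1 = np.maximum(0, observation @ W1)        # hidden layer 1 (ReLU)
    h2 = np.maximum(0, h1 @ W2)                 # hidden layer 2 (ReLU)
    logits = h2 @ W3
    exp = np.exp(logits - logits.max())         # softmax over the two actions
    return exp / exp.sum()

probs = forward(np.zeros(4))
# probs is a length-2 vector of action probabilities summing to 1
```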
Detail 2: Forward Pass
Detail two feeds the current state of the environment into the neural network, which outputs action probabilities. We then make a random choice, weighted by the action probabilities provided by the neural network, to decide which action to take.
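Sampling the action from those probabilities might look like this (a sketch):

```python
import numpy as np

# Pick an action at random, weighted by the network's output probabilities.
def sample_action(action_probs, rng):
    return rng.choice(len(action_probs), p=action_probs)

rng = np.random.default_rng(0)
action = sample_action(np.array([0.0, 1.0]), rng)  # with these probs, always action 1
```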
Detail 3: Calculate Rewards
In detail two, we generated a list of actions from the forward passes to be fed into the environment. When we feed these actions to the environment, we end up with a list of rewards.
From this list of rewards, we calculate a vector known as the running reward. The running reward at each timestep takes into account the rewards earned from that timestep all the way to the end of the episode. However, rewards earned further in the future are discounted.
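The running reward described above can be computed as follows (a sketch, assuming a discount factor of 0.99):

```python
# Discounted cumulative rewards, computed by sweeping backwards through the
# episode: each timestep's value is its reward plus the discounted future.
def discount_rewards(rewards, gamma=0.99):
    running = 0.0
    discounted = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # later rewards count for less
        discounted[t] = running
    return discounted
```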
Detail 4: Loss Function
To describe the loss function of policy gradients, we first recall the negative log likelihood loss of supervised learning, L = −Σ y · log(ŷ), where y is the true label and ŷ is the predicted probability. Minimizing this loss is akin to maximizing the likelihood the model assigns to the correct labels. For policy gradients, the formula is similar, but with the addition of a term A, known as the advantage. In this implementation, the advantage is approximated by the cumulative discounted rewards.
Hence this gives us the following implementation: we take the log of the model’s output probability for the action actually taken (selected by multiplying the output probabilities by a one-hot encoding of the action taken), and this logarithmic term is further multiplied by the discounted reward.
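Putting that together, a numpy sketch of the loss (the names are illustrative):

```python
import numpy as np

# Policy-gradient loss sketch: the log-probability of each action actually
# taken, weighted by its discounted reward, summed and negated (since
# optimizers minimize).
def policy_gradient_loss(action_probs, actions_taken, discounted_rewards):
    # action_probs: (timesteps, n_actions); actions_taken: action index per timestep
    taken_probs = action_probs[np.arange(len(actions_taken)), actions_taken]
    return -np.sum(np.log(taken_probs) * discounted_rewards)
```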
Detail 5: Backward Pass
Finally, we use the Adam optimizer to update the weights of the neural network based on the log loss function.
The gist for solving CartPole with Policy Gradients can be found here:
Commercial Uses of Reinforcement Learning
Andrew Ng has been quoted as saying that the excitement and PR hype behind reinforcement learning is a bit disproportionate relative to the economic value it is creating today (source). Having said that, there have been some specific industries that have seen success in the application of RL. Examples are:
- Finance (Trade Execution in JP Morgan Chase)
- Control / robotics problems (traffic light control, data centre cooling)
- Real time sponsored search bidding (Alibaba)
You can tell that these applications all share the common trait of RL: namely, a way of defining the problem as an agent observing and taking actions within an environment, with very clear reward outcomes (for instance, data centre cooling rewards cost savings; traffic light control rewards decreases in delays / congestion).
This concludes the three part introductory series on Contemporary Classifications of Machine Learning. I hope you have enjoyed reading them as much as I had writing them!
The author is an adjunct professor at Singapore Institute of Technology (SIT). He holds a PhD in Computer Science from Imperial College. He also has a Masters in Computer Science from the NUS under the Singapore MIT Alliance (SMA) programme.
The views in this article are that of the author’s and do not necessarily reflect the official policies or positions of any organizations that the author is associated with. The author also holds no affiliations nor earns any fees from any products, courses or books mentioned in this article.