Understanding Google’s AlphaZero algorithm in the context of Connect 4.

### Introduction

AlphaGo shocked the world by defeating Lee Sedol in a 5 game series of Go in March 2016. Since then, several newer versions of AlphaGo has been released. In December 2017, the Google DeepMind team released AlphaZero — a general reinforcement learning based algorithm that learnt to play Chess, Shogi, and Go at world class levels after a few hours of training. Amazingly, the same network architecture was used to learn to play these 3 very different games. While AlphaGo had parts of the code optimized for the game of Go, AlphaZero removed any such specificities from the training; the network was only aware of the rules of the game.

This article details the results of my research on the AlphaZero algorithm with the goal of applying it to Connect 4.

### Connect 4

Connect 4 is played on a rectangular board with 6 rows and 7 columns. The board is oriented vertically, and players slide red and yellow pieces into the board from the top. The aim of the game is to connect four pieces in a row, either horizontally, vertically, or diagonally.

Connect 4 has been mathematically solved. It was first solved in 1995 via brute force methods, then subsequently via methods such as negamax and minimax. The conclusion for the game is a first player win: if the first player plays in the middle column, the player can force a win on the 41st move. If the first player plays on the outer 4 columns, the second player is able to force a win.

While there exists a perfect solution to the game, I wish to train a model based on reinforcement learning to learn to play the game. The existence of the perfect solutions mean that there exist perfect benchmarks to work toward.

### AlphaZero

The overarching idea of AlphaZero is a neural network training by using a new form of reinforcement learning. Starting from scratch, AlphaZero plays games against itself and gradually learns to become stronger and stronger. This process can be broken down into 2 main parts: Playing and Training.

#### Playing

To generate data to be trained on, AlphaZero plays against itself. This allows AlphaZero to continuously generate high quality data based on itself, rather than existing games played by humans or bots.

AlphaZero’s neural network takes in the game state as an input, and returns a `vector` and a `value`. The `vector` represents probabilities to take each possible move, and the value estimates the strength of the position. For instance, a `probability vector` with values `[0.2, 0.1, 0.5, 0.1, 0.0, 0.04, 0.06]` indicates that there is a 50% chance of taking the 3rd move. Illegal moves have their probabilities set to 0 in the vector. The `value` ranges from -1 to 1, with -1 predicting certain loss and 1 predicting certain victory. Draws have a score of 0.

The neural network is used in conjunction with a Monte Carlo Tree Search (MCTS) to select moves to play. To select a move to play, the game state is used as a root node to perform a MCTS. The MCTS traverses down the tree, selecting one move from each state it passes, until it reaches a leaf node — a previously unvisited state. The move selection favours moves which have a low visit count, high probability, and high value; this is described in more detail in later sections.

In classical MCTS, when a leaf node is reached, a simulation is played from that point to the end of the game, with randomly chosen moves. Depending on the results of the simulation, the values stored in the tree are updated.

With AlphaGo, instead of doing a simulated playout, the neural net is used to score the leaf node and obtain probabilities for exploring its child nodes. The `value` of the leaf node is transmitted up the tree, updating the nodes that were involved in reaching that state. The leaf node is then expanded, using the `probability vector` from the neural net to initialise the child nodes.

After going through many iterations of the MCTS algorithm, a probability vector is generated for the possible moves from the root node. The vector is calculated by the proportion of the number of times each move is selected during the MCTS.

A move is picked stochastically, using the distribution dictated by the probability vector. MCTS is then ran once again for the next game state, until the game ends. The game states and probability vector produced by the MCTS is stored. When the game ends, the final result is appended with each game state and probability vector. In other words, each data item consists of `(game state, probability vector, final result)`. The final result is +1 for a win, -1 for a loss, and 0 for a draw.

#### Training

As mentioned previously, AlphaZero plays against itself and learns. Its neural net outputs a `probability vector` and a `value`. How is the reward function defined then? How are the “correct values” defined for the network to train toward? AlphaGo applies Monte Carlo Tree Search (MCTS), combined with outputs from the neural network, to determine the “actual” value and probability vector of each game state.

From the Playing section, we see that the neural net is used as a heuristic in the Monte Carlo Tree Search, as a replacement for the simulated rollouts. Because MCTS searches through the next few steps in the game and aggregates the overall value of selecting each move, the `probability vector` that MCTS returns will almost always be better than the neural net’s predicted vector for the original game state at the root.

Hence, the returned `probability vector` from conducting MCTS is used as the “correct values” to calculate the loss from the neural net. The final result (+1, -1, or 0) from the game is used to calculate the loss against the value output of the neural net.

Here, 𝑧 is the final result from the game, while 𝑣 is the `value` output from the neural net. Thus, the first term, (𝑧−𝑣)², is the mean-squared error for the value.

𝝅 is the “ground truth” policy, the `probability vector` returned by the MCTS. 𝐩 is the `vector` output from the neural net. The second term is the cross-entropy loss between the two vectors.

The last element is a regularisation element, with 𝑐 controlling the level of L2 weight regularisation to prevent overfitting.

### AlphaZero: Further Details

#### Monte Carlo Tree Search

Each node in the MCTS contains 4 variables: 𝑃(𝑠, 𝑎), 𝑁(𝑠, 𝑎), 𝑊(𝑠, 𝑎), 𝑄(𝑠, 𝑎).

• 𝑃(𝑠, 𝑎) — Probability that action/move 𝑎 will be picked given state 𝑠.
• 𝑁(𝑠, 𝑎) — Number of times the (𝑠, 𝑎) pair (the node) is visited by MCTS.
• 𝑊(𝑠, 𝑎)— Total value accumulated by this node.
• 𝑄(𝑠, 𝑎)— Mean value of this node; Quality of the (𝑠,𝑎) pair.

When a leaf node is expanded, it initializes the children nodes with their move probabilities. The value 𝑣 assigned to the leaf node by the neural network is propagated upwards through the MCTS search tree by the following equations:

• 𝑁(𝑠, 𝑎) = 𝑁(𝑠, 𝑎) + 1
• 𝑊(𝑠, 𝑎) = 𝑊(𝑠, 𝑎) + 𝑣
• 𝑄(𝑠, 𝑎) = 𝑊(𝑠, 𝑎) / 𝑁(𝑠, 𝑎)

Move Selection in MCTS
While traversing from the root node to the leaf node, the algorithm selects moves at each node. This selection is designed to favour moves that have lower visit counts, higher probability, and higher value.

The move — also called action — is selected by a variation of the PUCT algorithm:

𝑄(𝑠, 𝑎) is the quality term stored in each node, indicating the mean value of this node. 𝑈(𝑠, 𝑎) is a term that depends on the probability of the move, as well as the frequency that the node is visited by the MCTS algorithm.

𝑈(𝑠, 𝑎) is proportional to 𝑃(𝑠, 𝑎) and the root of 𝑁(𝑠), with 𝑁(𝑠) being the visit count of the parent. As mentioned before, 𝑁(𝑠, 𝑎) refers to the visit count of the node itself.

Thus, we see that if the parent has been visited many times but this node has not been visited a lot, the 𝑈(𝑠, 𝑎) term would become larger.

#### Neural Net Architecture

The network used by AlphaZero consists of a body and two heads. The body is a “residual tower”; a stack of residual blocks. The body then splits into the policy head and the value head, which outputs the `probability vector` and `value` respectively.

I think of it as the body transforming the input game state onto an abstract feature space, and then from the feature space learning to suggest a distribution of moves to pick from or evaluate the value of those features.

The body is made up of 19 residual blocks, each of which consists of 2 rectified, batch-normalized convolutional layers with a skip connection: `convolution > batch norm > add back initial input > relu`.

The policy head has one rectified, batch-normalized convolutional layers, followed by a linear layer of size 19×19 for Go. Go has a board size of 19×19; each element in the linear layer represents the probability to play a move at that position on the board.

The value head applies a rectified, batch-normalized convolution of 1 filter of kernel size 1×1 with stride 1. This “squashes” the number of channels to 1. This is followed by a rectified linear layer of size 256 and a tanh linear layer of size 1. The tanh layer ensures that the value output is between -1 and 1.

All the convolutional layers, except for in the value head, use 256 kernels of size 3×3 with stride 1.

#### Game State Representation

The representing the state of the board is a non-trivial task and a surprisingly complex one. To encode all the possible moves for chess, for instance, many extra planes need to be added to the game state representation. For the sake of this project, I will not be going into the details of game state representation in Chess or Shogi.

In the game Go, like Connect 4, the pieces do not move after being placed on the board. Thus the game state representation is less complex in terms of encoding possible moves.

The game state representation used in AlphaZero represents the current state of the board, along with the 6 previous states of the board (6 moves prior to current time t). At each time, the state is separated into two planes that indicate the presence of each player’s pieces.

This image shows an example of how a Connect 4 board would be represented as a game state. The board is split into the pieces of each player, and the presence of their pieces are represented with a 1. Otherwise, the rest of the board is filled with 0s. There would be 7 pairs of the red pieces and yellow pieces, followed by a plane at the bottom that is either all 0s or all 1s, to indicate which player’s turn it is.

#### Exploration

As in all RL problems, there has to be some way to encourage exploration to allow discovery of new states. Besides move selection within the MCTS, AlphaZero encourages exploration through move selection during the game and addition of noise in MCTS.

Move Selection
After obtaining the probability vector from the MCTS, AlphaZero selects a move stochastically. To encourage more exploration, a temperature term is applied such that:

A higher value of 𝜏 would “flatten” the distribution over the moves, encouraging moves that have lower probability to be picked more.

In each game, AlphaZero uses a temperature of 1 for the first 30 moves, then anneals it to ~0 thereafter.