Blog: Beginner’s Guide to OpenAI Five at Dota2
This article is about OpenAI Five, a machine learning experiment aimed at creating a team of bots to play against human players at Dota2, a five versus five multiplayer online battle arena (MOBA). We will be exploring it from the gaming perspective:
- Introduction to OpenAI Five and Dota2
- Comparison of Dota2, Chess and Go.
- How OpenAI Five works?
- What we can learn from it?
[Section 1] Introduction to OpenAI Five and Dota2
Recently, OpenAI Five defeated the reigning world champions in a best-of-three series and achieved a 99.4% win rate in OpenAI Five Arena, a four-day event where Dota2 players signed up to play with OpenAI Five in competitive or cooperative mode.
- Competitive: 5 human players versus 5 OpenAI Five bots
- Cooperative: A mixture of human players and OpenAI Five bots on both teams
OpenAI Five has come a long way to achieve such results. The team first started with 1 versus 1 matches in Dota2, a special game mode which is often played to test individual skill and practice last-hitting. From there, the bot was able to learn the mechanical skills needed, such as laning, creep-blocking and killing enemies. After that, they moved on to training bots for 5 versus 5 matches. At that time, OpenAI Five was trained to play a stripped-down version of Dota2. Later, the team removed some of the limitations and continued to train the bot until it got better. However, training OpenAI Five is not an easy task because the game is patched from time to time.
It is undeniable that OpenAI Five is a great breakthrough in AI versus human experiments, because Dota2 is a very complex video game. Unlike Chess or Go, which are turn-based games, Dota2 relies on controlling a “Hero” in real time. The goal is to destroy a structure called the “Ancient”, which is located in the opposing team’s base. Throughout the game, players can attack or use skills to fight the opponents in order to gain experience points and gold, which can be used to purchase items or to buy back (instantly resurrect a dead hero instead of waiting X seconds for it to respawn). Timing and positioning are crucial parts of the game as they often determine the outcome of a teamfight.
Please note that the game being played is a modified version of Dota2 with some limitations, but the gameplay is still the same as the original. The game only allowed users to choose from a pool of 17 heroes. The original version has 115 heroes.
The modified version also has a pick phase where both teams take turns picking their heroes. In the diagram below, we can see the bots and human players engaging in the hero pick phase before the start of the game.
There are also some other gameplay rules which differ from the original Dota2.
- Illusion runes won’t spawn.
- Certain items that can summon illusions or units are not available.
Apart from this, OpenAI Five has some advantages compared to human players. Its observations come from the Bot API provided. Although the information is similar to what a human player gets (the bot does not know the status of units hidden in the fog of war, just like a player), the bot is able to remember everything it has observed. This means it has the information even if you reveal yourself for just a split second. Human players might miss or forget such an occurrence, which results in bad decision making. In addition, the bot is configured to perform 200 actions per minute (APM). Professional players usually hover around 200 to 300 APM. However, being able to perform 200 APM consistently is enough to outperform a normal human player easily.
In the gif above, we can see that the bot reacts with inhuman speed when performing a combo, surpassing even most professional players. Paired with all the information available and the fact that the hero pool limits the strategies that can be executed, OpenAI Five knows when to take a fight and when not to. Thus, it is no surprise that the bots claimed convincing wins over the human players. However, a few groups of players were able to identify the flaws and use them to their advantage to win against OpenAI Five. Having said that, OpenAI Five can be scary at times and can outskill professional players.
OpenAI Five does perform badly in certain tasks such as warding. Warding refers to purchasing wards and placing them at a location. Wards are items that provide vision over a wide area, allowing you and your team to keep watch over areas of the map and spy on the enemy team. From the gif above, we can see that the bot placed multiple wards in the same area and in its own base. This is a bad strategy because you would like to maximize your vision by placing wards in different areas, and there is no fog of war in your own base.
[Section 2] Comparison of Dota2, Chess and Go
AI beating human players is nothing new, and such feats have attracted a lot of hype recently. The most notable examples are Deep Blue in chess and AlphaGo in Go. What makes Dota2 any different from these games?
Long time horizons
Time horizon here refers to the average number of moves needed to finish a game. Since Dota2 is a real-time game, moves are counted in frames instead of turns, so a single match involves far more decisions than a game of chess or Go.
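To get a feel for the difference in scale, here is a rough back-of-the-envelope sketch. The frame rate, average match length and frame skip are figures reported in OpenAI's public write-up, and the chess and Go move counts are rough typical values; none of them come from this article, so treat them all as assumptions.

```python
# All constants below are assumptions taken from OpenAI's public write-up,
# not values measured or stated in this article.
FRAMES_PER_SECOND = 30      # Dota2 simulation rate
AVERAGE_GAME_MINUTES = 45   # typical match length
FRAMES_PER_DECISION = 4     # the bot acts on every fourth frame


def decisions_per_game(minutes=AVERAGE_GAME_MINUTES,
                       fps=FRAMES_PER_SECOND,
                       frame_skip=FRAMES_PER_DECISION):
    """Return how many decisions a bot makes in one match."""
    total_frames = minutes * 60 * fps
    return total_frames // frame_skip


# Rough typical game lengths in moves, for comparison only.
TYPICAL_MOVES = {"chess": 40, "go": 150, "dota2": decisions_per_game()}

for game, moves in TYPICAL_MOVES.items():
    print(f"{game}: about {moves} moves per game")
```

Under these assumptions a Dota2 match has on the order of twenty thousand decision points, compared with tens or hundreds in the board games.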
Partially observed state of the game
When it comes to chess and Go, both players can see the whole board. In Dota2, players do not have vision of the whole map. In addition, the map is a lot larger than the viewport that the player can see.
This is due to the fog of war mechanism. The vision of each hero is based on its line of sight as well as its altitude. Standing on high ground allows you to see those on lower ground, but those on lower ground do not have vision of the high ground.
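The high-ground rule can be sketched as a small visibility check. This is a deliberately simplified model, not the game's actual vision code: the `Unit` class, the two-level elevation and the `vision_range` default are hypothetical, and real line of sight also depends on things like trees and time of day.

```python
import math
from dataclasses import dataclass


@dataclass
class Unit:
    """Hypothetical unit with a map position and a coarse elevation level."""
    x: float
    y: float
    elevation: int  # 0 = low ground, 1 = high ground (simplified)


def can_see(observer, target, vision_range=1800.0):
    """Return True if observer has vision of target in this toy model.

    A target is visible when it is within range and not standing on
    higher ground than the observer. The range value is illustrative.
    """
    distance = math.hypot(target.x - observer.x, target.y - observer.y)
    in_range = distance <= vision_range
    not_uphill = target.elevation <= observer.elevation
    return in_range and not_uphill


hero_on_hill = Unit(x=0.0, y=0.0, elevation=1)
hero_below = Unit(x=500.0, y=0.0, elevation=0)
print(can_see(hero_on_hill, hero_below))  # the high-ground hero looks down
print(can_see(hero_below, hero_on_hill))  # the low-ground hero looks up
```

The asymmetry between the two calls is exactly what makes the game partially observed: two heroes at the same distance can have different information about each other.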
High dimensional action and observation space
OpenAI Five observes the game as a list of 20,000 numbers and acts by emitting a list of 8 enumeration values. The resulting observation and action spaces are huge, which makes Dota2 a very complex game for both AI and human players.
[Section 3] How OpenAI Five works?
Instead of being hard-coded, OpenAI Five is trained with a machine learning technique called reinforcement learning (RL): learning by trial and error, using feedback from its own actions and experiences to improve. Reinforcement learning has the following elements:
- Agent: The controllable unit that can perform certain actions
- Action: Available actions that can be performed
- Environment: Physical world of the game
- State: Current situation observed by the agent
- Reward: Feedback received from the environment. Positive means reward while negative means punishment.
For example, in the game of PacMan, the agent is PacMan itself. The actions are the left, right, up and down movements. The environment is the whole game board with the agent, ghosts, food, powerups and obstacles. The state is the current location of PacMan. The reward is a positive value each time a piece of food or a powerup is consumed and a negative value if the agent collides with a ghost. Finishing all the food is rewarded with a higher value so that the agent learns the end goal of the game.
However, applying reinforcement learning to Dota2 is not as easy as you might think, because a single outcome comes from a long sequence of actions. For example, to kill a hero, you move to a certain position, attack, cast skills and use items. All these actions occur over a few seconds, and some complex maneuvers can last up to minutes. This results in two big problems. First, an agent acting at random will almost never receive any meaningful positive feedback, so it ends up learning nothing. Second, how does the agent determine which action is responsible for the feedback?
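The five elements can be sketched in a few lines of code. This is a toy board written for illustration only: the class name, board size, reward values and the stationary ghost are all hypothetical choices, and the agent here simply acts at random instead of learning.

```python
import random
from dataclasses import dataclass, field

# Actions: the four movements, expressed as (dx, dy) offsets.
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


@dataclass
class PacManEnv:
    """Environment: a toy 4x4 board (hypothetical, ghost does not move)."""
    size: int = 4
    pacman: tuple = (0, 0)
    ghost: tuple = (3, 3)
    food: set = field(default_factory=lambda: {(1, 0), (2, 2)})

    def step(self, action):
        """Apply one action and return (state, reward, done)."""
        dx, dy = MOVES[action]
        x = min(max(self.pacman[0] + dx, 0), self.size - 1)
        y = min(max(self.pacman[1] + dy, 0), self.size - 1)
        self.pacman = (x, y)  # State: PacMan's current location
        if self.pacman == self.ghost:
            return self.pacman, -10.0, True  # punishment for hitting a ghost
        reward = 0.0
        if self.pacman in self.food:
            self.food.remove(self.pacman)
            reward = 1.0  # small reward for each piece of food
            if not self.food:
                return self.pacman, reward + 10.0, True  # bonus for clearing
        return self.pacman, reward, False


def random_policy(state):
    """Agent: picks any action, ignoring the state (no learning yet)."""
    return random.choice(list(MOVES))


def run_episode(env, policy, max_steps=50):
    """Run the agent-environment loop and return the total reward."""
    total, state = 0.0, env.pacman
    for _ in range(max_steps):
        state, reward, done = env.step(policy(state))
        total += reward
        if done:
            break
    return total


print(run_episode(PacManEnv(), random_policy))
```

A learning algorithm would replace `random_policy` with something that improves its choices based on the rewards returned by `step`.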
In the gif above, we can see that the controlled Hero performs the following actions:
- Attacks a creep, moves to a position and last-hits the creep.
- Moves to high ground and uses a skill.
- Chases and attacks an enemy hero until he finally kills it.
Which of these actions is responsible for the kill? Did attacking the creep contribute to the kill?
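One standard way reinforcement learning spreads credit back over earlier actions is the discounted return: every step is credited with the rewards that follow it, with later rewards counting less. The sketch below shows this generic technique only; it is not OpenAI's training code, and the reward values and discount factor are made up for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return the discounted sum of future rewards for every time step.

    returns[t] = rewards[t] + gamma * rewards[t+1] + gamma^2 * rewards[t+2] ...
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical episode: hit creep, move to high ground, chase, kill (+1).
episode_rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_returns(episode_rewards, gamma=0.5))
```

With this scheme, the actions right before the kill receive most of the credit, while the earlier creep attack still receives a small share.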
OpenAI addresses this issue with a custom reward function and by modeling each OpenAI Five hero as a single-layer, 1024-unit LSTM network. LSTM stands for long short-term memory, a recurrent neural network architecture. Its greatest strength is the ability to carry information across an entire sequence of inputs, which makes it well suited to classification and prediction problems on time-series data.
The reward function defines the signal that the agent tries to maximize on its way to winning the game. It is grouped into the following categories:
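To make the idea of "memory" concrete, here is a minimal sketch of a single step of a standard LSTM cell in plain Python. It follows the textbook LSTM equations and is not OpenAI's implementation; the helper names and the tiny sizes are hypothetical, whereas the real network has 1024 units and learned weights.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, weights, biases):
    """Run one time step of a standard LSTM cell on plain Python lists.

    weights[gate] is a matrix with one row per hidden unit and
    len(x) + len(h_prev) columns; biases[gate] has one entry per unit.
    """
    z = list(x) + list(h_prev)  # input joined with the previous hidden state

    def affine(gate):
        return [sum(w * v for w, v in zip(row, z)) + b
                for row, b in zip(weights[gate], biases[gate])]

    forget = [sigmoid(a) for a in affine("forget")]        # what to keep
    write = [sigmoid(a) for a in affine("input")]          # what to store
    read = [sigmoid(a) for a in affine("output")]          # what to expose
    candidate = [math.tanh(a) for a in affine("candidate")]

    # The cell state c is the long-term memory carried between steps.
    c = [f * c_old + i * g
         for f, c_old, i, g in zip(forget, c_prev, write, candidate)]
    h = [o * math.tanh(c_t) for o, c_t in zip(read, c)]
    return h, c


def zero_parameters(input_size, hidden_size):
    """Build all-zero parameters, useful only for checking the arithmetic."""
    gates = ("forget", "input", "output", "candidate")
    width = input_size + hidden_size
    weights = {g: [[0.0] * width for _ in range(hidden_size)] for g in gates}
    biases = {g: [0.0] * hidden_size for g in gates}
    return weights, biases


w, b = zero_parameters(input_size=2, hidden_size=1)
h, c = lstm_step([1.0, -1.0], h_prev=[0.0], c_prev=[2.0], weights=w, biases=b)
print(h, c)
```

Calling `lstm_step` once per observation and feeding `h` and `c` back in is what lets the network relate an action now to something it saw many steps earlier.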
- Individual score
- Building score
- Lane assignment
Please refer to the full reward function in the following link:
From the individual score, we can see how the rewards are defined by the researchers. These scores can be adjusted depending on how you want to shape the goals. For example, if you would like the bot to put more emphasis on killing heroes, you assign a higher score to it. As a result, the bot will learn to kill heroes whatever it takes. However, we can see that a kill has a weight of -0.6. The reason is that killing a hero already grants experience points and gold, which are rewarded separately. The negative weight offsets part of that gain to prevent the bot from becoming too biased toward kills. The net gain from killing a hero is still positive.
All these rewards are processed further because Dota2 is a five-player team game (hence OpenAI Five). The processing works as follows:
- Zero sum: one team’s gain is the other team’s loss, so the scores of the two teams add up to 0. This ensures that whenever one team tries to destroy a tower to gain 1.0, the opposing team has a reason to defend it, because it loses 1.0 when the tower falls.
- Team spirit: Dota2 is not a single-player game. You need to cooperate with your teammates to win. As a result, OpenAI’s team introduced a team spirit parameter to further tune the reward based on the situation of the other teammates. This forces the agent to take its teammates’ condition into consideration before performing an action. For example, when both teams are engaged in a teamfight and the agent observes that a teammate is being chased at low health, it will use skills or items to save the teammate instead of attacking the enemy.
- Time scaling: Dota2’s gameplay is divided into 3 phases: early game (laning phase), mid game (moving around the map to kill enemies) and late game (teamfights and destroying towers to win). In the late game, the agent has more abilities, items, kills and gold. As a result, the reward is scaled according to game time, giving higher rewards early in the game and lower rewards late in the game. This allows the agent to prioritize and use different strategies depending on the stage of the game.
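A shaped reward of this kind is essentially a weighted sum over game events. In the sketch below, only the -0.6 kill weight comes from the article; the gold and experience weights and the amounts granted by a kill are hypothetical numbers chosen to illustrate how the net reward for a kill can still be positive.

```python
# Only the kill weight is taken from the article. The other weights are
# hypothetical placeholders for illustration.
REWARD_WEIGHTS = {
    "kill": -0.6,
    "gold": 0.006,        # per unit of gold gained (assumed)
    "experience": 0.002,  # per experience point gained (assumed)
}


def individual_reward(events, weights=REWARD_WEIGHTS):
    """Return the weighted sum of the events observed in one time step."""
    return sum(weights[name] * amount for name, amount in events.items())


# Hypothetical kill that also grants 200 gold and 300 experience.
kill_events = {"kill": 1, "gold": 200, "experience": 300}
print(individual_reward(kill_events))
```

Raising or lowering a single weight changes what the bot cares about, which is how the researchers shape its goals without scripting its behavior.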
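The three processing steps can be sketched together as one function. This is an illustrative reading of the description above, not OpenAI's code: the function name, the order of the steps, and the `decay_base` and `decay_period` defaults are assumptions.

```python
def postprocess_rewards(own_team, enemy_team, team_spirit, minutes,
                        decay_base=0.6, decay_period=10.0):
    """Apply team spirit, zero sum and time scaling to one team's rewards.

    own_team and enemy_team are lists of raw per-hero rewards.
    team_spirit is between 0 (selfish) and 1 (fully shared).
    decay_base and decay_period are assumed values for the time scaling.
    """
    own_mean = sum(own_team) / len(own_team)
    enemy_mean = sum(enemy_team) / len(enemy_team)
    # Rewards shrink as the game goes on, so early gains count for more.
    time_scale = decay_base ** (minutes / decay_period)

    processed = []
    for raw in own_team:
        # Team spirit: blend the hero's own reward with the team average.
        blended = team_spirit * own_mean + (1.0 - team_spirit) * raw
        # Zero sum: whatever the enemy gains on average, this hero loses.
        zero_sum = blended - enemy_mean
        processed.append(zero_sum * time_scale)
    return processed


radiant = [1.0, 0.0, 0.0, 0.0, 0.0]  # hypothetical raw rewards
dire = [0.0, 0.5, 0.0, 0.0, 0.0]
print(postprocess_rewards(radiant, dire, team_spirit=0.3, minutes=5.0))
```

Note that blending with the team average does not change the team's total, so after the zero-sum step the rewards of all ten heroes add up to 0 at any given game time.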
OpenAI Five is trained to weigh near-term rewards more heavily than distant ones: the latest model values future rewards with a half-life of five minutes. In other words, the agent has little incentive to pursue a strategy whose payoff lies far beyond that horizon. Item and skill builds are not part of the reward system because they are hardcoded rather than learned. The agent chooses which build to use at random.
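A half-life like this corresponds to a per-step discount factor, usually written as gamma. The sketch below shows the conversion; the rate of 7.5 decisions per second is an assumption (one decision every fourth frame at 30 frames per second, as described in OpenAI's write-up) and is not stated in this article.

```python
import math


def gamma_from_half_life(half_life_seconds, decisions_per_second):
    """Return the per-step discount whose weight halves after the half-life."""
    steps = half_life_seconds * decisions_per_second
    return 0.5 ** (1.0 / steps)


def half_life_from_gamma(gamma, decisions_per_second):
    """Return the half-life in seconds implied by a per-step discount."""
    steps = math.log(0.5) / math.log(gamma)
    return steps / decisions_per_second


DECISIONS_PER_SECOND = 7.5  # assumed decision rate
gamma = gamma_from_half_life(5 * 60, DECISIONS_PER_SECOND)
print(round(gamma, 6))
print(half_life_from_gamma(gamma, DECISIONS_PER_SECOND) / 60)  # in minutes
```

A reward that arrives five minutes from now counts half as much as the same reward now, and one that arrives twenty minutes from now counts only one sixteenth as much.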
Training data via self-play
There is still one big issue that needs to be solved. How does OpenAI Five obtain the data it needs for training? The answer is simple: self-play. This allows the agent to generate the vast amount of data needed for training without any human input. In the beginning, the heroes just walk aimlessly around the map hitting some creeps. Later on, they begin to learn how to last-hit creeps and kill enemies or towers. With more training data, they are able to move between lanes and fight as a team. This is like how we learn to walk. We first crawl around aimlessly. Then, we try to stand up and balance ourselves. Once we have mastered balancing, we learn to walk and run. However, running the simulations needed to train the bot via self-play is no easy feat. OpenAI used 256 P100 GPUs on GCP to train each agent to play 180 years of Dota2 per day.
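The figure of 180 years per day is easier to appreciate as a speedup over real time. The short sketch below does the arithmetic; the only input taken from the article is the 180 years figure, and the helper names are made up for illustration.

```python
DAYS_PER_YEAR = 365


def speedup_over_real_time(game_years_per_day):
    """Return how many days of gameplay are produced per real day."""
    return game_years_per_day * DAYS_PER_YEAR


def wall_clock_days_for(game_years, game_years_per_day=180):
    """Return how many real days it takes to accumulate the experience."""
    return game_years / game_years_per_day


print(speedup_over_real_time(180))  # days of play generated per real day
print(wall_clock_days_for(1800))    # real days needed for 1800 game years
```

In other words, the training system has to generate gameplay tens of thousands of times faster than a single real-time match, which is why so much parallel hardware is needed.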
[Section 4] What we can learn from it?
All in all, OpenAI Five demonstrated a great breakthrough in using AI in a real-time strategy game. If we look at the bigger picture, it is about more than just AI beating humans at a video game. The gist is that AI has the potential to be used in various industries.
From the game developer's perspective, AI can be used for game balancing. Letting AI play the game against itself allows developers to uncover bugs and strategies that they never thought of. For example, OpenAI Five has a tendency to buy back instantly once a hero is killed, even in the early game. Most players do not do this unless it is an important teamfight or the late game, as buyback carries penalties which hurt the long-term strategy and the chance of winning. This gives game developers some insight into the different strategies that are possible in the game. Furthermore, the same data can be used to test which characters are overpowered (OP) and which are weak. This helps the developers to balance the gameplay by fine-tuning the game.
However, the strategies produced by AI can be misleading at times. For example, Dota2 has different roles for the players throughout the game. Players usually categorize the roles into carry (responsible for farming and acting as the main damage dealer), semi-carry (responsible for tanking and initiating, or roaming around to find kills) and support (responsible for helping the carry, placing wards and assisting in teamfights to kill enemies or save teammates). Only the carry and semi-carry are cores of the team, so killing a core is a lot more important than killing a support. Sacrificing a support to rescue a carry is the norm. However, OpenAI Five does not see it this way: to it, all heroes are cores with different jobs. The bots will not let a support die unless they judge that saving it is not worth it. As a result, game developers must not be overly influenced by such strategies when designing the game, as the game is meant to be played by human players and not AI.
As for the players, there is the possibility of cooperative AI: think of it as having AI as your coach, friend or opponent. This allows us to rethink the way our games are played. As more and more games are released, players tend to concentrate on a handful of top titles. As a result, certain games or servers become deserted. When there are not enough players, AI technology can be used to mitigate this issue.
OpenAI Five gives us a clearer picture of how to use AI in video games, with an actual use case. OpenAI’s team found that binary rewards can give good performance when training an agent. In addition, certain tasks can be learnt from scratch. In their blog post, they mentioned that the bots learnt to “creep block” from scratch in a 2v2 environment, after the behavior had first been taught in the 1v1 setting. However, there are situations where we cannot explain certain behaviors performed by the AI. We can only shape the goal and the learning process; what the AI learns and how it behaves are not within our control. Just as we tell our peers that AI is meant to help us and not replace us, at the end of the day, it is entirely up to each individual to decide whether AI is a friend or a foe.
Feel free to check out the following links for more information on OpenAI Five.