### Blog: Open Minded AI: Improving Performance by Keeping All Options on the Table

## How I made my Reinforcement Learning agent perform better by making it stop going naively after the highest reward

*The Q-Learning and Max-Entropy-Q-Learning algorithms discussed here are implemented in my* `warehouse` *library. The Tic-Tac-Toe game described, as well as the trained models, can be found in my* `tic_tac_toe` *repository on **my GitHub page**.*

When I’m asked to describe what fascinates me so much about Reinforcement Learning, I usually explain that I see it as training my computer the same way I trained my dog: using nothing but rewards. My dog learned to sit, wait, come over, stand, lie down and pretend to be shot (kudos to my wife), all in the exact same way: I rewarded her every time she did what I asked her to. She had no idea what we wanted from her every time we tried to teach her something new, but after enough trial-and-error, and some trial-and-success, she figured it out. The exact same thing happens when a Reinforcement Learning model is being taught something new. And that’s absolutely amazing.

Chasing the highest expected reward is the fundamental engine underneath pretty much all Reinforcement Learning models; each one simply has its own method of accomplishing it. But when examined from a higher perspective, naively chasing rewards is quite narrow-minded. Following such a strategy prevents us from adapting quickly to unexpected changes, as we never maintain a Plan B. Our entire exploration phase is performed only for the sake of finding the *best* possible way, with far less attention paid to other good options. Can we teach our models to open up their minds? And should we?

#### Maximizing Entropy Rather Than Rewards

Not too long ago I stumbled upon this blogpost by the Berkeley AI Research Center (which very briefly summarizes this paper), which suggests a novel method for learning: instead of learning the path which provides the highest reward, follow the one that provides the most positive options to choose from. In other words, teach your model to increase its *action-selection entropy*.

Here’s an example: let’s consider a simple Q-Learning algorithm, and examine the following scenario:

Assume we are standing in state *s*, and can choose one of two actions: action *X* and action *Y*. From there we will reach state *s’*, where three options will be possible: *1*, *2* and *3*, each of which takes us to the terminal state, where we receive a reward. We see that if we stand in *s’* after selecting *X*, our policy will be to choose action *2* no matter what. On the other hand, if we reached *s’* after selecting *Y*, we can be a bit more flexible, though action *2* is still the best. In other words, the policy’s *entropy* in *s’* after *X* is very low, as it focuses solely on a single action, while after *Y* it’s higher, as it can afford to try all actions with a reasonable probability.
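To put rough numbers on this, here is a minimal sketch. The Q-values are made up for illustration (they are not read from the diagram), and turning Q-values into probabilities with a softmax is an assumption of this sketch, not necessarily how the agent in the scenario picks its actions.

```python
import math


def softmax(q_values):
    """Turn Q-values into action probabilities (Boltzmann policy, temperature 1)."""
    peak = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp(q - peak) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy of a probability distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical Q-values for actions 1, 2, 3 in s' (illustrative only)
q_after_x = [0.0, 5.0, 0.0]  # only action 2 is worthwhile
q_after_y = [3.0, 4.0, 3.0]  # action 2 is best, but the others are close

entropy_after_x = entropy(softmax(q_after_x))
entropy_after_y = entropy(softmax(q_after_y))

print(f"entropy after X: {entropy_after_x:.3f} nats")
print(f"entropy after Y: {entropy_after_y:.3f} nats")
```

With these made-up values the entropy after *Y* comes out far higher than after *X*, and it can never exceed ln(3), the entropy of choosing uniformly among three actions.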

But why bother? It’s clear from the diagram that the optimal option will be action *2* from state *s’*. True, but what will happen if there’s a sudden change in the environment? A bug, a modification, or an unexpected action by an opponent? If such a change suddenly prevents us from taking action *2*, then action *X* becomes the wrong decision.

But there’s more to it than just paranoia. Just a few lines ago we agreed that after choosing action *Y* we can be more *flexible* in the next action selection. While there’s still a *best* option, the others are not too far off, and this allows the model to explore these other actions more, as the price paid for not choosing the optimal one is low. The same cannot be said about the scenario after choosing action *X*, and as we know, sufficient exploration is crucial for a robust Reinforcement Learning agent.

#### Let’s Talk Business

How to design a general policy that encourages an agent to maximize entropy is presented in the paper I linked to above. Here I’d like to focus on the *Soft Bellman Equation* (discussed in the blogpost I referred to). Let’s first refresh our memories with the regular Bellman Equation:
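Written in the usual textbook form for Q-Learning with deterministic transitions (the notation is assumed here: *r* is the immediate reward, *γ* the discount factor, and *s’*, *a’* the next state and action):

$$
Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a')
$$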

The Soft Bellman Equation tries to maximize *entropy* rather than future reward. Therefore, it replaces the last term, where we maximize over the future Q-value, with an entropy-maximization term. And so, in the case of a finite number of actions, the Soft Bellman Equation is:
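With *r* the immediate reward, *γ* the discount factor and *s’*, *a’* the next state and action (standard notation, assumed here, with the entropy temperature fixed to 1), the finite-action form is commonly written as a log-sum-exp over the next state’s Q-values:

$$
Q(s, a) = r(s, a) + \gamma \log \sum_{a'} \exp Q(s', a')
$$

The only difference from the regular update is that the hard $\max_{a'}$ is swapped for its smooth counterpart.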

If you’d like to see the mathematical proof of how this new term relates to entropy, the blogpost authors say it can be found in this 236-page Ph.D. thesis by Brian Ziebart. If you’ve ever taken a Thermodynamics class, you can get the general idea if you recall that the thermodynamic entropy of a gas is defined as *S=k⋅lnΩ*, where the number of configurations at equilibrium, *Ω*, is *Ω≈exp(N)*, and *N* is the number of particles. If you have no idea what you’ve just read, you’ll just have to trust me (or read the thesis).
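For a more hands-on feel, the connection can also be checked numerically. For a finite set of actions, the log-sum-exp of the Q-values equals the expected Q-value under the softmax policy plus that policy’s entropy. The sketch below is not taken from my library: the function names, the discount factor of 0.9 and the sample Q-values are all illustrative assumptions.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a finite list of values."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def hard_backup(reward, next_q_values, gamma=0.9):
    """Regular Bellman target: reward plus the discounted best next Q-value."""
    return reward + gamma * max(next_q_values)


def soft_backup(reward, next_q_values, gamma=0.9):
    """Soft Bellman target: the max is replaced by a log-sum-exp."""
    return reward + gamma * log_sum_exp(next_q_values)


def expected_q_plus_entropy(q_values):
    """E_pi[Q] + H(pi), where pi is the softmax (Boltzmann) policy over Q."""
    lse = log_sum_exp(q_values)
    probs = [math.exp(q - lse) for q in q_values]
    expected_q = sum(p * q for p, q in zip(probs, q_values))
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return expected_q + entropy


# Hypothetical next-state Q-values (illustrative only)
next_q = [3.0, 4.0, 3.0]

print("hard target:", hard_backup(1.0, next_q))
print("soft target:", soft_backup(1.0, next_q))
print("log-sum-exp:", log_sum_exp(next_q))
print("E[Q] + H   :", expected_q_plus_entropy(next_q))
```

The last two printed numbers agree up to floating-point error, which is the sense in which the new term rewards entropy. The soft target is never smaller than the hard one, and the gap grows when several actions have similar Q-values.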

#### Does It Work?

If you’ve read some of my blogposts so far, you might have noticed that I like to test things myself, and this case is no different. I once wrote about a Tic-Tac-Toe agent I trained using Deep Q-Networks, so I decided to edit it and allow it to learn using the Soft Bellman Equation too. I then trained two types of agents: regular Q-Network agents and Max-Entropy-Q-Network agents. I trained one agent of each type by having them play against each other, and then another agent of each type by having it play separately against an external player. I repeated this process 3 times, ending up with 6 different trained models of each type. I then matched all regular Q-Network agents against all Max-Entropy-Q-Network agents to see which type wins the most games. I also forced the agents to select a different first move each game (to cover all possible game openings), and made sure each agent got to play both *X* and *O*.

The results are very clear: of the 648 games played, the Soft Bellman agents won 36.1% of the games (234), 33.5% ended in a tie (217) and only 30.4% of the games (197) were won by the regular Q-Network agents.

When considering only the games played without me forcing the agents to make a specific first action, but rather letting them play as they wish, the results were even more in favor of the Soft Bellman agents: of 72 games played, 40.3% (29) were won by the Max-Entropy-Q-Network, 33.3% (24) were won by the regular agents, and the rest 26.4% (19) ended with no winner. I encourage you to run this experiment yourself too!
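If you do, the game counts above are consistent with the following schedule. This is a reconstruction from the numbers in this post, not the actual evaluation script: the assumption is that every regular model meets every Max-Entropy model on both sides of the board, once per forced opening cell.

```python
from itertools import product

NUM_MODELS_PER_TYPE = 6  # from the post: 6 trained models of each type
SIDES = ("X", "O")       # each agent plays both sides
BOARD_CELLS = 9          # assumption: one forced first move per board cell

regular_models = range(NUM_MODELS_PER_TYPE)
max_entropy_models = range(NUM_MODELS_PER_TYPE)

# Every pairing, on both sides, with each possible forced opening
forced_games = list(product(regular_models, max_entropy_models, SIDES, range(BOARD_CELLS)))
# Every pairing, on both sides, with no forced opening
free_games = list(product(regular_models, max_entropy_models, SIDES))


def share(count, total):
    """Percentage of `total`, rounded to one decimal place."""
    return round(100 * count / total, 1)


print(len(forced_games), "games with a forced first move")
print(len(free_games), "games without a forced first move")
print("Soft Bellman win rate (forced):", share(234, len(forced_games)))
print("Soft Bellman win rate (free):", share(29, len(free_games)))
```

The same `share` helper reproduces the other percentages quoted above from their raw counts.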

#### Final Words

This experiment has demonstrated that when learning complex systems, and even not-so-complex ones, following broader objectives than the highest reward can be quite beneficial. As I see it, teaching a model such a broader policy is as if we no longer treat the agent as a pet we wish to train, but as a human we’re trying to teach. I’m excited to see what more we can teach it in the future!
