Reinforcement Learning: An Introduction - Sutton and Barto

Exercise 1

Dhruv Srikanth
January 6th, 2023

Abstract

As I work through Rich Sutton and Andrew Barto's book, Reinforcement Learning: An Introduction, I will explore some of the concepts and ideas I come across. Here are the solutions I came up with for Exercise 1.

Reference: Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto

Problem 1.1
Self-Play Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself. What do you think would happen in this case? Would it learn a different way of playing?

This depends on how the policies are initialized. If both agents start with random policies, each agent will learn to exploit weaknesses in the other's policy. We can view this as a minimax game. Eventually the agents will converge to a Nash equilibrium in which both play the same strategy, or policy.

Another method for policy initialization is to initialize both agents with the same policy. This leads to the same outcome as the case above. We can give a formal proof by contradiction.

Suppose an agent converges to a sub-optimal policy. An adversarial agent will learn to exploit this policy, and the first agent will then adjust its policy in response. This implies that the first agent had not actually converged to an optimal policy, i.e. had not found the Nash equilibrium. Hence, by contradiction, when two agents engage in adversarial self-play, they will converge to the optimal policy.
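
To make the self-play idea concrete, below is a minimal sketch of tabular self-play for tic-tac-toe using the value-update rule from Chapter 1, V(s) <- V(s) + alpha * (V(s') - V(s)). The helper names (winner, choose_move, self_play) and the choice to update after every move are my own simplifications, not something prescribed by the book.

    import random
    from collections import defaultdict

    LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

    def winner(board):
        for a, b, c in LINES:
            if board[a] != "." and board[a] == board[b] == board[c]:
                return board[a]
        return None

    def choose_move(board, player, value, epsilon=0.1):
        moves = [i for i, cell in enumerate(board) if cell == "."]
        if random.random() < epsilon:  # exploratory move
            return random.choice(moves)
        # greedy move: pick the afterstate the player currently rates highest
        return max(moves, key=lambda m: value[tuple(board[:m] + [player] + board[m + 1:])])

    def self_play(episodes=50000, alpha=0.1):
        # each player keeps its own value table; unseen states default to 0.5
        values = {"X": defaultdict(lambda: 0.5), "O": defaultdict(lambda: 0.5)}
        for _ in range(episodes):
            board, player = ["."] * 9, "X"
            prev = {"X": None, "O": None}
            while True:
                board[choose_move(board, player, values[player])] = player
                state = tuple(board)
                win, full = winner(board), "." not in board
                # back up the player's previous afterstate toward the new estimate
                target = 1.0 if win else (0.5 if full else values[player][state])
                if prev[player] is not None:
                    values[player][prev[player]] += alpha * (target - values[player][prev[player]])
                prev[player] = state
                if win or full:
                    # final backup for the other player: 0 for a loss, 0.5 for a draw
                    other = "O" if player == "X" else "X"
                    if prev[other] is not None:
                        final = 0.0 if win else 0.5
                        values[other][prev[other]] += alpha * (final - values[other][prev[other]])
                    break
                player = "O" if player == "X" else "X"
        return values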

Problem 1.2
Symmetries Many tic-tac-toe positions appear different but are really the same because of symmetries. How might we amend the reinforcement learning algorithm described above to take advantage of this? In what ways would this improve it? Now think again. Suppose the opponent did not take advantage of symmetries. In that case, should we? Is it true, then, that symmetrically equivalent positions should necessarily have the same value?

We can use a symmetry function to map the state space to a smaller set of canonical states. With this reduced state space, the agent can learn an optimal policy faster, since symmetrically equivalent positions share a single value estimate. Because fewer states need to be visited, we can also afford to spend relatively more time on exploitation than exploration.
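
As a concrete illustration, here is a sketch of one such symmetry function for tic-tac-toe: it generates the eight rotations and reflections of a board and returns a fixed canonical representative, so symmetrically equivalent positions index the same entry in the value table. The function names are illustrative, not from the book.

    def rotate(board):
        # rotate a 3x3 board (a tuple of 9 cells) 90 degrees clockwise
        return tuple(board[i] for i in (6, 3, 0, 7, 4, 1, 8, 5, 2))

    def reflect(board):
        # mirror the board left-to-right
        return tuple(board[i] for i in (2, 1, 0, 5, 4, 3, 8, 7, 6))

    def canonical(board):
        # enumerate the 8 symmetries (4 rotations, each optionally reflected)
        # and pick a fixed representative, e.g. the lexicographically smallest
        variants, b = [], tuple(board)
        for _ in range(4):
            variants += [b, reflect(b)]
            b = rotate(b)
        return min(variants)

The agent would then look up and update value[canonical(state)] instead of value[state].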

If the opponent does not take advantage of symmetries, then we should not either. An opponent that plays differently in symmetrically equivalent positions can exploit the fact that our agent cannot distinguish between them. This is why symmetrically equivalent positions should not necessarily have the same value, i.e. the value function should not necessarily assign the same value to symmetrically equivalent states.

Problem 1.3
Greedy Play Suppose the reinforcement learning player was greedy, that is, it always played the move that brought it to the position that it rated the best. Would it learn to play better, or worse, than a nongreedy player? What problems might occur?

Let us consider a two-player minimax game between an agent and an adversary, where the agent follows a greedy policy and the adversary follows a non-greedy policy. The greedy agent never explores the environment; it only exploits its current value estimates, so it never discovers moves it initially undervalues. The non-greedy adversary will learn the environment, adjust its policy accordingly, and exploit the greedy agent's policy. The greedy player will therefore generally learn to play worse.
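
The difference can be made explicit in the move-selection rule. Below is a sketch contrasting a purely greedy rule with an epsilon-greedy one; value, legal_moves, and afterstate are assumed helpers, not definitions from the book.

    import random

    def greedy_move(board, player, value, legal_moves, afterstate):
        # always exploit current estimates: moves the agent undervalues are never tried
        return max(legal_moves(board), key=lambda m: value[afterstate(board, m, player)])

    def epsilon_greedy_move(board, player, value, legal_moves, afterstate, epsilon=0.1):
        # occasionally explore, so estimates for untried afterstates keep improving
        if random.random() < epsilon:
            return random.choice(legal_moves(board))
        return greedy_move(board, player, value, legal_moves, afterstate)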

Problem 1.4
Learning from Exploration Suppose learning updates occurred after all moves, including exploratory moves. If the step-size parameter is appropriately reduced over time, then the state values would converge to a set of probabilities. What are the two sets of probabilities computed when we do, and when we do not, learn from exploratory moves? Assuming that we do continue to make exploratory moves, which set of probabilities might be better to learn? Which would result in more wins?

As a side note, we can consider the step-size parameter to be analogous to the learning rate in gradient descent and the temperature parameter in simulated annealing.

When we do learn from exploratory moves, we compute the state-action probabilities; when we do not, we compute the state probabilities.

It would be better to learn the state-action probabilities, as they contain more information about the environment.
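
A small sketch of how the two regimes differ in code, assuming the Chapter 1 backup V(s) <- V(s) + alpha * (V(s') - V(s)); the flag names are mine, not the book's. The only difference between the regimes is whether the backup is applied after an exploratory move.

    def backup(value, s, s_next, alpha, was_exploratory, learn_from_exploration):
        # skip the backup after an exploratory move unless we choose to learn from it
        if was_exploratory and not learn_from_exploration:
            return
        value[s] += alpha * (value[s_next] - value[s])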

Problem 1.5
Other Improvements Can you think of other ways to improve the reinforcement learning player? Can you think of any better way to solve the tic-tac-toe problem as posed?

Improvements to the agent that I can think of are:

  1. We can reduce the variance of the agent's policy by first training it with an actor-critic algorithm and then using the resulting actor's policy in the self-play setting with the REINFORCE algorithm.
  2. A high-quality dataset of expert play for the critic can help the agent learn off-policy.
  3. We can use function approximation methods such as deep learning to learn the policy and the value function (a minimal sketch follows this list).
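
As a sketch for item 3, under the assumption that we replace the value table with a parametric approximator: here a simple linear model over a one-hot board encoding is trained with the same TD-style error, and a deep network could be swapped in for the linear model. All names are illustrative.

    import numpy as np

    def encode(board):
        # one-hot encode each of the 9 cells as (X, O, empty) -> 27 features
        x = np.zeros(27)
        for i, cell in enumerate(board):
            x[3 * i + {"X": 0, "O": 1, ".": 2}[cell]] = 1.0
        return x

    class LinearValue:
        def __init__(self, alpha=0.01):
            self.w = np.zeros(27)
            self.alpha = alpha

        def predict(self, board):
            return float(self.w @ encode(board))

        def update(self, board, target):
            # semi-gradient update toward the (possibly bootstrapped) target
            x = encode(board)
            self.w += self.alpha * (target - self.w @ x) * x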

Contact