What makes up an RL problem?

Dhruv Srikanth
December 26th, 2022

Abstract

Reinforcement learning (RL) is a field of machine learning suited to problems where an agent learns to take actions in an environment to maximize long-term returns. In other words, RL deals with the problem of sequential decision making. As I work through Richard Sutton and Andrew Barto's book, Reinforcement Learning: An Introduction, I will be exploring some of the concepts and ideas that I come across.

Reference: Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto

The components that make up a reinforcement learning problem are (sketched in code after the list):

  1. A policy - This determines an agent's next action given its current state, i.e. a mapping from states to actions at each time step.
  2. A reward signal - This defines the immediate "goodness" of a state transition: a scalar value received when an action takes the agent from the current state to the next state.
  3. A value function - This determines the goodness of a state given the agent's current policy, i.e. the long-term reward, or return, expected under that policy.
  4. A model - This determines the next state and reward, given the current state and action. This is used to estimate the value function.
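As a rough sketch of how these four pieces might be organized in code, here is one possible set of interfaces. The class and method names below are my own illustrative assumptions, not an API from the book; the reward signal is treated as something the environment returns on each transition rather than as a separate class.

```python
# A minimal sketch of the components as plain Python interfaces.
# All names here are illustrative assumptions, not a standard API.
from abc import ABC, abstractmethod
from typing import Tuple

State = int     # assume discrete states for simplicity
Action = int    # assume discrete actions for simplicity


class Policy(ABC):
    @abstractmethod
    def act(self, state: State) -> Action:
        """Map the current state to an action."""


class ValueFunction(ABC):
    @abstractmethod
    def value(self, state: State) -> float:
        """Estimate the expected return from a state under the current policy."""


class Model(ABC):
    @abstractmethod
    def predict(self, state: State, action: Action) -> Tuple[State, float]:
        """Predict the next state and the reward for a state-action pair."""
```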

The policy determined by an agent charts a course that maximizes the expected return for the agent. It therefore seems that RL problems are concerned with searching the policy space to find the optimal policy. The policy space, however, is determined by the environment, i.e. the state space and action space, and can often be extremely large. Hence, it seems natural that the next step is to find a way to approximate the value function. One approach is to use a deep learning model (a neural network) to approximate the value function, as well as the policy; other methods, such as Monte Carlo methods, can also be used. An alternative to this is evolutionary methods, which search for the optimal policy by evolving candidate policies directly. If we consider episodic tasks, neural approximations can be used to improve a policy within an episode, whereas evolutionary methods select among policies across episodes. That being said, neural approximations still learn an optimal policy across episodes by greedily optimizing each episode. Based on the size of the state space, action space, and policy space, we can determine which method is more suitable for a given problem. It seems to me that when the policy space or state space is small, evolutionary methods are more suitable, since we are searching directly over the policy space. However, when the policy space is large, neural approximations are more suitable as a way of efficiently searching the policy space.
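To make the evolutionary idea concrete, here is a minimal sketch of random hill-climbing over the parameters of a linear policy, where each candidate policy is scored by the return of an episode. The toy 1-D environment, the features, and all names are my own illustrative assumptions, not an example from the book.

```python
# A self-contained sketch of evolutionary (random hill-climbing) policy search.
# The toy environment and the linear policy are illustrative assumptions.
import numpy as np


def run_episode(theta: np.ndarray, horizon: int = 50) -> float:
    """Roll out one episode in a toy 1-D environment and return the total reward.

    State: the agent's position; the goal is at 0. Action: a real-valued step.
    Reward: negative distance to the goal at each time step.
    """
    position = 5.0                                       # fixed starting state
    total_reward = 0.0
    for _ in range(horizon):
        features = np.array([position, 1.0])             # hand-crafted features
        action = float(theta @ features)                  # linear policy
        position += float(np.clip(action, -1.0, 1.0))     # environment dynamics
        total_reward += -abs(position)                    # reward signal
    return total_reward


def evolve_policy(generations: int = 200, noise: float = 0.1) -> np.ndarray:
    """Hill-climb in parameter space: keep a perturbed policy if its return improves."""
    rng = np.random.default_rng(0)
    theta = rng.normal(size=2)
    best_return = run_episode(theta)
    for _ in range(generations):
        candidate = theta + noise * rng.normal(size=2)
        candidate_return = run_episode(candidate)
        if candidate_return > best_return:
            theta, best_return = candidate, candidate_return
    return theta


if __name__ == "__main__":
    theta = evolve_policy()
    print("learned parameters:", theta, "return:", run_episode(theta))
```

A neural approximation would replace the linear policy with a network and update its weights within episodes; here the policy only changes between episodes, which is what distinguishes the evolutionary approach above.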

A model of the environment is used to perform planning, i.e. it allows the agent to learn with foresight. This leads to the common trade-off between exploration and exploitation: when do we explore the environment and when do we exploit what we have already learned? In code, we could set this to be a hyperparameter, \(\epsilon\) (a short code sketch of this rule follows the list below). Consider a random variable \(X\) uniformly distributed between 0 and 1. Drawing \(x \sim X\), we explore the environment when \(x \le \epsilon\) and exploit it when \(x > \epsilon\), so \(\epsilon\) is the probability of exploration. Tuning this hyperparameter can lead to several emergent behaviors, such as:

  • Setting \(\epsilon\) to be high forces the model to explore more and learn new strategies, which could confuse a sub-optimal player in a game.
  • Setting \(\epsilon\) to be low could reduce the variance that troubles RL and reduce overall training time.
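Here is a minimal sketch of the explore/exploit coin flip described above, assuming estimated action values are available as a NumPy array; the `q_values` name and the tabular setup are my own illustrative assumptions.

```python
# A minimal sketch of the epsilon-greedy rule described above.
import numpy as np


def epsilon_greedy(q_values: np.ndarray, epsilon: float, rng: np.random.Generator) -> int:
    """Explore with probability epsilon, otherwise exploit the greedy action."""
    if rng.uniform(0.0, 1.0) <= epsilon:            # x ~ Uniform(0, 1), x <= epsilon -> explore
        return int(rng.integers(len(q_values)))     # random action
    return int(np.argmax(q_values))                 # greedy action


# Example: with epsilon = 0.1 the agent explores roughly 10% of the time.
rng = np.random.default_rng(0)
action = epsilon_greedy(np.array([0.2, 0.5, 0.1]), epsilon=0.1, rng=rng)
```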

In the next post, I will probably explore an example of an RL problem, maybe something fun like a game.
