Gradient Bandits, Energy Based Models and Contrastive Learning

Dhruv Srikanth
January 8th, 2023

Abstract

Gradient bandits are a class of algorithms that use gradient information to solve the multi-armed bandit problem. In this post, I will explore the connection between gradient bandits and energy-based models, as well as the connection between gradient bandits and contrastive learning.

Reference: Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto

While working through Rich Sutton and Andrew Barto's book, Reinforcement Learning: An Introduction, specifically Section 2.7 on gradient bandits, I came across the update rule for action preferences in the gradient bandit algorithm. This update rule, which resembles the update rule in stochastic gradient descent, was reminiscent of models that can be analyzed under the energy-based framework.
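For reference, the preference update in question (in the book's notation, where H_t(a) is the preference for action a, pi_t is the softmax policy over preferences, alpha is a step size, R_t is the reward received at step t, and R-bar_t is the average of rewards so far, used as a baseline) is:

    H_{t+1}(a) = H_t(a) + \alpha \, (R_t - \bar{R}_t) \, \big( \mathbb{1}[a = A_t] - \pi_t(a) \big),
    \qquad
    \pi_t(a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{n} e^{H_t(b)}}

The softmax over preferences is the same Boltzmann (Gibbs) form that appears in energy-based models, which is what makes the comparison natural.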

The n-armed bandit, or multi-armed bandit, problem is a classic problem in reinforcement learning. Simply put, at each time step the agent is presented with a set of actions (n or otherwise). Upon choosing an action, the agent observes the true reward. The agent must choose actions that maximize its expected long-term reward, i.e. the return after several steps. An example of this is a slot machine. Given a slot machine with one lever (one arm), the agent can pull the lever and observe the reward, and it can replay the game several times. This is the one-armed bandit problem. Now consider a slot machine with n levers. At each time step, the agent must choose which of the n levers to pull, so as to maximize its winnings at the end of the night (or whenever the agent gets kicked out of the casino).
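To make the setup concrete, here is a minimal Python sketch of a stationary n-armed bandit testbed. The class name NArmedBandit and its interface are illustrative, not from the book: each arm has a fixed true value, and each pull returns that value plus Gaussian noise.

import numpy as np

class NArmedBandit:
    """Stationary n-armed bandit with Gaussian rewards (illustrative sketch)."""

    def __init__(self, n_arms=10, seed=None):
        self.n_arms = n_arms
        self.rng = np.random.default_rng(seed)
        # True value q*(a) of each arm, hidden from the agent.
        self.true_values = self.rng.normal(loc=0.0, scale=1.0, size=n_arms)

    def pull(self, arm):
        # Observed reward: the arm's true value plus unit-variance Gaussian noise.
        return float(self.rng.normal(loc=self.true_values[arm], scale=1.0))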

One set of solutions to the multi-armed bandit problem explores methods for learning action values, i.e. the expected reward for each action, and choosing an action (or lever) at each time step based on these estimates. Another set explores methods for learning numerical preferences for each action, converting them into probabilities, and making a choice based on these probabilities. Gradient bandits fall under the latter category; a sketch of this preference-based approach follows below.
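Here is a minimal sketch of the preference-based approach described in the book, assuming the NArmedBandit environment sketched above (the incremental running-average baseline is one reasonable choice, not the only one). The agent keeps a numerical preference H(a) per arm, turns preferences into action probabilities with a softmax, and nudges preferences up or down depending on whether the observed reward beats the baseline.

import numpy as np

def softmax(h):
    # Subtract the max before exponentiating for numerical stability.
    z = np.exp(h - np.max(h))
    return z / np.sum(z)

def run_gradient_bandit(env, n_steps=1000, alpha=0.1, seed=None):
    rng = np.random.default_rng(seed)
    preferences = np.zeros(env.n_arms)   # H(a): one numerical preference per arm
    baseline = 0.0                       # running average reward, used as the baseline
    for t in range(1, n_steps + 1):
        pi = softmax(preferences)              # action probabilities pi_t(a)
        action = rng.choice(env.n_arms, p=pi)  # sample an arm from the policy
        reward = env.pull(action)
        baseline += (reward - baseline) / t    # incremental average of observed rewards
        # Preference update: raise H for the chosen arm when the reward beats the
        # baseline (lower it otherwise), with the opposite shift spread over the
        # unchosen arms in proportion to their probabilities.
        one_hot = np.zeros(env.n_arms)
        one_hot[action] = 1.0
        preferences += alpha * (reward - baseline) * (one_hot - pi)
    return preferences, softmax(preferences)

For example, running preferences, policy = run_gradient_bandit(NArmedBandit(n_arms=10, seed=0), n_steps=2000) should concentrate most of the policy's probability mass on the arm with the highest true value.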
