An intro to Advantage Actor-Critic methods: let's play Sonic the Hedgehog!

In recent years, reinforcement learning (RL) has emerged as a powerful framework for training artificial agents to master complex tasks, from playing chess and Go to controlling robots and self-driving cars. At the core of RL is the idea that an agent can learn optimal behavior through trial-and-error interactions with its environment, guided by a reward signal.

Two main approaches have been widely studied:

  1. Value-based methods, like Q-learning and Deep Q-Networks (DQN), which learn a value function that estimates the expected return from each action in each state
  2. Policy-based methods, like REINFORCE with policy gradients, which directly optimize a policy that maps states to action probabilities, guided by the returns obtained

While both approaches have achieved remarkable successes, they also have significant limitations. Value-based methods can handle discrete action spaces well but struggle with continuous or high-dimensional actions. Policy gradients are highly flexible but suffer from high variance and often require prohibitively large sample sizes.

Intriguingly, a hybrid family of algorithms called actor-critic methods promises to combine the strengths of value-based and policy-based RL while mitigating their weaknesses. Actor-critic architectures lean on two key ideas:

  1. An actor network that outputs action probabilities (the policy) given the current state
  2. A critic network that estimates the expected return (value) of states or state-action pairs

The actor proposes actions and the critic evaluates them, providing a training signal to guide the actor towards choosing better actions over time. Metaphorically, the actor is like a person trying to master a skill, while the critic is like a coach offering feedback and suggesting improvements.

[Figure: Actor-critic diagram]

A key ingredient of advantage actor-critic methods is the advantage function. The advantage of an action in a given state is the difference between the expected return of taking that action and the expected return of the state itself, averaged over the actions the policy would choose. Mathematically:

A(s,a) = Q(s,a) - V(s)

where Q(s,a) is the expected return of taking action a in state s, and V(s) is the expected return from state s, averaged over the actions the policy would choose.

Intuitively, the advantage function quantifies how much better or worse a particular action is compared to the "average" action in that state. A positive advantage suggests the action is better than average and should be taken more often, while a negative advantage suggests the opposite.

By using the advantage in place of raw returns to guide policy updates, actor-critic methods can substantially reduce the variance of gradient estimates, leading to faster and more stable learning. Recently, this advantage actor-critic approach has been extended in two primary ways:

  1. Asynchronous Advantage Actor-Critic (A3C) runs multiple agents in parallel, each interacting with its own copy of the environment. Each agent computes gradients locally and sends them asynchronously to a shared global network, which applies them as they arrive, without waiting for the other agents. This stabilizes learning and exploits modern multi-core architectures.

  2. Synchronous Advantage Actor-Critic (A2C) simplifies A3C by batching the experience from all agents and synchronously updating the network using the collected experiences. While less scalable than A3C, in practice A2C actually tends to outperform A3C as the updates are less noisy.

To make these ideas concrete, let's explore how to implement an A2C agent to play Sonic the Hedgehog, the classic Sega Genesis platformer. Sonic was featured in OpenAI's Retro Contest, and we'll use the same underlying Gym Retro library to interface with the game engine.
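
Before writing any agent code, it helps to see what Gym Retro actually exposes. Here is a minimal sketch, assuming gym-retro is installed and the Sonic ROM has been imported with the retro.import tool; the level name is just an example:

import retro

env = retro.make(game='SonicTheHedgehog-Genesis', state='GreenHillZone.Act1')

obs = env.reset()
print(obs.shape)         # raw RGB screen frame from the emulator
print(env.action_space)  # MultiBinary(12): one bit per Genesis controller button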

Our A2C agent will use two neural networks: a policy network (actor) that maps game frames to action probabilities, and a value network (critic) that estimates the expected future return from each state. We can implement both using convolutional layers to process the pixel inputs, followed by dense layers to produce the action logits and value estimates.

import retro
import numpy as np
import tensorflow as tf

class Actor(tf.keras.Model):
  """Policy network: maps a batch of game frames to action logits."""
  def __init__(self, action_dim):
    super().__init__()
    self.conv1 = tf.keras.layers.Conv2D(32, 8, 4, activation='relu')
    self.conv2 = tf.keras.layers.Conv2D(64, 4, 2, activation='relu')
    self.conv3 = tf.keras.layers.Conv2D(64, 3, 1, activation='relu')
    self.flatten = tf.keras.layers.Flatten()
    self.dense1 = tf.keras.layers.Dense(512, activation='relu')
    self.policy = tf.keras.layers.Dense(action_dim)  # one logit per action

  def call(self, x):
    x = self.conv1(x)
    x = self.conv2(x)
    x = self.conv3(x)
    x = self.flatten(x)
    x = self.dense1(x)
    return self.policy(x)

class Critic(tf.keras.Model):
  """Value network: maps a batch of game frames to scalar state-value estimates."""
  def __init__(self):
    super().__init__()
    self.conv1 = tf.keras.layers.Conv2D(32, 8, 4, activation='relu')
    self.conv2 = tf.keras.layers.Conv2D(64, 4, 2, activation='relu')
    self.conv3 = tf.keras.layers.Conv2D(64, 3, 1, activation='relu')
    self.flatten = tf.keras.layers.Flatten()
    self.dense1 = tf.keras.layers.Dense(512, activation='relu')
    self.value = tf.keras.layers.Dense(1)  # single value estimate per state

  def call(self, x):
    x = self.conv1(x)
    x = self.conv2(x)
    x = self.conv3(x)
    x = self.flatten(x)
    x = self.dense1(x)
    return self.value(x)
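
As a quick sanity check (not part of the training code itself), we can push a dummy batch of frames through both networks. The 84x84x4 input shape and the 7-action set below are illustrative assumptions; raw Sonic frames are larger RGB images that would normally be resized and stacked first:

actor = Actor(action_dim=7)   # 7 is a placeholder for a discretized action set
critic = Critic()

dummy_frames = tf.random.uniform((2, 84, 84, 4))  # batch of 2 preprocessed frames
print(actor(dummy_frames).shape)   # (2, 7) -> one logit per action
print(critic(dummy_frames).shape)  # (2, 1) -> one value estimate per state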

To train the agent, we'll use multiple worker threads, each running its own copy of the environment and collecting experience in parallel. Here's the basic logic for each worker:

def run_worker(worker_id, master_net, output_queue):
  # Each worker keeps local copies of the actor and critic. In a complete
  # implementation it would periodically copy the latest weights from
  # master_net into these local networks; that sync is omitted here for brevity.
  actor = Actor(action_dim)   # action_dim and traj_length are assumed to be
  critic = Critic()           # globally defined hyperparameters

  # Assumes the environment is wrapped so that actions are integers in
  # [0, action_dim); Gym Retro's raw action space is MultiBinary(12).
  env = retro.make(game='SonicTheHedgehog-Genesis')
  state = env.reset()

  while True:
    trajectory = []
    score = 0

    for _ in range(traj_length):
      # Scale pixels to [0, 1] and add a batch dimension
      state = tf.convert_to_tensor(state, dtype=tf.float32) / 255.0
      state = tf.expand_dims(state, 0)

      # Sample an action from the current policy
      action_logits = actor(state)
      action = tf.random.categorical(action_logits, 1)  # shape [1, 1]

      next_state, reward, done, _ = env.step(int(action[0, 0]))
      score += reward
      trajectory.append([state, action, reward])
      state = next_state

      if done:
        state = env.reset()
        break

    # Compute values, discounted returns, and advantage estimates
    states = tf.concat([step[0] for step in trajectory], axis=0)
    actions = tf.concat([step[1] for step in trajectory], axis=0)
    rewards = [step[2] for step in trajectory]

    values = tf.squeeze(critic(states), axis=-1)
    returns = calculate_returns(rewards)
    advantage = returns - values

    output_queue.put([states, actions, returns, advantage, score])
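
The worker relies on a calculate_returns helper that the snippet above leaves undefined. A minimal sketch that computes discounted returns (the discount factor of 0.99 is a typical choice, not something fixed by this post) might look like this:

def calculate_returns(rewards, gamma=0.99):
  # Work backwards through the rewards: G_t = r_t + gamma * G_{t+1}
  returns = []
  running_return = 0.0
  for r in reversed(rewards):
    running_return = r + gamma * running_return
    returns.append(running_return)
  returns.reverse()
  return tf.convert_to_tensor(returns, dtype=tf.float32)

In a more complete implementation, a trajectory that is cut off before the episode ends would bootstrap the final return from the critic's value estimate of the last state; that detail is omitted here.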

The workers execute in parallel, periodically sending their experience to a master process, which batches it and performs a synchronized update of the central actor and critic networks:

master_actor = Actor(action_dim)
master_critic = Critic()

# Separate optimizers for the actor and the critic
actor_optimizer = tf.keras.optimizers.Adam()
critic_optimizer = tf.keras.optimizers.Adam()

# num_workers worker threads are assumed to have been started already,
# all pushing their results onto the shared output_queue.
while True:
  # Collect one trajectory from every worker
  trajectories = []
  scores = []

  for _ in range(num_workers):
    traj = output_queue.get()
    trajectories.append(traj[:-1])  # [states, actions, returns, advantage]
    scores.append(traj[-1])         # episode scores, useful for logging

  # Batch the data from all trajectories
  states = tf.concat([traj[0] for traj in trajectories], axis=0)
  actions = tf.concat([traj[1] for traj in trajectories], axis=0)
  returns = tf.concat([traj[2] for traj in trajectories], axis=0)
  advantage = tf.concat([traj[3] for traj in trajectories], axis=0)

  # Update both networks; a persistent tape lets us take two separate gradients
  with tf.GradientTape(persistent=True) as tape:
    action_logits = master_actor(states)
    log_probs = compute_log_probs(action_logits, actions)
    values = tf.squeeze(master_critic(states), axis=-1)

    actor_loss = -tf.reduce_mean(log_probs * advantage)
    critic_loss = tf.reduce_mean(tf.square(returns - values))

  actor_grads = tape.gradient(actor_loss, master_actor.trainable_variables)
  critic_grads = tape.gradient(critic_loss, master_critic.trainable_variables)
  del tape

  actor_optimizer.apply_gradients(zip(actor_grads, master_actor.trainable_variables))
  critic_optimizer.apply_gradients(zip(critic_grads, master_critic.trainable_variables))
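
The update step also calls a compute_log_probs helper that is not defined in the post. One reasonable sketch, assuming actions holds the integer action indices sampled by the workers (shape [N, 1]):

def compute_log_probs(action_logits, actions):
  # Log-probability of each taken action under the current policy logits
  log_probs = tf.nn.log_softmax(action_logits)            # [N, action_dim]
  actions = tf.reshape(tf.cast(actions, tf.int32), [-1])  # [N]
  indices = tf.stack([tf.range(tf.shape(actions)[0]), actions], axis=1)
  return tf.gather_nd(log_probs, indices)                 # [N]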

By running this training loop for 10 or more hours on a modern GPU, we can obtain an agent that plays Sonic at a reasonably human level. Below are examples of the agent's performance after 10 minutes vs. 10 hours of training:

[Video: the Sonic agent after 10 minutes vs. 10 hours of training]

As we can see, while the 10-minute agent mostly just runs forward and dies quickly, the 10-hour agent has learned to avoid obstacles, collect rings, and make significant progress in each level before dying. Further training would likely yield even better results.

Of course, A2C is not without limitations. Most notably, it still suffers from high sample complexity, requiring millions of frames of experience to learn even simple behaviors. It also tends to be brittle with respect to hyperparameter settings and can fail to discover effective strategies in sparse reward environments.

Some of these issues can be mitigated by more advanced algorithms like Proximal Policy Optimization (PPO), which uses a slightly different objective function and training procedure to improve efficiency and robustness. In a follow-up post, we'll explore how to implement PPO to play later Sonic games like Sonic 2 and 3.

Nevertheless, A2C remains a highly instructive algorithm to study, as it cleanly illustrates several key ideas that underpin many state-of-the-art RL systems. By understanding the core concepts of actor-critic architectures, advantage estimation, and parallel trajectory collection, you'll be well-equipped to dive into the latest and greatest methods.

I encourage you to try implementing A2C yourself, either for Sonic or another favorite game. Experiment with different network architectures, hyperparameters, and reward functions, and see how the agent's behavior evolves over time. I think you'll find that watching an AI system gradually master a complex task is an immensely rewarding (no pun intended) experience.

In conclusion, actor-critic methods represent an important class of RL algorithms that combine the benefits of value-based and policy-based approaches. By learning a critic to estimate advantages and guide policy updates, actor-critic agents achieve better sample efficiency and stability than pure policy-gradient methods. A2C provides a simple yet effective synchronous paradigm for training actor-critic agents in parallel.

While we've only scratched the surface of this fascinating field, I hope this post has piqued your interest and given you a solid foundation for further exploration. The journey to building intelligent agents that can learn and adapt in complex environments is still in its early stages, but the pace of progress is accelerating rapidly. As RL algorithms continue to evolve and scale, I believe they will play an increasingly pivotal role in unlocking the potential of AI to transform our world for the better.
