An Introduction to Deep Q-Learning: Let's Play Doom

Deep reinforcement learning (RL) has revolutionized the field of AI in recent years, enabling agents to master complex tasks from autonomous driving to robotic manipulation to strategic gameplay. One of the pivotal developments in deep RL was the introduction of the Deep Q-Network (DQN) algorithm in 2015 [1]. DQN ignited a wave of research into end-to-end RL algorithms that could learn control policies directly from high-dimensional sensory inputs like images.

In this article, we'll dive into the technical details of DQN and walk through a step-by-step implementation for the classic first-person shooter game Doom. We'll cover the key components, including Q-learning, experience replay, and the use of deep convolutional neural networks (ConvNets) for approximating Q-values. Along the way, we'll discuss important extensions to the core algorithm and share tips and best practices for stable and efficient training. Finally, we'll see how to visualize the learned policies and gain insights into what the agent is actually learning.

Whether you're an RL researcher, a deep learning practitioner, or a curious coder, this article will equip you with a solid foundation in one of the most influential algorithms in modern AI. Let's get started!

From Q-Learning to Deep Q-Networks

At its core, DQN is an extension of the classic Q-learning algorithm to high-dimensional state spaces. In Q-learning, the goal is to learn an optimal action-value function Q(s,a) that predicts the expected cumulative reward of taking action a in state s and following the optimal policy thereafter. This is captured by the Bellman optimality equation:

Q(s,a) = E[r + γ max_a' Q(s',a')]

where r is the immediate reward, γ is a discount factor between 0 and 1, and s' is the next state.

Given a transition (s,a,r,s'), we can iteratively update Q(s,a) via the temporal difference (TD) learning rule:

Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)]

where α is the learning rate.

In tabular Q-learning, we represent Q(s,a) as a lookup table and update each entry according to the TD rule. However, this approach quickly becomes infeasible for large state spaces like images. This is where neural networks come into play.
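
To make the tabular case concrete, here is a minimal sketch of the update loop for a hypothetical small, gym-style environment `env` with `num_states` discrete states and `num_actions` actions (the values of α, γ, and the exploration rate are illustrative):

import numpy as np

Q = np.zeros((num_states, num_actions))  # the Q-table: one entry per (state, action) pair
alpha, gamma, epsilon = 0.1, 0.99, 0.1   # illustrative hyperparameters

for episode in range(1000):
    s = env.reset()
    done = False
    while not done:
        # Epsilon-greedy exploration
        a = np.random.randint(num_actions) if np.random.rand() < epsilon else np.argmax(Q[s])
        s_next, r, done, _ = env.step(a)
        # TD update toward the bootstrapped target r + gamma * max_a' Q(s', a')
        target = r + gamma * (0.0 if done else np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next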

The key idea behind DQN is to represent Q(s,a) using a deep ConvNet with parameters ω: Q(s,a; ω). Instead of updating individual Q-values, we train the network to minimize the TD error across a batch of transitions:

L(ω) = E[(r + γ max_a' Q(s',a'; ω⁻) − Q(s,a; ω))^2]

where ω⁻ are the parameters of a periodically updated copy of the network (the target network) that keeps the regression targets stable. By sampling transitions from an experience replay buffer and performing gradient descent on this loss, we learn an approximate Q-function that generalizes across states.
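
As a sketch of what one gradient step on this loss might look like (assuming the Keras `model` and `target_model` defined in the next section, and NumPy batch arrays `states`, `actions`, `rewards`, `next_states`, `dones` sampled from the replay buffer; the full training loop below takes the simpler `model.fit` route instead):

import numpy as np
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

def dqn_update(states, actions, rewards, next_states, dones, gamma=0.99):
    # TD target: r + gamma * max_a' Q(s', a'; target weights), with no bootstrap at terminal states
    next_q = target_model.predict(next_states, verbose=0)
    targets = (rewards + (1.0 - dones) * gamma * np.max(next_q, axis=1)).astype(np.float32)

    with tf.GradientTape() as tape:
        q_values = model(states)
        # Pick out Q(s, a) for the actions actually taken
        q_taken = tf.reduce_sum(q_values * tf.one_hot(actions, q_values.shape[-1]), axis=1)
        loss = tf.reduce_mean(tf.square(targets - q_taken))

    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return float(loss)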

DQN in Practice: Playing Doom

To make these ideas concrete, let's see how to use DQN to train an agent to play Doom. We'll use the ViZDoom environment, which provides a simple API for interacting with the game engine [2]. Here are the key steps:

  1. Preprocess frames: We first convert each RGB frame to grayscale, crop away the HUD and screen borders, resize the result to 84×84, and normalize pixel values to [0, 1]. We stack the 4 most recent frames to provide temporal context (the stacking itself happens in the training loop below).
import cv2
import numpy as np

def preprocess(frame):
    frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)  # convert RGB frame to grayscale
    frame = frame[30:-10, 30:-30]                    # crop away the HUD and borders
    frame = cv2.resize(frame, (84, 84))              # downsample to 84x84
    return frame.astype(np.float32) / 255.0          # normalize pixel values to [0, 1]
  2. Define Q-network: We use a simple ConvNet with 3 convolutional layers, 1 fully connected layer, and a final linear output layer with one unit per action. The network takes a stack of 4 84×84 frames as input.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Flatten, Dense
from tensorflow.keras.optimizers import Adam

def create_model(num_actions):
    model = Sequential()
    model.add(Conv2D(32, 8, strides=4, activation='relu',
                     input_shape=(84, 84, 4)))
    model.add(Conv2D(64, 4, strides=2, activation='relu'))
    model.add(Conv2D(64, 3, strides=1, activation='relu'))
    model.add(Flatten())
    model.add(Dense(512, activation='relu'))
    model.add(Dense(num_actions, activation='linear'))
    model.compile(optimizer=Adam(learning_rate=1e-4), loss='mse')
    return model
  3. Initialize replay memory: We use a double-ended queue (deque) that holds the 1M most recent transitions.
from collections import deque

replay_memory = deque(maxlen=1000000)
  4. Training loop: At each step, we select an action with epsilon-greedy exploration, execute it in the environment, and store the resulting transition in replay memory. Every 4 steps, we sample a minibatch of transitions and perform a gradient descent step on the TD loss, and we periodically sync the parameters of a target Q-network. (The environment wrapper, network instances, and hyperparameters assumed here are collected in the setup sketch right after this code.)
import random

step = 0
for episode in range(num_episodes):
    # Start each episode with a stack of 4 copies of the first frame
    frame = preprocess(env.reset())
    state = np.stack([frame] * 4, axis=-1)
    done = False
    while not done:
        # Select epsilon-greedy action
        if np.random.rand() < epsilon:
            action = np.random.randint(num_actions)
        else:
            Q_values = model.predict(state[np.newaxis], verbose=0)
            action = np.argmax(Q_values[0])

        # Execute action, push the new frame onto the stack, and store the transition
        next_frame, reward, done, _ = env.step(action)
        next_frame = preprocess(next_frame)
        next_state = np.concatenate([state[..., 1:], next_frame[..., np.newaxis]], axis=-1)
        replay_memory.append((state, action, reward, next_state, done))
        state = next_state

        # Every 4 steps, train on a minibatch sampled from replay memory
        if step % 4 == 0 and len(replay_memory) > batch_size:
            minibatch = random.sample(replay_memory, batch_size)
            states, actions, rewards, next_states, dones = map(np.array, zip(*minibatch))

            # TD targets from the target network; terminal transitions bootstrap nothing
            max_next_Q = np.max(target_model.predict(next_states, verbose=0), axis=1)
            targets = rewards + (1 - dones) * gamma * max_next_Q

            # Only the Q-value of the action actually taken is regressed toward the target
            Q_values = model.predict(states, verbose=0)
            Q_values[np.arange(batch_size), actions] = targets

            model.fit(states, Q_values, verbose=0)

        # Periodically sync the target network with the online network
        if step % target_update_freq == 0:
            target_model.set_weights(model.get_weights())

        step += 1
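
The loop above assumes a few pieces of setup that were not shown explicitly: the environment wrapper, the online and target networks, and the hyperparameters. Here is a minimal sketch; `DoomEnv` is a stand-in for whatever gym-style wrapper you build around ViZDoom, and the specific values are illustrative:

env = DoomEnv()                        # hypothetical gym-style wrapper around ViZDoom
num_actions = env.action_space.n

model = create_model(num_actions)            # online Q-network
target_model = create_model(num_actions)     # target Q-network
target_model.set_weights(model.get_weights())

num_episodes = 10000          # illustrative hyperparameters
batch_size = 32
gamma = 0.99
epsilon = 1.0                 # annealed toward 0.01 during training (see the tips below)
target_update_freq = 10000    # environment steps between target-network syncs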

Through this procedure, the agent gradually learns a policy that lets it clear levels and defeat enemies effectively.

Results and Improvements

The original DQN paper demonstrated performance on a suite of 49 Atari games comparable to that of a professional human games tester, matching or exceeding human-level scores on many of them [1]. Subsequent work introduced a number of enhancements to DQN that improved sample efficiency and stability:

  • Double DQN uses the online Q-network to select the best action but the target network to estimate its value [3]; a short code sketch of this target appears after this list. This reduces the overestimation bias caused by the max operator in standard Q-learning. DDQN achieved a median normalized score of 172% on Atari.

  • Prioritized experience replay samples transitions with probability proportional to their TD error [4]. This focuses training on the most surprising transitions and speeds up learning. Schaul et al. reported 2-6x gains in sample efficiency on Atari.

  • Dueling networks have separate streams for estimating state value and action advantages [5]. This form of Q-value decomposition improves generalization and leads to faster convergence. Dueling DQN attained a median normalized score of 228% on Atari.

  • Multi-step learning uses n-step returns in the TD loss, trading off bias and variance [6]. This helps propagate rewards more quickly and stabilizes learning.
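
To make the first of these concrete, here is a sketch of how the Double DQN target would replace the standard target inside the minibatch update above (reusing the `model`, `target_model`, and NumPy batch arrays from the training loop):

# Standard DQN target: the target network both selects and evaluates the next action
max_next_Q = np.max(target_model.predict(next_states, verbose=0), axis=1)

# Double DQN target: the online network selects the action,
# while the target network evaluates it
best_actions = np.argmax(model.predict(next_states, verbose=0), axis=1)
next_Q = target_model.predict(next_states, verbose=0)
max_next_Q = next_Q[np.arange(batch_size), best_actions]

targets = rewards + (1 - dones) * gamma * max_next_Q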

Many of these advances are combined in the Rainbow algorithm [7], which substantially outperforms each individual enhancement on the Atari benchmark and exhibits significantly improved sample efficiency compared to vanilla DQN.

Here are some tips for reproducing these results in your own DQN implementation (a small configuration sketch follows the list):

  • Use Adam optimizer with a learning rate around 0.0001.
  • Anneal epsilon from 1.0 to 0.01 over the first million steps.
  • Train for at least 50M steps (200M for the hardest games).
  • Clip gradients to have L2 norm at most 10.
  • Use double Q-learning and 3-step returns if possible.
  • Prioritized replay is a big win but can be tricky to implement efficiently.
  • Be patient! DQN can take several days to train even on a fast GPU.
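
A few of these tips translate directly into code. The sketch below configures the optimizer with the suggested learning rate and gradient-norm clipping and defines a linear epsilon schedule; the bounds mirror the tips above and are otherwise illustrative:

from tensorflow.keras.optimizers import Adam

# Adam with a small learning rate and L2 gradient-norm clipping
model.compile(optimizer=Adam(learning_rate=1e-4, clipnorm=10.0), loss='mse')

def epsilon_schedule(step, start=1.0, end=0.01, anneal_steps=1_000_000):
    # Linearly anneal epsilon from `start` to `end` over the first `anneal_steps` steps
    fraction = min(step / anneal_steps, 1.0)
    return start + fraction * (end - start)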

Visualizing Learned Policies

To better understand what DQN is learning, it's useful to visualize the features and policies learned by the Q-network. One approach is to compute t-SNE embeddings of the final ConvNet layer activations and color-code them by predicted action [8]. This reveals clusters corresponding to distinct states that elicit similar actions.
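
As a sketch of how this could look with the Keras model defined earlier and scikit-learn's t-SNE implementation, here using the activations of the last hidden (512-unit) layer as features (`sample_states`, a batch of stacked frames collected from gameplay, is assumed):

import numpy as np
from sklearn.manifold import TSNE
from tensorflow.keras.models import Model

# Sub-model exposing the last hidden (512-unit) layer activations
feature_model = Model(inputs=model.input, outputs=model.layers[-2].output)

features = feature_model.predict(sample_states, verbose=0)                 # shape (N, 512)
greedy_actions = np.argmax(model.predict(sample_states, verbose=0), axis=1)

# 2-D embedding of the activations, to be scatter-plotted and color-coded by greedy action
embedding = TSNE(n_components=2, perplexity=30).fit_transform(features)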

We can also generate saliency maps showing which pixels in the input image the Q-network attends to by computing the Jacobian of the predicted Q-values with respect to the input [9]. Brighter regions correspond to pixels that have a greater influence on the predicted action values.
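
A minimal sketch of such a saliency map using TensorFlow's GradientTape, assuming a single stacked state of shape (84, 84, 4):

import tensorflow as tf

def saliency_map(model, state):
    # Gradient of the largest predicted Q-value with respect to the input pixels
    x = tf.convert_to_tensor(state[None], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        top_q = tf.reduce_max(model(x), axis=1)
    grads = tape.gradient(top_q, x)                    # shape (1, 84, 84, 4)
    return tf.reduce_max(tf.abs(grads), axis=-1)[0]    # collapse the 4-frame stack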

Finally, we can perform occlusion experiments where we measure the drop in predicted Q-value as we systematically mask out different regions of the input [9]. This highlights the key objects and features the agent relies on to make decisions.
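
The occlusion experiment can be sketched in a few lines as well: slide a gray patch over the input and record how much the maximal predicted Q-value drops (the patch size, stride, and fill value are illustrative):

def occlusion_map(model, state, patch=8, stride=4):
    # Drop in max Q-value when each region of the input is masked out
    base_q = np.max(model.predict(state[None], verbose=0))
    heatmap = np.zeros((84 // stride, 84 // stride))
    for i in range(0, 84 - patch + 1, stride):
        for j in range(0, 84 - patch + 1, stride):
            occluded = state.copy()
            occluded[i:i + patch, j:j + patch, :] = 0.5   # neutral gray patch
            q = np.max(model.predict(occluded[None], verbose=0))
            heatmap[i // stride, j // stride] = base_q - q
    return heatmap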

Together, these visualizations provide valuable insights into the strategies and representations learned by DQN agents. They reveal that the agent learns to attend to the most informative regions of the image and integrates information over time to make robust decisions.

Conclusion

In this article, we took a comprehensive look at Deep Q-Networks and their application to playing Doom. We started with the fundamentals of Q-learning, showed how to scale it to complex visual environments using ConvNets and experience replay, and walked through a complete implementation. We also discussed several important extensions that improve the speed and stability of DQN, and showed how to visualize the learned policies for greater interpretability.

Since its introduction in 2015, DQN has had an immense impact on the field of deep RL. It has enabled researchers to tackle challenging domains ranging from robotic manipulation to 3D navigation to real-time strategy games, and has inspired new architectures and algorithms for RL. By mastering the core ideas behind DQN, you'll be well-equipped to understand and contribute to the latest advances in this exciting area.

For further reading, the papers listed in the references below are an excellent place to start.

I encourage you to experiment with the [code examples](link to Github) in this article and try extending DQN to new problems. Feel free to reach out with any questions or feedback. Happy training!

References

[1] Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).

[2] Kempka, M., Wydmuch, M., Runc, G., Toczek, J. & Jaśkowski, W. ViZDoom: A Doom-based AI Research Platform for Visual Reinforcement Learning. IEEE Conference on Computational Intelligence and Games, pp. 341-348, Santorini, Greece, 2016.

[3] van Hasselt, H., Guez, A., & Silver, D. (2015). Deep Reinforcement Learning with Double Q-learning. AAAI Conference on Artificial Intelligence, 2094-2100.

[4] Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized Experience Replay. International Conference on Learning Representations (ICLR).

[5] Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., & de Freitas, N. (2016). Dueling Network Architectures for Deep Reinforcement Learning. International Conference on Machine Learning (ICML).

[6] Sutton, R. S. (1988). Learning to Predict by the Methods of Temporal Differences. Machine Learning, 3(1), 9–44.

[7] Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M. and Silver, D., 2018. Rainbow: Combining Improvements in Deep Reinforcement Learning. AAAI Conference on Artificial Intelligence.

[8] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous Methods for Deep Reinforcement Learning. International Conference on Machine Learning, 1928-1937.

[9] Zahavy, T., Ben-Zrihem, N., & Mannor, S. (2016). Graying the black box: Understanding DQNs. International Conference on Machine Learning (ICML).
