• TheRealSally

First they steal our jobs, now they steal our joy.

A recount of DeepMind’s revolutionary Atari paper.

— By Siddansh Bohra, Navigation Researcher @ Sally Robotics.

Photo by Franck V. on Unsplash


In 2012, the Arcade Learning Environment, a suite of 57 Atari 2600 games (dubbed Atari57), was proposed as a benchmark set of tasks: these canonical Atari games pose a broad range of challenges for an agent to master. Achieving superhuman performance on many of these games in 2015 was DeepMind’s rise to fame and an important milestone in the development of RL³. More recently, DeepMind’s AlphaGo beat one of the greatest human players of all time at Go, a game that is a googol times more complex than chess. DeepMind then developed AlphaStar, which beat top professional players at StarCraft II.

“Intelligent behavior arises from the actions⁶ of an individual seeking to maximize its reward⁴ signals in a complex and changing world” - Michael Littman

Let us try to understand one of the simpler papers by DeepMind, in which they describe how they achieved superhuman performance in most Atari games.


How did they do it?

Pre-processing the data

In general, pre-processing is applied to images to scale them up or down, to augment them into a larger dataset, or to give every image the same aspect ratio. Here its job is to shrink the input so that the network is cheaper to train.

The raw Atari frames, which are 210 × 160-pixel images with a 128-color palette, are reduced to a smaller input. Each raw frame is first converted from its RGB representation to grayscale and down-sampled to a 110 × 84 image. The final input representation is obtained by cropping an 84 × 84 region of the image that roughly captures the playing area. This pre-processing is applied to the last 4 frames of history, which are then stacked to produce the input to the Q-function.
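The steps above can be sketched in a few lines of NumPy. This is only an illustration, not the authors' code: the grayscale weights, the nearest-neighbour down-sampling and the crop_top offset are assumptions chosen for the sketch, since the text does not specify them.

```python
import numpy as np

def preprocess_frame(rgb_frame, crop_top=18):
    """Turn one 210x160x3 RGB frame into an 84x84 grayscale frame.

    crop_top is a hypothetical offset (0..26) that picks which 84 rows
    of the 110-row image are kept as the "playing area".
    """
    # Assumed luminance weights for the RGB -> grayscale conversion.
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = rgb_frame.astype(np.float32) @ weights          # (210, 160)

    # Nearest-neighbour down-sampling from 210x160 to 110x84.
    rows = np.arange(110) * 210 // 110
    cols = np.arange(84) * 160 // 84
    small = gray[rows][:, cols]                            # (110, 84)

    # Crop an 84x84 window out of the 110x84 image.
    return small[crop_top:crop_top + 84, :]                # (84, 84)

def stack_frames(frames):
    """Stack the last 4 pre-processed frames into an 84x84x4 input."""
    return np.stack(frames[-4:], axis=-1)
```

Feeding it five all-white frames, for example, yields a single 84 × 84 × 4 array built from the four most recent ones.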

Making the Model

The input to the neural network is an 84 × 84 × 4 image. The first hidden layer convolves 16 filters of size 8 × 8 with stride 4 over the input image and applies a rectifier nonlinearity. The second hidden layer convolves 32 filters of size 4 × 4 with stride 2, again followed by a rectifier nonlinearity. The final hidden layer is fully connected and consists of 256 rectifier units. The output layer is a fully connected linear layer with a single output for each valid action. The number of valid actions varied between 4 and 18 across the games.

Fig. 1: Network architecture. The input to the neural network consists of 84 × 84 × 4 images produced by the pre-processing map.
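To get a feel for the size of this network, the layer shapes can be worked out by hand. The sketch below assumes unpadded ("valid") convolutions and one bias per filter or unit; neither detail is stated in the text, so treat the parameter counts as estimates.

```python
def conv_output_size(size, kernel, stride):
    """Spatial size after an unpadded convolution (assumption: no padding)."""
    return (size - kernel) // stride + 1

def describe_network(num_actions, in_size=84, in_channels=4):
    """Return (layer name, output shape, parameter count) for each layer."""
    layers = []

    # First hidden layer: 16 filters of 8x8, stride 4.
    s1 = conv_output_size(in_size, 8, 4)
    layers.append(("conv1", (s1, s1, 16), (8 * 8 * in_channels + 1) * 16))

    # Second hidden layer: 32 filters of 4x4, stride 2.
    s2 = conv_output_size(s1, 4, 2)
    layers.append(("conv2", (s2, s2, 32), (4 * 4 * 16 + 1) * 32))

    # Fully connected hidden layer with 256 rectifier units.
    flat = s2 * s2 * 32
    layers.append(("fc", (256,), (flat + 1) * 256))

    # Linear output layer: one Q-value per valid action.
    layers.append(("output", (num_actions,), (256 + 1) * num_actions))
    return layers

for name, shape, params in describe_network(num_actions=4):
    print(f"{name:7s} output={shape} params={params}")
```

Under these assumptions the feature maps shrink from 84 × 84 to 20 × 20 and then 9 × 9, and almost all of the weights sit in the fully connected hidden layer.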

How does it work though?

The goal of the agent⁷ is to select actions that maximize cumulative future reward⁴. The deep convolutional neural network approximates the optimal action-value function Q*(s,a): the maximum sum of discounted rewards, from time-step t onward, achievable by a behavior policy⁸ after observing the current state⁵ s and taking an action a. Because the scale of scores varies greatly from game to game, all positive rewards are clipped to 1 and all negative rewards to −1, leaving 0 rewards unchanged.
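Both ideas in this paragraph, reward clipping and the cumulative discounted reward, are easy to write down. In this minimal sketch the function names and the default discount factor gamma = 0.99 are assumptions made for illustration.

```python
import numpy as np

def clip_reward(reward):
    """Map any positive reward to 1, any negative reward to -1, keep 0."""
    return float(np.sign(reward))

def discounted_return(rewards, gamma=0.99):
    """Sum of rewards where a reward k steps ahead is weighted by gamma**k.

    gamma is an assumed value; the text does not state one.
    """
    total = 0.0
    # Work backwards so each step is one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

With gamma = 0.5, three rewards of 1 in a row are worth 1 + 0.5 + 0.25 = 1.75 from the first step, which shows how later rewards count for less.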

An approximate value function Q(s,a;θi) is parameterized using the deep convolutional neural network shown in Fig. 1, in which θi are the weights of the Q-network at iteration i. To perform experience replay⁹, the agent’s experiences et = (st,at,rt,st+1) are stored at each time-step t in a data set Dt = {e1,…,et}. During learning, Q-learning updates are applied on mini-batches of experience (s,a,r,s’) ~ U(D), drawn uniformly at random from the pool of stored samples. The Q-learning update at iteration i uses the following loss function¹:

L_i(θ_i) = E_{(s,a,r,s’) ~ U(D)} [ ( r + γ max_{a’} Q(s’,a’; θ_i^-) − Q(s,a; θ_i) )^2 ]

Here γ is the discount factor, and θi^- are the target network parameters, which are only updated with the Q-network¹² parameters θi every C steps and are held fixed in between.
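Experience replay and the loss above can be sketched as follows. This is an illustrative NumPy version, not the authors' implementation: the class and function names are made up for the example, the done flag for terminal transitions is an added assumption (the tuple in the text only holds s, a, r and s’), and gamma = 0.99 is an assumed default.

```python
import random
from collections import deque

import numpy as np

class ReplayMemory:
    """Fixed-size store of the most recent transitions."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement, i.e. (s, a, r, s') ~ U(D).
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (np.array(states), np.array(actions),
                np.array(rewards, dtype=np.float32),
                np.array(next_states), np.array(dones, dtype=np.float32))

    def __len__(self):
        return len(self.buffer)

def dqn_loss(q_values, target_next_q_values, actions, rewards, dones, gamma=0.99):
    """Mean squared TD error for one mini-batch.

    q_values:             Q(s, .; theta_i) from the Q-network, shape (batch, actions)
    target_next_q_values: Q(s', .; theta_i^-) from the target network, same shape
    dones:                1.0 where s' is terminal (assumption: no bootstrapping there)
    """
    targets = rewards + gamma * (1.0 - dones) * target_next_q_values.max(axis=1)
    predicted = q_values[np.arange(len(actions)), actions]
    return float(np.mean((targets - predicted) ** 2))
```

In a real agent the two Q arrays would come from forward passes of the Q-network and the target network, and the loss would be minimized with respect to θi only.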

The optimization algorithm used is RMSProp with mini-batches of size 32. The behavior policy during training was Ɛ-greedy¹¹, with Ɛ annealed linearly from 1 to 0.1 over the first million frames and fixed at 0.1 thereafter. The network was trained for a total of 10 million frames and used a replay memory of the one million most recent frames.
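The annealing schedule and the Ɛ-greedy choice described above can be sketched like this. The function names are hypothetical, and the random generator is passed in explicitly so the behavior is reproducible.

```python
import numpy as np

def epsilon_at(frame, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Linearly anneal epsilon from start to end, then hold it at end."""
    fraction = min(frame / anneal_frames, 1.0)
    return start + fraction * (end - start)

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2])
action = epsilon_greedy(q_values, epsilon_at(250_000), rng)
```

Halfway through the first million frames the schedule gives Ɛ = 0.55, so the agent still explores more often than it exploits.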

Checking if it works

In supervised learning, it is easy to track the performance of a model during training by evaluating it on the test and validation sets. In reinforcement learning, however, accurately evaluating the progress of an agent during training can be challenging. The evaluation metric is the total reward the agent collects in an episode or game, averaged over a number of games, and it is computed periodically during training. This average total reward tends to be very noisy because small changes to the weights of a policy can lead to large changes in the distribution of states the policy visits. Panels a and b of Fig. 2 show how the average score evolves during training on Space Invaders and Seaquest. Both curves are noisy, giving the impression that the learning algorithm is not making steady progress.

A more stable metric is the policy’s estimated action-value¹⁰ function Q, which provides an estimate of how much discounted reward the agent can obtain by following its policy from any given state. A fixed set of states is collected by running a random policy before training starts, and the average of the maximum predicted Q for these states is tracked during training (the maximum for each state is taken over the possible actions). Panels c and d of Fig. 2 show that the average predicted Q increases much more smoothly than the average total reward obtained by the agent, and plotting the same metrics on the other five games produces similarly smooth curves.

This suggests that, despite lacking any theoretical convergence guarantees, this method can train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner. The approach robustly learns successful policies over a variety of games based solely on sensory inputs, with only minimal prior knowledge.
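The average-maximum-Q metric is simple to compute once a held-out set of states exists. In this sketch q_function is a hypothetical stand-in for the trained network: any callable that maps a state to a vector of Q-values, one per action.

```python
import numpy as np

def average_max_q(q_function, held_out_states):
    """Average, over a fixed set of states, of the largest predicted Q-value."""
    # One row of Q-values per held-out state.
    q_values = np.stack([q_function(state) for state in held_out_states])
    # Max over actions for each state, then mean over states.
    return float(q_values.max(axis=1).mean())

# Toy stand-in for a network: three "actions" whose values depend on the state.
toy_q = lambda state: np.array([state, 2.0 * state, -state])
metric = average_max_q(toy_q, [1.0, 2.0, 3.0])
```

Because the held-out states never change, any movement in this number reflects changes in the network's predictions rather than changes in which states the agent happens to visit.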

Figure 2: Training curves tracking the agent’s average score and average predicted action-value. a, Each point is the average score achieved per episode after the agent is run with an Ɛ-greedy policy (Ɛ = 0.05) for 520k frames on Space Invaders. b, Average score achieved per episode for Seaquest. c, Average predicted action-value on a held-out set of states on Space Invaders. Each point on the curve is the average of the action-value Q computed over the held-out set of states. d, Average predicted action-value on Seaquest. See Supplementary Discussion for details.


This DQN method outperformed the best existing reinforcement learning methods on 43 of the games without incorporating any of the additional prior knowledge about Atari 2600 games used by other approaches. The DQN agent performed at a level that was comparable to that of a professional human game tester across the set of 49 games, achieving more than 75% of the human score on more than half of the games (29 games).

Figure 3: The performance of DQN is normalized with respect to a professional human game tester (that is, 100% level) and random play (that is, 0% level). The normalized performance of DQN, expressed as a percentage, is calculated as 100 × (DQN score − random play score) / (human score − random play score).
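The normalization in the caption is a one-line formula. The scores in the example below are made-up numbers used only to show the arithmetic, not results from the paper.

```python
def human_normalized_score(agent_score, random_score, human_score):
    """Express a score as a percentage: 0 = random play, 100 = human tester."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)

# Hypothetical scores: random play gets 10, the human tester 110, the agent 85.
print(human_normalized_score(agent_score=85, random_score=10, human_score=110))
```

An agent that matches random play scores 0%, one that matches the human tester scores 100%, and anything above 100% counts as superhuman on that game.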

Other interesting facts about this paper

  • When the DQN agent was trained on Breakout, it made no intelligent moves during the first 10 minutes of play and lost all its lives very quickly. By the 120-minute mark it was playing at roughly human level, and with further training it went on to reach superhuman performance.

  • Soon after DeepMind presented its initial results with the algorithm, Google acquired the London-based artificial intelligence company for more than $500 million, hence the name Google DeepMind.

  • The AI developed an interesting strategy to beat Breakout: it started tunneling through the wall so that the ball could hit the bricks from behind.

  • As Fig. 3 shows, DQN failed to perform well on a few games. Agent57, a later DeepMind agent, changes this and is the most general agent on Atari57 (the suite of 57 Atari games). Agent57 obtains above human-level performance on the very hardest games in the benchmark set, as well as the easiest ones.

Agent57 vs other agents on the toughest Atari games.


This paper introduced a new deep learning model for reinforcement learning and demonstrated its ability to master difficult control policies for Atari 2600 computer games, using only raw pixels as input. Human players train for thousands of hours to become professionals, but modern reinforcement learning algorithms can beat the best players with just a few hours of training. This paper laid the foundation for DeepMind to go on and beat several world champions at their games.

Important terms

  1. Loss function: A method of evaluating how well a specific algorithm models the given data. If predictions deviate too much from the actual results, the loss function returns a very large number.

  2. Deep neural net: A deep neural network (DNN) is an artificial neural network (ANN) with multiple layers between the input and output layers. The DNN finds the correct mathematical manipulation to turn the input into the output, whether it be a linear relationship or a non-linear relationship.

  3. Reinforcement Learning: Reinforcement learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize the cumulative reward.

  4. Reward: A numerical value received by the Agent from the Environment as a direct response to the Agent’s actions. Future rewards are discounted over time by a discount factor (γ), so a reward received sooner is worth more than the same reward received later, which motivates the agent to act.

The essence of Reinforcement Learning

  5. State: Every scenario the Agent encounters in the Environment is formally called a state.

  6. Action: Actions are the Agent’s methods which allow it to interact with and change its environment, and thus transfer between states. The decision of which action to choose is made by the policy.

  7. Agent: The agent is the model that we try to design. It observes states, takes actions and receives rewards.

  8. Policy: The policy, denoted as π (or sometimes π(a|s)), is a mapping from some state s to the probabilities of selecting each possible action given that state.

  9. Experience replay: Experience replay enables reinforcement learning agents to memorize and reuse past experiences, just as humans replay memories for the situation at hand.

  10. Action value: An estimate of how much discounted reward the agent can obtain by taking a given action in a given state and following its policy from then on.

  11. Ɛ-Greedy approach: In this approach, the agent explores by taking a random action with a probability of Ɛ and exploits the greedy choice with a probability of 1 − Ɛ.

  12. Q-Learning: In its most simplified form, it uses a table to store the Q-values of all possible state-action pairs. It updates this table using the Bellman equation, while action selection is usually made with an Ɛ-greedy policy.
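To make the last definition concrete, here is a minimal sketch of one tabular Q-learning update using the Bellman equation. The learning rate alpha, the discount gamma, the done flag and the two-state toy table are all assumptions chosen for the example.

```python
import numpy as np

def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.1, gamma=0.99, done=False):
    """Move Q(state, action) a step of size alpha toward the Bellman target."""
    # Value of the best action in the next state (0 if the episode ended).
    best_next = 0.0 if done else q_table[next_state].max()
    td_target = reward + gamma * best_next
    q_table[state, action] += alpha * (td_target - q_table[state, action])
    return q_table

# Toy table: 2 states x 2 actions (assumed sizes for illustration).
q_table = np.zeros((2, 2))
q_table[1] = [1.0, 3.0]
q_learning_update(q_table, state=0, action=1, reward=1.0, next_state=1,
                  alpha=0.5, gamma=0.5)
```

DQN keeps the same target but replaces the table with a neural network, because the number of possible screen states is far too large to enumerate.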





© 2020 by Sally Robotics. Made with      @ BITS Pilani
