Playing Atari with Deep RL
deep learning model that learns control policies
- directly from high-dimensional sensory input using RL
CNN trained with a variant of Q-Learning
- input: raw pixels; output: a value function estimating future rewards
experiment: seven Atari 2600 games (from the Arcade Learning Environment)
- with no adjustment to the architecture or learning algorithm
Learning to control agents from high-dimensional sensory inputs
most success has been with hand-crafted features
- combined with linear value functions or policy representations
deep learning has achieved breakthroughs utilising neural network architectures
- CNNs, MLPs, restricted Boltzmann machines, recurrent neural networks
- exploiting both supervised and unsupervised learning
challenges
RL algorithms must be able to learn from a scalar reward signal
- that is frequently sparse, noisy and delayed
the delay between actions and resulting rewards
- can be thousands of timesteps long
- compared to supervised learning, where there is a direct association of inputs and targets
deep learning algorithms assume the data samples to be independent
- in RL one typically encounters sequences of highly correlated states
in RL the data distribution changes as the algorithm learns new behaviours
- which is problematic for methods that assume a fixed underlying distribution
experiment
implement a convolutional neural network
- to learn successful control policies from raw video data
- in complex RL environments
the network is trained with a variant of the Q-Learning algorithm
- with stochastic gradient descent to update the weights
to alleviate issues with correlated data and non-stationary distributions
- use an experience replay mechanism which randomly samples previous transitions
- smoothing the training distribution over many past behaviours
applied to a range of Atari games
Atari 2600 as a testbed
- with visual input of 210x160 RGB video at 60 Hz
- and a diverse set of tasks meant to challenge human players
goal is to create a single NN agent that is able to successfully play the games
- the network is not provided any game knowledge or hand-crafted features
- nor is it privy to the internal state of the game
learning from nothing but…
- video input
- the reward and terminal signals
- the set of possible actions
network architecture and hyperparameters used for training
- are kept constant across games
Background on the formal representation for Q-Learning
Consider tasks in which an agent interacts with an environment \(\varepsilon\)
- in this case the Atari emulator, in a sequence of actions, observations, and rewards
At each time step the agent selects an action \(a_{t}\)
- from the set of legal game actions \(\mathcal{A} = \{1,...,K\}\)
The action is passed to the emulator
- and modifies its internal state and the game score
in general \(\varepsilon\) may be stochastic
the emulator's internal state is not observed by the agent
instead it observes an image \(x_{t} \in \mathbb{R}^{d}\) from the emulator
- a vector of raw pixel values representing the current screen
in addition it receives a reward \(r_{t}\) representing the change in game score
- note: in general the game score may depend on the whole prior sequence of actions and observations
- feedback about an action may only be received after thousands of time steps have elapsed
the task is partially observed and many emulator states are perceptually aliased
- i.e. it is impossible to fully understand the current situation
- from only the current screen \(x_{t}\)
we therefore consider sequences of actions and observations, \(s_{t}=x_{1},a_{1},x_{2},...,a_{t-1},x_{t}\)
- and learn game strategies that depend upon these sequences
all sequences in the emulator are assumed to terminate in a finite number of timesteps
- giving rise to a large but finite Markov Decision Process (MDP)
- in which each sequence is a distinct state
apply standard RL methods for MDPs by using…
- the complete sequence \(s_{t}\) as the state representation at time \(t\)
the goal of the agent is to interact with the emulator by selecting actions
- in a way that maximizes future rewards
we make the standard assumption
- that future rewards are discounted by a factor of \(\gamma\) per timestep
- and define the future discounted return at time \(t\) as: \(R_{t}=\sum_{t'=t}^{T}\gamma^{t'-t}r_{t'}\)
where \(T\) is the time step at which the game terminates
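As a quick sanity check of the return definition, a few lines of Python suffice; the reward sequence below is made up for illustration:

```python
# Worked example of the discounted return
# R_t = sum_{t'=t}^{T} gamma^{t'-t} r_{t'} for a short reward sequence.
GAMMA = 0.99

def discounted_return(rewards, gamma=GAMMA):
    # rewards[k] is r_{t+k}; the game terminates after the last entry
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0, 0.0, 1.0]       # illustrative values
print(discounted_return(rewards))          # 0.99**2 + 0.99**4 ≈ 1.9407
```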
we define the optimal action-value function \(Q^{*}(s,a)\) as the maximum expected return achievable by following any strategy
- after seeing some sequence \(s\) and then taking some action \(a\):
- \(Q^{*}(s,a) = \max_{\pi}\mathbb{E}[R_{t}\,|\,s_{t} = s, a_{t}=a, \pi]\)
where \(\pi\) is a policy mapping sequences to actions (or distributions over actions)
the optimal action-value function obeys an identity known as the Bellman equation
- if the optimal value \(Q^{*}(s',a')\) of the sequence \(s'\) at the next time step were known for all possible actions \(a'\)
- then the optimal strategy is to select the action \(a'\) maximizing the expected value of \(r + \gamma Q^{*}(s',a')\):
- \(Q^{*}(s,a) = \mathbb{E}_{s' \sim \varepsilon}[r+\gamma \max_{a'}Q^{*}(s',a')\,|\,s,a]\)
the basic idea behind RL is to estimate the action-value function
- by using the Bellman equation as an iterative update: \(Q_{i+1}(s,a) = \mathbb{E}_{s' \sim \varepsilon}[r+\gamma \max_{a'}Q_{i}(s',a')\,|\,s,a]\)
- such value iteration converges to the optimal action-value function, \(Q_{i} \rightarrow Q^{*}\) as \(i \rightarrow \infty\)
- in practice this is totally impractical, because the action-value function is estimated separately for each sequence, without any generalisation
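The iterative Bellman update can be sketched in a few lines; the tiny deterministic chain MDP below (states 0 → 1 → terminal 2, one action, reward 1 on reaching the terminal state) is purely illustrative and not from the paper:

```python
# Toy value iteration: Q_{i+1}(s,a) = E[r + gamma * max_a' Q_i(s',a')].
GAMMA = 0.9
# (state, action) -> (next_state, reward, terminal)
TRANSITIONS = {
    (0, "step"): (1, 0.0, False),
    (1, "step"): (2, 1.0, True),
}

def value_iteration(n_iters=50):
    Q = {sa: 0.0 for sa in TRANSITIONS}           # Q_0 = 0 everywhere
    for _ in range(n_iters):
        # rebuild Q from the previous iterate; terminal states bootstrap 0
        Q = {
            sa: r + (0.0 if done else
                     GAMMA * max(Q[(s, a)] for (s, a) in TRANSITIONS if s == s2))
            for sa, (s2, r, done) in TRANSITIONS.items()
        }
    return Q

Q_star = value_iteration()
print(Q_star)  # {(0, 'step'): 0.9, (1, 'step'): 1.0}
```

Here convergence takes only two iterations; the point of the note above is that with one Q-value per distinct *sequence*, this table would be astronomically large for Atari.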
instead it is common to use a function approximator to estimate the action-value function, \(Q(s,a;\theta)\approx Q^{*}(s,a)\)
in RL this is typically a linear function approximator
- sometimes a non-linear one, such as a neural network, is used
we refer to a neural network function approximator with weights \(\theta\) as a Q-network
a Q-network can be trained by minimizing a sequence of loss functions \(L_{i}(\theta_{i})\) that changes at each iteration \(i\):
- \(L_{i}(\theta_{i})=\mathbb{E}_{s,a \sim \rho(\cdot)}[(y_{i}-Q(s,a;\theta_{i}))^{2}]\)
- where \(y_{i} = \mathbb{E}_{s' \sim \varepsilon}[r+\gamma \max_{a'}Q(s',a';\theta_{i-1})\,|\,s,a]\) is the target for iteration \(i\)
- and \(\rho(s,a)\) is a probability distribution over sequences \(s\) and actions \(a\) that we refer to as the behaviour distribution
the parameters from the previous iteration \(\theta_{i-1}\) are held fixed when optimising the loss function \(L_{i}(\theta_{i})\)
- note: the targets depend on the network weights
- in contrast with the targets used for supervised learning, which are fixed before learning begins
differentiating the loss function with respect to the weights, we arrive at the following gradient:
- \(\nabla_{\theta_{i}}L_{i}(\theta_{i}) = \mathbb{E}_{s,a \sim \rho(\cdot);\,s' \sim \varepsilon}[(r+\gamma \max_{a'}Q(s',a';\theta_{i-1})-Q(s,a;\theta_{i}))\,\nabla_{\theta_{i}}Q(s,a;\theta_{i})]\)
it is often computationally expedient to optimize the loss function by stochastic gradient descent
- if the weights are updated after every time-step
- and the expectations are replaced by single samples from the behaviour distribution \(\rho\) and the emulator \(\varepsilon\) respectively
- then we arrive at the familiar Q-learning algorithm
this approach is model-free: it solves the RL task directly using samples from the emulator
- without explicitly constructing an estimate of \(\varepsilon\)
it is also off-policy: it learns about the greedy strategy
- while following a behaviour distribution that ensures adequate exploration of the state space
in practice, the behaviour distribution is selected by an \(\epsilon\)-greedy strategy
- that follows the greedy strategy with probability \(1 - \epsilon\)
- and selects a random action with probability \(\epsilon\)
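The single-sample Q-learning update with an ε-greedy behaviour policy can be sketched in tabular form; the chain MDP (states 0 → 1 → terminal 2) and all constants here are illustrative, not from the paper:

```python
import random

# Tabular Q-learning with an epsilon-greedy behaviour policy:
# the expectation in the gradient is replaced by one sampled transition.
random.seed(0)
GAMMA, ALPHA, EPS = 0.9, 0.5, 0.1
ACTIONS = ["left", "right"]
# (state, action) -> (next_state, reward, terminal)
STEP = {
    (0, "left"): (0, 0.0, False), (0, "right"): (1, 0.0, False),
    (1, "left"): (0, 0.0, False), (1, "right"): (2, 1.0, True),
}
Q = {sa: 0.0 for sa in STEP}

def eps_greedy(s):
    # random action with probability EPS, greedy otherwise
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

for _ in range(2000):                       # episodes
    s = 0
    while True:
        a = eps_greedy(s)
        s2, r, done = STEP[(s, a)]
        target = r if done else r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])   # single-sample update
        if done:
            break
        s = s2

print(round(Q[(1, "right")], 2), round(Q[(0, "right")], 2))  # 1.0 0.9
```

Because the update bootstraps from \(\max_{b} Q\) rather than from the action actually taken next, the learned values are those of the greedy policy even while behaviour is exploratory, which is the off-policy property noted above.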
Deep RL
the most successful deep learning approaches are trained from raw inputs
- using lightweight updates based on stochastic gradient descent
by feeding data into neural networks
- it is possible to learn better feature representations than hand-crafted ones
goal: connect an RL algorithm to a deep neural network
- which operates directly on RGB images
- and efficiently processes training data by using stochastic gradient descent updates
Experience Replay and Deep Q Networks
experience replay: store the agent's experiences at each time-step, \(e_{t}=(s_{t},a_{t},r_{t},s_{t+1})\), in a dataset \(\mathcal{D} = e_{1},...,e_{N}\)
- pooled over many episodes into a replay memory
during the inner loop of the algorithm
- we apply Q-learning updates, or minibatch updates
- to samples of experience \(e \sim \mathcal{D}\) drawn at random from the pool of stored samples
after performing experience replay
- the agent selects and executes an action according to an \(\epsilon\)-greedy policy
since histories of arbitrary length are difficult to feed to a neural network
- our Q-function instead works on a fixed-length representation of histories produced by a function \(\phi\)
algorithm
this approach has several advantages over online Q-learning
- each step of experience is potentially used in many weight updates, allowing for greater data efficiency
- learning directly from consecutive samples is inefficient due to the strong correlations between samples
- randomizing the samples breaks these correlations and therefore reduces the variance of the updates
- when learning on-policy, the current parameters determine the next data sample that the parameters are trained on
- it is easy to see that unwanted feedback loops may arise, and the parameters could get stuck in a poor local minimum or even diverge
- by using experience replay, the behaviour distribution is averaged over many of its previous states
- smoothing out learning and avoiding oscillations or divergence in the parameters
- note: this requires learning off-policy, because our current parameters are different to those used to generate the sample
in practice…
- the algorithm only stores the last \(N\) experience tuples in the replay memory
- and samples uniformly at random from \(\mathcal{D}\) when performing updates
- this is limited, since the memory buffer does not differentiate important transitions
- and always overwrites with recent transitions due to the finite memory size \(N\)
- similarly, the uniform sampling gives equal importance to all transitions in the replay memory
a more sophisticated sampling strategy could emphasize transitions
- from which we can learn the most… like prioritized sweeping
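The fixed-capacity, uniformly sampled replay memory described above can be sketched as follows; the class name, method names, and toy transitions are illustrative, not from the paper:

```python
import random
from collections import deque

# Minimal replay memory: a ring buffer that keeps only the last N
# transitions and samples minibatches uniformly at random.
class ReplayMemory:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest entries drop off

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling without replacement breaks temporal correlations
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=1000)
for t in range(1500):                   # store more steps than capacity
    memory.store(t, 0, 0.0, t + 1, False)
print(len(memory))                      # 1000: the oldest 500 were overwritten
batch = memory.sample(32)
```

The `deque(maxlen=...)` makes the "always overwrites with recent transitions" limitation explicit: transitions 0..499 are silently discarded, regardless of how informative they were.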
Deep Q-Learning with Experience Replay
- initialize replay memory \(\mathcal{D}\) to capacity \(N\)
- initialize action-value function \(Q\) with random weights
- for episode = 1, \(M\) do
- initialize sequence \(s_{1} = \{x_{1}\}\) and preprocessed sequence \(\phi_{1} = \phi(s_{1})\)
- for \(t = 1, T\) do
- with probability \(\epsilon\) select a random action \(a_{t}\)
- otherwise select \(a_{t} = \max_{a}Q^{*}(\phi(s_{t}),a;\theta)\)
- execute action \(a_{t}\) in the emulator and observe reward \(r_{t}\) and image \(x_{t+1}\)
- set \(s_{t+1} = s_{t},a_{t},x_{t+1}\) and preprocess \(\phi_{t+1}=\phi(s_{t+1})\)
- store transition \((\phi_{t},a_{t},r_{t},\phi_{t+1})\) in \(\mathcal{D}\)
- sample a random minibatch of transitions \((\phi_{j},a_{j},r_{j},\phi_{j+1})\) from \(\mathcal{D}\)
- set \(y_{j} = \begin{cases} r_{j} & \text{for terminal } \phi_{j+1}\\ r_{j}+\gamma \max_{a'}Q(\phi_{j+1},a';\theta) & \text{for non-terminal } \phi_{j+1}\end{cases}\)
- perform a gradient descent step on \((y_{j} - Q(\phi_{j},a_{j};\theta))^{2}\)
- end for
- end for
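The target computation in the pseudocode reduces to a one-liner; `q_values` below is a hypothetical stand-in for the network's outputs on \(\phi_{j+1}\):

```python
# y_j = r_j for terminal transitions,
# else r_j + gamma * max_a' Q(phi_{j+1}, a').
GAMMA = 0.99

def td_target(reward, terminal, q_values):
    # q_values: one predicted Q-value per action at the next state
    if terminal:
        return reward
    return reward + GAMMA * max(q_values)

print(td_target(1.0, True, [0.4, 2.0]))    # 1.0
print(td_target(1.0, False, [0.4, 2.0]))   # 1.0 + 0.99 * 2.0 = 2.98
```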
Preprocessing and Model Architecture
Atari frames are 210 \(\times\) 160 pixel images with a 128-color palette
raw frames are preprocessed…
- by converting RGB to gray-scale
- and downsampling to a 110 \(\times\) 84 image
- the final image is cropped to an 84 \(\times\) 84 region of the screen
- note: the implementation expects square inputs
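A rough sketch of this pipeline in pure Python on nested lists (a real implementation would use numpy/PIL; the nearest-neighbour resize and the choice of cropped region are assumptions, since the notes do not specify them):

```python
# 210x160 RGB frame -> gray-scale -> 110x84 -> 84x84 crop.

def to_gray(frame):
    # frame: H x W list of (r, g, b); standard luminance weights
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in frame]

def resize(gray, out_h, out_w):
    # nearest-neighbour downsampling (assumed method, for illustration)
    in_h, in_w = len(gray), len(gray[0])
    return [[gray[i * in_h // out_h][j * in_w // out_w]
             for j in range(out_w)] for i in range(out_h)]

def crop_square(gray, size=84):
    # keep an 84x84 region (here: the bottom rows, an assumed choice)
    top = len(gray) - size
    return [row[:size] for row in gray[top:]]

frame = [[(100, 150, 200)] * 160 for _ in range(210)]   # dummy RGB frame
out = crop_square(resize(to_gray(frame), 110, 84))
print(len(out), len(out[0]))  # 84 84
```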
from the experiments
- the function \(\phi\) from the pseudo-algorithm above applies this preprocessing to the last 4 frames of a history
- and stacks them to produce the input to the \(Q\)-function
- in some previous approaches, the history and the action have both been used as inputs to the neural network
the main drawback of this architecture
- a separate forward pass is required to compute the \(Q\)-value of each action
- resulting in a cost that scales linearly with the number of actions
instead we use an architecture in which there is a separate output unit
- for each possible action, and only the state representation is an input to the neural network
- each output corresponds to the predicted \(Q\)-value of the individual action for the input state
- main advantage: the Q-values for all possible actions in a given state are computed with only a single forward pass through the network
input to the neural network consists of an \(84 \times 84 \times 4\) image produced by \(\phi\)
- the first hidden layer convolves 16 \(8 \times 8\) filters with stride 4 with the input image and applies a rectifier nonlinearity
- the second hidden layer convolves 32 \(4 \times 4\) filters with stride 2, followed by a rectifier nonlinearity
- the final hidden layer is fully-connected and consists of 256 rectifier units
- the number of valid actions varied between 4 and 18 on the games considered
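The output-per-action design can be illustrated with a toy network; a single linear layer stands in for the convolutional stack, and all sizes and names here are made up for the sketch:

```python
import random

# One forward pass maps the state to a Q-value for every action, so the
# greedy action is a single argmax rather than one pass per action.
random.seed(1)
STATE_DIM, N_ACTIONS = 8, 4                  # toy sizes, not the paper's
W = [[random.gauss(0, 0.1) for _ in range(STATE_DIM)]
     for _ in range(N_ACTIONS)]
b = [0.0] * N_ACTIONS

def q_values(state):
    # one forward pass -> one Q-value per action (a toy linear layer)
    return [sum(w_i * s_i for w_i, s_i in zip(row, state)) + b_j
            for row, b_j in zip(W, b)]

state = [random.random() for _ in range(STATE_DIM)]
qs = q_values(state)
greedy_action = max(range(N_ACTIONS), key=lambda a: qs[a])
print(len(qs))   # 4: all action values from a single pass
```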
DeepQLearning.jl
an exploration policy and evaluation policy can be specified in the solver parameters
- the exploration policy can be in the form of a function
- the function provided will be called as f(policy, env, obs, globalstep, rng), where
- policy is the NN policy being trained,
- env is the environment,
- obs is the observation at which to take the action,
- globalstep is the interaction step of the solver,
- rng is a random number generator
- by default, an epsilon-greedy policy with a linear decrease of epsilon with globalstep is provided
- the evaluation policy is specified similarly: f(policy, env, neval, maxepisodelength, verbose)
the qnetwork option of the solver should accept any Chain object
- MLPs or CNNs followed by a dense layer are expected
- the dueling option will split all the dense layers at the end
- flattenbatch flattens all dimensions but the batch size
- the input size of the network must be specified when you create a Q-network
this package exports the type AbstractNNPolicy, which represents a neural-network-based policy
- in addition to the core functions from POMDPs.jl
- AbstractNNPolicy also supports getnetwork(policy) and resetstate!(policy)
for saving and loading models, the solver saves the network as a bson in…
- solver.logdir/"qnetwork.bson"
logging is done through TensorBoardLogger.jl; the log directory is specified in the solver options
GPU support has not been tested thoroughly
using DeepQLearning
using POMDPs
using Flux
using POMDPModels
using POMDPTools
# MDP model from POMDPModels
mdp = SimpleGridWorld()
# define Q network
model = Chain(Dense(2,32), Dense(32, length(actions(mdp))))
exploration = EpsGreedyPolicy(mdp, LinearDecaySchedule(start=1.0, stop=0.01, steps=10000/2))
solver = DeepQLearningSolver(qnetwork = model, max_steps=10000,
exploration_policy= exploration,
learning_rate=0.005, log_freq=500,
recurrence=false, double_q=true,
dueling=true, prioritized_replay=true)
policy = solve(solver, mdp)
sim = RolloutSimulator(max_steps=30)
r_tot = simulate(sim,mdp,policy)
println("Total discounted reward for 1 simulation: $r_tot")