Playing Atari with Deep RL
deep learning model that learns control policies
- directly from high-dimensional sensory input using RL
CNN trained with a variant of Q-Learning
- input: raw pixels; output: a value function estimating future rewards
experiment: seven Atari 2600 games (from the Arcade Learning Environment)
- with no adjustment to the architecture or learning algorithm
Learning to control agents from high-dimensional sensory inputs
most success has been with hand-crafted features
- combined with linear value functions or policy representations
deep learning has achieved breakthroughs utilising neural network architectures
- CNNs, MLPs, restricted Boltzmann machines, recurrent neural networks
- exploiting both supervised and unsupervised learning
challenges
RL algorithms must be able to learn from a scalar reward signal
- that is frequently sparse, noisy and delayed
the delay between actions and resulting rewards
- can be thousands of timesteps long
- compared to supervised learning, where there is a direct association of inputs and targets
deep learning algorithms assume the data samples to be independent
- in RL one typically encounters sequences of highly correlated states
in RL the data distribution changes as the algorithm learns new behaviours
- which is problematic for methods that assume a fixed underlying distribution
experiment
implement a convolutional neural network
- to learn successful control policies from raw video data
- in complex RL environments
the network is trained with a variant of the Q-Learning algorithm
- with stochastic gradient descent to update the weights
to alleviate issues with correlated data and non-stationary distributions
- use an experience replay mechanism which randomly samples previous transitions
- smoothing the training distribution over many past behaviours
applied to a range of Atari games
Atari 2600 as a testbed
- with visual input of 210x160 RGB video at 60 Hz
- and a diverse set of tasks meant to challenge human players
goal is to create a single NN agent that is able to successfully play the games
- the network is not provided any game knowledge or hand-crafted features
- nor is it privy to the internal state of the game
learning from nothing but…
- video input
- the reward and terminal signals
- the set of possible actions
network architecture and hyperparameters used for training
- are kept constant across games
Background on the formal representation for Q-Learning
Consider tasks in which an agent interacts with an environment \(\varepsilon\)
- in this case the Atari emulator, in a sequence of actions, observations, and rewards
At each time step the agent selects an action \(a_{t}\)
- from the set of legal game actions \(\mathcal{A} = \{1,...,K\}\)
The action is passed to the emulator
- and modifies its internal state and the game score
in general \(\varepsilon\) may be stochastic
the emulator's internal state is not observed by the agent
instead it observes an image \(x_{t} \in \mathbb{R}^{d}\) from the emulator
- a vector of raw pixel values representing the current screen
in addition it receives a reward \(r_{t}\) representing the change in game score
- note: in general the game score may depend on the whole prior sequence of actions and observations
- feedback about an action may only be received after thousands of time steps have elapsed
the task is partially observed and many emulator states are perceptually aliased
- i.e. it is impossible to fully understand the current situation
- from only the current screen \(x_{t}\)
we therefore consider sequences of actions and observations, \(s_{t}=x_{1},a_{1},x_{2},...,a_{t-1},x_{t}\)
- and learn game strategies that depend upon these sequences
all sequences in the emulator are assumed to terminate in a finite number of timesteps
- giving rise to a large but finite Markov Decision Process (MDP)
- in which each sequence is a distinct state
apply standard RL methods for MDPs by using…
- the complete sequence \(s_{t}\) as the state representation at time \(t\)
the goal of the agent is to interact with the emulator by selecting actions
- in a way that maximizes future rewards
we make the standard assumption
- that future rewards are discounted by a factor of \(\gamma\) per timestep
- and define the future discounted return at time \(t\) as: \(R_{t}=\sum_{t'=t}^{T}\gamma^{t'-t}r_{t'}\)
where \(T\) is the time step at which the game terminates
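As a quick sanity check of the return definition, a few lines of Python suffice; the reward sequence below is made up for illustration:

```python
# Worked example of the discounted return
# R_t = sum_{t'=t}^{T} gamma^{t'-t} r_{t'} for a short reward sequence.
GAMMA = 0.99

def discounted_return(rewards, gamma=GAMMA):
    # rewards[k] is r_{t+k}; the game terminates after the last entry
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0, 0.0, 1.0]       # illustrative values
print(discounted_return(rewards))          # 0.99**2 + 0.99**4 ≈ 1.9407
```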
we define the optimal action-value function \(Q^{*}(s,a)\) as the maximum expected return achievable by following any strategy
- after seeing some sequence \(s\) and then taking some action \(a\):
- \(Q^{*}(s,a) = \max_{\pi}\mathbb{E}[R_{t}\,|\,s_{t} = s, a_{t}=a, \pi]\)
where \(\pi\) is a policy mapping sequences to actions (or distributions over actions)
the optimal action-value function obeys an identity known as the Bellman equation
- if the optimal value \(Q^{*}(s',a')\) of the sequence \(s'\) at the next time step were known for all possible actions \(a'\)
- then the optimal strategy is to select the action \(a'\) maximizing the expected value of \(r + \gamma Q^{*}(s',a')\):
- \(Q^{*}(s,a) = \mathbb{E}_{s' \sim \varepsilon}[r+\gamma \max_{a'}Q^{*}(s',a')\,|\,s,a]\)
the basic idea behind RL is to estimate the action-value function
- by using the Bellman equation as an iterative update: \(Q_{i+1}(s,a) = \mathbb{E}_{s' \sim \varepsilon}[r+\gamma \max_{a'}Q_{i}(s',a')\,|\,s,a]\)
- such value iteration converges to the optimal action-value function, \(Q_{i} \rightarrow Q^{*}\) as \(i \rightarrow \infty\)
- in practice this is totally impractical, because the action-value function is estimated separately for each sequence, without any generalisation
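The iterative Bellman update can be sketched in a few lines; the tiny deterministic chain MDP below (states 0 → 1 → terminal 2, one action, reward 1 on reaching the terminal state) is purely illustrative and not from the paper:

```python
# Toy value iteration: Q_{i+1}(s,a) = E[r + gamma * max_a' Q_i(s',a')].
GAMMA = 0.9
# (state, action) -> (next_state, reward, terminal)
TRANSITIONS = {
    (0, "step"): (1, 0.0, False),
    (1, "step"): (2, 1.0, True),
}

def value_iteration(n_iters=50):
    Q = {sa: 0.0 for sa in TRANSITIONS}           # Q_0 = 0 everywhere
    for _ in range(n_iters):
        # rebuild Q from the previous iterate; terminal states bootstrap 0
        Q = {
            sa: r + (0.0 if done else
                     GAMMA * max(Q[(s, a)] for (s, a) in TRANSITIONS if s == s2))
            for sa, (s2, r, done) in TRANSITIONS.items()
        }
    return Q

Q_star = value_iteration()
print(Q_star)  # {(0, 'step'): 0.9, (1, 'step'): 1.0}
```

Here convergence takes only two iterations; the point of the note above is that with one Q-value per distinct *sequence*, this table would be astronomically large for Atari.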
instead it is common to use a function approximator to estimate the action-value function, \(Q(s,a;\theta)\approx Q^{*}(s,a)\)
in RL this is typically a linear function approximator
- sometimes a non-linear one, such as a neural network, is used
we refer to a neural network function approximator with weights \(\theta\) as a Q-network
a Q-network can be trained by minimizing a sequence of loss functions \(L_{i}(\theta_{i})\) that changes at each iteration \(i\):
- \(L_{i}(\theta_{i})=\mathbb{E}_{s,a \sim \rho(\cdot)}[(y_{i}-Q(s,a;\theta_{i}))^{2}]\)
- where \(y_{i} = \mathbb{E}_{s' \sim \varepsilon}[r+\gamma \max_{a'}Q(s',a';\theta_{i-1})\,|\,s,a]\) is the target for iteration \(i\)
- and \(\rho(s,a)\) is a probability distribution over sequences \(s\) and actions \(a\) that we refer to as the behaviour distribution
the parameters from the previous iteration \(\theta_{i-1}\) are held fixed when optimising the loss function \(L_{i}(\theta_{i})\)
- note: the targets depend on the network weights
- in contrast with the targets used for supervised learning, which are fixed before learning begins
differentiating the loss function with respect to the weights, we arrive at the following gradient:
- \(\nabla_{\theta_{i}}L_{i}(\theta_{i}) = \mathbb{E}_{s,a \sim \rho(\cdot);\,s' \sim \varepsilon}[(r+\gamma \max_{a'}Q(s',a';\theta_{i-1})-Q(s,a;\theta_{i}))\,\nabla_{\theta_{i}}Q(s,a;\theta_{i})]\)
it is often computationally expedient to optimize the loss function by stochastic gradient descent
- if the weights are updated after every time-step
- and the expectations are replaced by single samples from the behaviour distribution \(\rho\) and the emulator \(\varepsilon\) respectively
- then we arrive at the familiar Q-learning algorithm
this approach is model-free: it solves the RL task directly using samples from the emulator
- without explicitly constructing an estimate of \(\varepsilon\)
it is also off-policy: it learns about the greedy strategy
- while following a behaviour distribution that ensures adequate exploration of the state space
in practice, the behaviour distribution is selected by an \(\epsilon\)-greedy strategy
- that follows the greedy strategy with probability \(1 - \epsilon\)
- and selects a random action with probability \(\epsilon\)
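The single-sample Q-learning update with an ε-greedy behaviour policy can be sketched in tabular form; the chain MDP (states 0 → 1 → terminal 2) and all constants here are illustrative, not from the paper:

```python
import random

# Tabular Q-learning with an epsilon-greedy behaviour policy:
# the expectation in the gradient is replaced by one sampled transition.
random.seed(0)
GAMMA, ALPHA, EPS = 0.9, 0.5, 0.1
ACTIONS = ["left", "right"]
# (state, action) -> (next_state, reward, terminal)
STEP = {
    (0, "left"): (0, 0.0, False), (0, "right"): (1, 0.0, False),
    (1, "left"): (0, 0.0, False), (1, "right"): (2, 1.0, True),
}
Q = {sa: 0.0 for sa in STEP}

def eps_greedy(s):
    # random action with probability EPS, greedy otherwise
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

for _ in range(2000):                       # episodes
    s = 0
    while True:
        a = eps_greedy(s)
        s2, r, done = STEP[(s, a)]
        target = r if done else r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])   # single-sample update
        if done:
            break
        s = s2

print(round(Q[(1, "right")], 2), round(Q[(0, "right")], 2))  # 1.0 0.9
```

Because the update bootstraps from \(\max_{b} Q\) rather than from the action actually taken next, the learned values are those of the greedy policy even while behaviour is exploratory, which is the off-policy property noted above.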
Deep RL
the most successful deep learning approaches are trained from raw inputs
- using lightweight updates based on stochastic gradient descent
by feeding data into neural networks
- it is possible to learn better feature representations than hand-crafted ones
goal: connect an RL algorithm to a deep neural network
- which operates directly on RGB images
- and efficiently processes training data by using stochastic gradient descent updates
Experience Replay and Deep Q Networks
experience replay: store the agent's experiences at each time-step, \(e_{t}=(s_{t},a_{t},r_{t},s_{t+1})\), in a dataset \(\mathcal{D} = e_{1},...,e_{N}\)
- pooled over many episodes into a replay memory
during the inner loop of the algorithm
- we apply Q-learning updates, or minibatch updates
- to samples of experience \(e \sim \mathcal{D}\) drawn at random from the pool of stored samples
after performing experience replay
- the agent selects and executes an action according to an \(\epsilon\)-greedy policy
since histories of arbitrary length are difficult to feed to a neural network
- our Q-function instead works on a fixed-length representation of histories produced by a function \(\phi\)
algorithm
this approach has several advantages over online Q-learning
- each step of experience is potentially used in many weight updates, allowing for greater data efficiency
- learning directly from consecutive samples is inefficient due to the strong correlations between samples
- randomizing the samples breaks these correlations and therefore reduces the variance of the updates
- when learning on-policy, the current parameters determine the next data sample that the parameters are trained on
- it is easy to see that unwanted feedback loops may arise, and the parameters could get stuck in a poor local minimum or even diverge
- by using experience replay, the behaviour distribution is averaged over many of its previous states
- smoothing out learning and avoiding oscillations or divergence in the parameters
- note: this requires learning off-policy, because our current parameters are different to those used to generate the sample
in practice…
- the algorithm only stores the last \(N\) experience tuples in the replay memory
- and samples uniformly at random from \(\mathcal{D}\) when performing updates
- this is limited, since the memory buffer does not differentiate important transitions
- and always overwrites with recent transitions due to the finite memory size \(N\)
- similarly, the uniform sampling gives equal importance to all transitions in the replay memory
a more sophisticated sampling strategy could emphasize transitions
- from which we can learn the most… like prioritized sweeping
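The fixed-capacity, uniformly sampled replay memory described above can be sketched as follows; the class name, method names, and toy transitions are illustrative, not from the paper:

```python
import random
from collections import deque

# Minimal replay memory: a ring buffer that keeps only the last N
# transitions and samples minibatches uniformly at random.
class ReplayMemory:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest entries drop off

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling without replacement breaks temporal correlations
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=1000)
for t in range(1500):                   # store more steps than capacity
    memory.store(t, 0, 0.0, t + 1, False)
print(len(memory))                      # 1000: the oldest 500 were overwritten
batch = memory.sample(32)
```

The `deque(maxlen=...)` makes the "always overwrites with recent transitions" limitation explicit: transitions 0..499 are silently discarded, regardless of how informative they were.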
Deep Q-Learning with Experience Replay
- initialize replay memory \(\mathcal{D}\) to capacity \(N\)
- initialize action-value function \(Q\) with random weights
- for episode = 1, \(M\) do
- initialize sequence \(s_{1} = \{x_{1}\}\) and preprocessed sequence \(\phi_{1} = \phi(s_{1})\)
- for \(t = 1, T\) do
- with probability \(\epsilon\) select a random action \(a_{t}\)
- otherwise select \(a_{t} = \max_{a}Q^{*}(\phi(s_{t}),a;\theta)\)
- execute action \(a_{t}\) in the emulator and observe reward \(r_{t}\) and image \(x_{t+1}\)
- set \(s_{t+1} = s_{t},a_{t},x_{t+1}\) and preprocess \(\phi_{t+1}=\phi(s_{t+1})\)
- store transition \((\phi_{t},a_{t},r_{t},\phi_{t+1})\) in \(\mathcal{D}\)
- sample a random minibatch of transitions \((\phi_{j},a_{j},r_{j},\phi_{j+1})\) from \(\mathcal{D}\)
- set \(y_{j} = \begin{cases} r_{j} & \text{for terminal } \phi_{j+1}\\ r_{j}+\gamma \max_{a'}Q(\phi_{j+1},a';\theta) & \text{for non-terminal } \phi_{j+1}\end{cases}\)
- perform a gradient descent step on \((y_{j} - Q(\phi_{j},a_{j};\theta))^{2}\)
- end for
- end for
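The target computation in the pseudocode reduces to a one-liner; `q_values` below is a hypothetical stand-in for the network's outputs on \(\phi_{j+1}\):

```python
# y_j = r_j for terminal transitions,
# else r_j + gamma * max_a' Q(phi_{j+1}, a').
GAMMA = 0.99

def td_target(reward, terminal, q_values):
    # q_values: one predicted Q-value per action at the next state
    if terminal:
        return reward
    return reward + GAMMA * max(q_values)

print(td_target(1.0, True, [0.4, 2.0]))    # 1.0
print(td_target(1.0, False, [0.4, 2.0]))   # 1.0 + 0.99 * 2.0 = 2.98
```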
Preprocessing and Model Architecture
Atari frames are 210 \(\times\) 160 pixel images with a 128-color palette
raw frames are preprocessed…
- by converting RGB to gray-scale
- and downsampling to a 110 \(\times\) 84 image
- the final image is cropped to an 84 \(\times\) 84 region of the screen
- note: the implementation expects square inputs
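A rough sketch of this pipeline in pure Python on nested lists (a real implementation would use numpy/PIL; the nearest-neighbour resize and the choice of cropped region are assumptions, since the notes do not specify them):

```python
# 210x160 RGB frame -> gray-scale -> 110x84 -> 84x84 crop.

def to_gray(frame):
    # frame: H x W list of (r, g, b); standard luminance weights
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in frame]

def resize(gray, out_h, out_w):
    # nearest-neighbour downsampling (assumed method, for illustration)
    in_h, in_w = len(gray), len(gray[0])
    return [[gray[i * in_h // out_h][j * in_w // out_w]
             for j in range(out_w)] for i in range(out_h)]

def crop_square(gray, size=84):
    # keep an 84x84 region (here: the bottom rows, an assumed choice)
    top = len(gray) - size
    return [row[:size] for row in gray[top:]]

frame = [[(100, 150, 200)] * 160 for _ in range(210)]   # dummy RGB frame
out = crop_square(resize(to_gray(frame), 110, 84))
print(len(out), len(out[0]))  # 84 84
```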
from the experiments
- the function \(\phi\) from the pseudo-algorithm above applies this preprocessing to the last 4 frames of a history
- and stacks them to produce the input to the \(Q\)-function
- in some previous approaches, the history and the action have both been used as inputs to the neural network
the main drawback of this architecture
- a separate forward pass is required to compute the \(Q\)-value of each action
- resulting in a cost that scales linearly with the number of actions
instead we use an architecture in which there is a separate output unit
- for each possible action, and only the state representation is an input to the neural network
- each output corresponds to the predicted \(Q\)-value of the individual action for the input state
- main advantage: the Q-values for all possible actions in a given state are computed with only a single forward pass through the network
input to the neural network consists of an \(84 \times 84 \times 4\) image produced by \(\phi\)
- the first hidden layer convolves 16 \(8 \times 8\) filters with stride 4 with the input image and applies a rectifier nonlinearity
- the second hidden layer convolves 32 \(4 \times 4\) filters with stride 2, followed by a rectifier nonlinearity
- the final hidden layer is fully-connected and consists of 256 rectifier units
- the number of valid actions varied between 4 and 18 on the games considered
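The output-per-action design can be illustrated with a toy network; a single linear layer stands in for the convolutional stack, and all sizes and names here are made up for the sketch:

```python
import random

# One forward pass maps the state to a Q-value for every action, so the
# greedy action is a single argmax rather than one pass per action.
random.seed(1)
STATE_DIM, N_ACTIONS = 8, 4                  # toy sizes, not the paper's
W = [[random.gauss(0, 0.1) for _ in range(STATE_DIM)]
     for _ in range(N_ACTIONS)]
b = [0.0] * N_ACTIONS

def q_values(state):
    # one forward pass -> one Q-value per action (a toy linear layer)
    return [sum(w_i * s_i for w_i, s_i in zip(row, state)) + b_j
            for row, b_j in zip(W, b)]

state = [random.random() for _ in range(STATE_DIM)]
qs = q_values(state)
greedy_action = max(range(N_ACTIONS), key=lambda a: qs[a])
print(len(qs))   # 4: all action values from a single pass
```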
DeepQLearning.jl
an exploration policy and evaluation policy can be specified in the solver parameters
- the exploration policy can be in the form of a function
- the function provided will be called as f(policy, env, obs, globalstep, rng), where
- policy is the NN policy being trained,
- env is the environment,
- obs is the observation at which to take the action,
- globalstep is the interaction step of the solver,
- rng is a random number generator
- by default, an epsilon-greedy policy with a linear decrease of epsilon with globalstep is provided
- the evaluation policy is specified similarly: f(policy, env, neval, maxepisodelength, verbose)
the qnetwork option of the solver should accept any Chain object
- MLPs or CNNs followed by a dense layer are expected
- the dueling option will split all the dense layers at the end
- flattenbatch flattens all dimensions but the batch size
- the input size of the network must be specified when you create a Q-network
this package exports the type AbstractNNPolicy, which represents a neural-network-based policy
- in addition to the core functions from POMDPs.jl
- AbstractNNPolicy also supports getnetwork(policy) and resetstate!(policy)
for saving and loading models, the solver saves the network as a bson in…
- solver.logdir/"qnetwork.bson"
logging is done through TensorBoardLogger.jl; the log directory is specified in the solver options
GPU support has not been tested thoroughly
using DeepQLearning
using POMDPs
using Flux
using POMDPModels
using POMDPTools
# MDP model from POMDPModels
mdp = SimpleGridWorld()
# define Q network
model = Chain(Dense(2,32), Dense(32, length(actions(mdp))))
exploration = EpsGreedyPolicy(mdp, LinearDecaySchedule(start=1.0, stop=0.01, steps=10000/2))
solver = DeepQLearningSolver(qnetwork = model, max_steps=10000,
exploration_policy= exploration,
learning_rate=0.005, log_freq=500,
recurrence=false, double_q=true,
dueling=true, prioritized_replay=true)
policy = solve(solver, mdp)
sim = RolloutSimulator(max_steps=30)
r_tot = simulate(sim,mdp,policy)
println("Total discounted reward for 1 simulation: $r_tot")