-
sequence transduction models
-
transduction/transductive learning
- a transducer is a general name for a component that converts sound into another form of energy, or vice versa
- in general we see transduction is about converting a signal into another form
-
Transductive learning
-
a concept from statistical learning theory that refers to predicting specific target examples given specific examples from a domain
inductive learning - deriving a function from given data
deductive learning - deriving the values of a given function for points of interest
transductive learning - deriving values of unknown function for points of interest from given data
interesting framing of supervised learning
-
approximating a mapping function from data and using it to make a prediction
estimating the value of a function only at a given point of interest describes a new concept of inference
-
moving from the particular to the particular
- transductive inference
when one would like to get the best result from a restricted amount of information
transduction is naturally related to a set of algorithms known as instance-based learning
- k-nearest neighbors is an example of this type of learning
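As a minimal sketch of instance-based learning (the data here is purely illustrative), a k-nearest-neighbors classifier stores the training examples and predicts a point of interest directly from them, moving from the particular to the particular without fitting a global function:

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest stored examples."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # distance to every stored example
    nearest = np.argsort(dists)[:k]                     # indices of the k closest examples
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                    # majority vote

# two well-separated clusters, labeled 0 and 1
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([0.15, 0.1])))  # → 0
```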
-
-
Transduction in sequence prediction
-
a transducer is narrowly defined as a model that outputs one time step for each input time step provided
-
this maps to a linguistic usage
-
with finite-state transducers
treat an RNN as a transducer
- producing output for each input it reads in
-
conditioned generation, such as the Encoder-Decoder architecture
- is considered a special case of the RNN transducer
More generally, transduction is used in NLP sequence prediction tasks for translation
- a usage somewhat more relaxed than the strict one-output-per-input of an FST
"many ML tasks can be expressed as the transformation – or transduction – of input sequences into output sequences"
Recurrent models
-
RNNs, in particular LSTMs and gated RNNs, are state-of-the-art approaches in sequence modeling and transduction problems
-
recurrent models typically factor computation along the symbol positions of the input and output sequences
-
aligning the positions to steps in computation time
-
they generate a sequence of hidden states \(h_t\)
-
as a function of the previous hidden state \(h_{t-1}\)
- and the input position \(t\)
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models
-
allowing modeling dependencies
- without regard to their distance in the input or output sequences
such attention mechanisms are used in conjunction with a recurrent network
the Transformer is a model that refrains from recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output
Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations
Model Architecture
- Most competitive neural sequence transduction models have an encoder-decoder structure
Here, the encoder maps an input sequence of symbol representations \((x_1, …, x_n)\) to a sequence of continuous representations \(z = (z_1, …, z_n)\). Given \(z\), the decoder then generates an output sequence \((y_1, …, y_m)\) of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next
The Transformer follows this overall architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder.
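The auto-regressive generation loop described above can be sketched abstractly; `encode` and `decode_step` here are hypothetical stand-ins, not actual Transformer components:

```python
def greedy_decode(encode, decode_step, src, bos, eos, max_len=50):
    """Generate output symbols one at a time, feeding each back in as additional input."""
    z = encode(src)              # continuous representations z = (z1, ..., zn)
    ys = [bos]                   # generated prefix, seeded with a begin-of-sequence symbol
    for _ in range(max_len):
        y_next = decode_step(z, ys)  # predict the next symbol from z and the prefix
        ys.append(y_next)
        if y_next == eos:
            break
    return ys[1:]

# toy stand-ins: the "encoder" is the identity, the "decoder" copies the source
encode = lambda src: src
def copy_step(z, ys):
    i = len(ys) - 1              # position currently being generated
    return z[i] if i < len(z) else "<eos>"

print(greedy_decode(encode, copy_step, ["a", "b"], "<bos>", "<eos>"))  # → ['a', 'b', '<eos>']
```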
Encoder
-
The encoder is composed of a stack of N = 6 identical layers.
-
Each layer has two sub-layers.
-
The first is a multi-head self-attention mechanism,
- and the second is a simple, positionwise fully connected feed-forward network.
-
We employ a residual connection around each of the two sub-layers, followed by layer normalization
- That is, the output of each sub-layer is \(LayerNorm(x + Sublayer(x))\), where Sublayer(x) is the function implemented by the sub-layer itself.
-
To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers,
- produce outputs of dimension \(d_{model} = 512\)
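A minimal numpy sketch of the sub-layer pattern \(LayerNorm(x + Sublayer(x))\), using a simplified layer normalization without the learned gain and bias parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_connection(x, sublayer):
    """Residual connection around a sub-layer, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(0).standard_normal((10, 512))   # 10 positions, d_model = 512
out = sublayer_connection(x, lambda h: 0.1 * h)           # toy sub-layer
assert out.shape == (10, 512)
```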
Decoder
-
The decoder is also composed of a stack of N = 6 identical layers.
-
In addition to the two sub-layers in each encoder layer,
-
the decoder inserts a third sub-layer,
- which performs multi-head attention over the output of the encoder stack
-
Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.
-
We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions.
-
This masking, combined with the fact that the output embeddings are offset by one position,
- ensures that the predictions for position \(i\) can depend only on the known outputs at positions less than \(i\)
Attention
-
An attention function can be described as mapping a query and a set of key-value pairs to an output,
- where the query, keys, values, and output are all vectors.
-
the output is computed as a weighted sum of the values
-
where the weight assigned to each value
- is computed by a compatibility function of the query with the corresponding key
Scaled Dot-Product Attention
-
input consists of queries and keys of dimension \(d_k\) and values of dimension \(d_v\)
-
we compute the dot products of the query with all the keys
-
divide each by \(\sqrt{d_k}\)
- and apply a softmax function to obtain the weights on the values
Practice
compute the attention function on a set of queries simultaneously
- packed together into a matrix \(Q\)
the keys and values are also packed together
- in matrices \(K\) and \(V\)
the matrix of outputs is computed as:
\(attention(Q,K,V)=softmax(\frac{QK^T}{\sqrt{d_k}})V\)
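The formula translates almost directly into numpy; the shapes below follow the definitions above (queries and keys of dimension \(d_k\), values of dimension \(d_v\)):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract row max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys) compatibility scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over the keys
    return weights @ V                   # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 64))   # 4 queries, d_k = 64
K = rng.standard_normal((6, 64))   # 6 keys
V = rng.standard_normal((6, 32))   # 6 values, d_v = 32
assert attention(Q, K, V).shape == (4, 32)
```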
most commonly used attention functions are additive attention and dot-product attention
-
dot-product is identical except for the scaling of \(\frac{1}{\sqrt{d_k}}\)
additive attention computes the compatibility function using a feed-forward network with a single hidden layer
-
dot-product attention is much faster in practice, since it can use highly optimized matrix multiplication code
-
additive attention outperforms dot-product attention without scaling for larger values of \(d_k\)
-
presumably because for large \(d_k\) the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients
- to counteract this effect, the dot products are scaled by \(\frac{1}{\sqrt{d_k}}\)
to illustrate why the dot products get large
-
assume that the components of \(q\) and \(k\) are independent random variables
-
with mean 0 and variance 1
- their dot product, \(q \cdot k = \sum^{d_k}_{i=1}q_ik_i\) has mean 0 and variance \(d_k\)
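This is easy to verify empirically: sampling many independent \((q, k)\) pairs, the dot products come out with variance close to \(d_k\):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
# 100,000 independent (q, k) pairs whose components have mean 0 and variance 1
q = rng.standard_normal((100_000, d_k))
k = rng.standard_normal((100_000, d_k))
dots = (q * k).sum(axis=1)           # q · k for each pair
print(round(float(dots.mean()), 2))  # close to 0
print(round(float(dots.var()), 1))   # close to d_k = 64
```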
Multi-head Attention
-
instead of a single attention function with \(d_{model}\)-dimensional keys, values, and queries
-
it is beneficial to linearly project them \(h\) times
-
with different learned projections to \(d_k\), \(d_k\), and \(d_v\) dimensions, respectively
-
we then perform the attention function in parallel on each of these projected versions
- yielding \(d_v\)-dimensional output values
- these are concatenated and once again projected, resulting in the final values
-
multihead attention allows the model to jointly attend to information
-
from different representation subspaces at different positions
-
with a single attention head, averaging inhibits this
\(MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O\)
- where \({head}_i = Attention(QW^Q_i,KW^K_i,VW^V_i)\)
-
-
where the projections are parameter matrices \(W^Q_i \in \mathbb{R}^{d_{model}\times d_k}, W^K_i \in \mathbb{R}^{d_{model}\times d_k}, W^V_i \in \mathbb{R}^{d_{model}\times d_v}\)
- and \(W^O \in \mathbb{R}^{hd_v \times d_{model}}\)
Practice
Transformer architecture employs \(h = 8\) parallel attention layers, or heads
-
for each of these we use \(d_k = d_v = d_{model} / h = 64\); due to the reduced dimension of each head
-
the total computation cost is similar to that of single-head attention
- with full dimensionality
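A sketch of multi-head attention with these dimensions, reusing the single-head scaled dot-product `attention` from above (bias-free projections with random weights, purely for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(Q, K, V, heads, Wo):
    """Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)."""
    outs = [attention(Q @ Wq, K @ Wk, V @ Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outs, axis=-1) @ Wo

d_model, d_k, d_v, h, n = 512, 64, 64, 8, 10
rng = np.random.default_rng(0)
heads = [(rng.standard_normal((d_model, d_k)),   # W^Q_i
          rng.standard_normal((d_model, d_k)),   # W^K_i
          rng.standard_normal((d_model, d_v)))   # W^V_i
         for _ in range(h)]
Wo = rng.standard_normal((h * d_v, d_model))     # W^O
x = rng.standard_normal((n, d_model))
assert multi_head_attention(x, x, x, heads, Wo).shape == (n, d_model)
```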
Application of Attention in Transformers
Transformers use multi-head attention in
-
encoder-decoder attention
-
the queries come from the previous decoder layer
- and the memory keys and values come from the output of the encoder
enables every position in the decoder to attend over all positions in the input sequence — this mimics the typical encoder-decoder attention mechanisms
- in sequence-to-sequence models
-
the encoder contains self-attention layers
-
in a self-attention layer, all of the keys, values, and queries
-
come from the same place
- in this case, the output of the previous layer in the encoder
- each position in the encoder can attend to all positions in the previous layer of the encoder
-
self-attention layers in the decoder
-
allow each position in the decoder to attend to all positions in the decoder
- up to and including that position
- we prevent leftward information flow in the decoder to preserve the auto-regressive property
-
this is implemented inside of scaled dot-product attention
-
by masking out (setting to negative infinity) all values in the input
- of the softmax which correspond to illegal connections
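A sketch of this masking: entries above the diagonal (future positions) are set to \(-\infty\) before the softmax, so they receive exactly zero weight:

```python
import numpy as np

def masked_attention_weights(scores):
    """Softmax over scores with subsequent (future) positions masked to -inf."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

w = masked_attention_weights(np.zeros((4, 4)))
print(np.triu(w, k=1).max())  # → 0.0: no position attends to a later one
```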
Position-wise Feed Forward Networks
-
in addition to attention sub-layers
-
each layer in the encoder and decoder contains a fully connected feed-forward network
-
which is applied to each position separately and identically
-
this consists of two linear transformations with a ReLU activation function in between
\(FFN(x)=max(0,xW_1+b_1)W_2+b_2\)
-
linear transformations are the same across different positions
- they use different parameters from layer to layer
Practice
-
we can describe this as two convolutions with kernel size 1
-
the dimensionality of input and output is \(d_{model} = 512\)
- and the inner-layer has dimensionality \(d_{ff} = 2048\)
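With those dimensions, the position-wise FFN is a direct transcription of the formula (random weights, purely for illustration):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: max(0, x W1 + b1) W2 + b2."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU between two linear maps

d_model, d_ff, n = 512, 2048, 10
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)
x = rng.standard_normal((n, d_model))             # one row per position
assert ffn(x, W1, b1, W2, b2).shape == (n, d_model)
```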
Embedding and Softmax
-
like most sequence transduction models we use learned embeddings
- to convert the input tokens and output tokens to vectors of dimension \(d_{model}\)
-
we also use the usual learned linear transformation and softmax function
- to convert the decoder output to predicted next-token probabilities
-
in our model, we share the same weight matrix between
-
the two embedding layers and the pre-softmax linear transformation
-
in the embedding layers, we multiply those weights by \(\sqrt{d_{model}}\)
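A sketch of this weight sharing, with a hypothetical vocabulary size: one matrix serves both as the (scaled) embedding lookup and, transposed, as the pre-softmax linear transformation:

```python
import numpy as np

vocab_size, d_model = 1000, 512      # vocab_size is a hypothetical value
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, d_model)) * 0.02   # the single shared weight matrix

def embed(token_ids):
    """Embedding lookup, scaled by sqrt(d_model)."""
    return E[token_ids] * np.sqrt(d_model)

def logits(decoder_output):
    """Pre-softmax linear transformation, reusing the embedding weights transposed."""
    return decoder_output @ E.T

x = embed(np.array([1, 2, 3]))
assert x.shape == (3, d_model)
assert logits(x).shape == (3, vocab_size)
```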
-
Positional Encoding
-
Transformers contain no recurrence or convolution
-
so that the model can make use of the order of the sequence
-
we inject some information about the relative or absolute position
- of the tokens in the sequence
-
we add positional encodings to the input embeddings at the bottoms of the encoder and decoder stacks
-
positional encodings have the same dimension \(d_{model}\) as the embeddings
- so the two can be summed
there are many choices of positional encodings, learned and fixed
-
we use sine and cosine functions of different frequencies: \(PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}})\) and \(PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})\)
-
where \(pos\) is the position and \(i\) is the dimension
- each dimension of the positional encoding corresponds to a sinusoid
-
the wavelengths form a geometric progression from \(2\pi\) to \(10000 \cdot 2\pi\)
-
this is chosen to allow the model to easily learn to attend
-
by relative positions
- since for any fixed offset \(k\), \(PE_{pos+k}\)
- can be represented as a linear function of \(PE_{pos}\)
the authors also experimented with learned positional embeddings and found they produced nearly identical results
-
the sinusoidal version may allow the model to extrapolate to sequence lengths longer
- than the ones encountered during training
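The sinusoidal encoding can be computed in one vectorized pass; this sketch fills even dimensions with sines and odd dimensions with cosines, as defined above:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices 2i
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions get sines
    pe[:, 1::2] = np.cos(angles)               # odd dimensions get cosines
    return pe

pe = positional_encoding(100, 512)
assert pe.shape == (100, 512)   # same dimension as the embeddings, so the two can be summed
```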
Why Self-Attention
- the total computational complexity per layer
-
the amount of computation that can be parallelized
- as measured by the minimum number of sequential operations required
-
the path length between long-range dependencies in the network; learning long-range dependencies is a key challenge in many sequence transduction tasks
-
one key factor affecting the ability to learn such dependencies is the length
- of the paths forward and backward signals have to traverse in the network
-
the shorter these paths between any combination of positions in the input and output sequences
- the easier it is to learn long-range dependencies
they also compare the max path length between any two input/output positions in networks composed of the different layer types
a self-attention layer connects all positions with a constant number of sequentially executed operations
- whereas a recurrent layer requires \(O(n)\) sequential operations
in terms of complexity, self-attention layers are faster than recurrent layers
-
when the sequence length \(n\) is smaller than the representation dimensionality \(d\)
- which is most often the case with sentence representations used by state-of-the-art models in machine translation
-
to improve computational performance for tasks involving very long sentences
-
self-attention could be restricted to considering only a neighborhood of size \(r\) in the input sequence centered around the respective output position
- this increases the max path length to \(O(n/r)\)
a single convolutional layer with kernel width \(k < n\) does not connect all pairs of input and output positions
-
doing so requires \(O(n/k)\) convolutional layers in the case of contiguous kernels
- or \(O(\log_k(n))\) in the case of dilated convolutions, increasing the length of the longest paths between any two positions in the network
Convolutional layers are generally more expensive than recurrent layers, by a factor of \(k\).
separable convolutions decrease the complexity considerably to \(O(k\cdot n \cdot d + n \cdot d^2)\)
- even with \(k = n\), the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer
as an added benefit, self-attention could yield more interpretable models