Attention Is All You Need

  • sequence transduction models
    • transduction/transductive learning
      • transducer is a general name for a component that converts energy from one form to another, e.g. sound into an electrical signal, or vice versa
    • in general, transduction is about converting a signal into another form

Transductive learning

  • a concept from statistical learning theory: predicting the values of specific target examples directly from given examples in the domain

    inductive learning - deriving a function from given data

    deductive learning - deriving the values of a given function for points of interest

    transductive learning - deriving the values of an unknown function for points of interest from given data

    interesting framing of supervised learning

    • approximating a mapping function from data and using it to make a prediction

      the model of estimating the value of a function at a given point of interest describes a new concept of inference

      • moving from the particular to the particular

        • transductive inference

        when one would like to get the best result from a restricted amount of information

        transduction is naturally related to a set of algorithms known as instance-based learning

        • k-nearest neighbors is an example of this type of learning
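As a sketch of this instance-based, "particular to particular" style of inference, here is a minimal k-nearest-neighbors classifier (the data and `k` are made up for illustration):

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    # Distance from the query point to every stored training example
    dists = np.linalg.norm(train_X - query, axis=1)
    # Majority vote among the labels of the k nearest neighbors
    nearest = train_y[np.argsort(dists)[:k]]
    return np.bincount(nearest).argmax()

# Toy data: two 1-D clusters labeled 0 and 1
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([0.15])))  # → 0
```

Note that no function is ever fit: the prediction at the point of interest is derived directly from the stored examples, which is the transductive idea.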

Transduction in sequence prediction

  • a transducer is narrowly defined as a model that outputs one time step for each input time step provided
    • this maps to a linguistic usage
      • with finite-state transducers

        treat an RNN as a transducer

        • producing output for each input it reads in

conditioned generation, such as Encoder-Decoder architecture

  • is considered a special case of the RNN transducer

More generally, transduction is used in NLP sequence prediction tasks for translation

  • a bit more relaxed than the strict one-output-per-input mapping of an FST

"many ML tasks can be expressed as the transformation–or transduction– of input sequences into output sequences"

Recurrent models

  • RNNs, LSTMs, and gated RNNs are state-of-the-art approaches to sequence modeling and transduction problems

  • recurrent models typically factor computation along the symbol positions of the input and output sequences

  • aligning the positions to steps in computation time

    • they generate a sequence of hidden states \(h_t\)
      • as a function of the previous hidden state \(h_{t-1}\)
        • and the input for position \(t\)
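The recurrence above can be sketched as follows (a hypothetical vanilla RNN cell with a tanh activation; the weight names are my own):

```python
import numpy as np

def rnn_states(xs, W_h, W_x, b):
    # h_t = tanh(W_h h_{t-1} + W_x x_t + b): each hidden state is a
    # function of the previous hidden state and the input at position t
    h = np.zeros(W_h.shape[0])
    states = []
    for x in xs:  # inherently sequential: step t needs step t-1
        h = np.tanh(W_h @ h + W_x @ x + b)
        states.append(h)
    return states

rng = np.random.default_rng(0)
W_h, W_x, b = rng.standard_normal((4, 4)), rng.standard_normal((4, 3)), np.zeros(4)
states = rnn_states([rng.standard_normal(3) for _ in range(5)], W_h, W_x, b)
print(len(states), states[-1].shape)  # 5 (4,)
```

The loop makes the sequential-computation constraint explicit: the positions of one sequence cannot be processed in parallel, which is the limitation the Transformer removes.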

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models

  • allowing modeling dependencies
    • without regard to their distance in the input or output sequences

such attention mechanisms are used in conjunction with a recurrent network

the Transformer is a model architecture that refrains from recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations

Model Architecture

  • Most competitive neural sequence transduction models have an encoder-decoder structure

Here, the encoder maps an input sequence of symbol representations \((x_1, \dots, x_n)\) to a sequence of continuous representations \(z = (z_1, \dots, z_n)\). Given \(z\), the decoder then generates an output sequence \((y_1, \dots, y_m)\) of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next

The Transformer follows this overall architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder.

Encoder

  • The encoder is composed of a stack of N = 6 identical layers.
    • Each layer has two sub-layers.
      • The first is a multi-head self-attention mechanism,
        • and the second is a simple, positionwise fully connected feed-forward network.
  • We employ a residual connection around each of the two sub-layers, followed by layer normalization
    • That is, the output of each sub-layer is \(LayerNorm(x + Sublayer(x))\), where Sublayer(x) is the function implemented by the sub-layer itself.
  • To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers,
    • produce outputs of dimension \(d_{model} = 512\)
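The sub-layer wiring can be sketched numerically (a simplified layer norm without the learned gain and bias; the stand-in sub-layer is arbitrary):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the last (feature) dimension
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_connection(x, sublayer):
    # LayerNorm(x + Sublayer(x)): residual connection, then normalization
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(0).standard_normal((2, 512))  # d_model = 512
out = sublayer_connection(x, lambda v: 0.1 * v)  # stand-in for attention/FFN
print(out.shape)  # (2, 512)
```

Because every sub-layer preserves the \(d_{model} = 512\) shape, the residual addition `x + sublayer(x)` is always well-defined.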

Decoder

  • The decoder is also composed of a stack of N = 6 identical layers.

    • In addition to the two sub-layers in each encoder layer,
      • the decoder inserts a third sub-layer,
        • which performs multi-head attention over the output of the encoder stack
  • Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.

  • We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions.

    • This masking, combined with the fact that the output embeddings are offset by one position,
      • ensures that the predictions for position i can depend only on the known outputs at positions less than i

Attention

  • An attention function can be described as mapping a query and a set of key-value pairs to an output,
    • where the query, keys, values, and output are all vectors.
    • the output is computed as a weighted sum of the values
      • where the weight assigned to each value
        • is computed by a compatibility function of the query with the corresponding key

Scaled Dot-Product Attention

  • input consists of queries and keys of dimension \(d_k\) and values of dimension \(d_v\)
    • we compute the dot products of the query with all the keys
      • divide each by \(\sqrt{d_k}\)
        • and apply a softmax function to obtain the weights on the values

Practice

compute the attention function on a set of queries simultaneously

  • packed together into a matrix \(Q\)

the keys and values are also packed together

  • in matrices \(K\) and \(V\)

matrix of the output is computed as:

\(\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\)

the two most commonly used attention functions are additive attention and dot-product (multiplicative) attention

  • dot-product attention is identical to the function above, except for the scaling factor of \(\frac{1}{\sqrt{d_k}}\)

    additive attention computes the compatibility function using a feed-forward network with a single hidden layer

    • dot-product attention is much faster in practice, since it can use highly optimized matrix multiplication
      • without the scaling, though, additive attention outperforms it for larger values of \(d_k\)
        • large dot products push the softmax function into regions where it has extremely small gradients
          • the scaling of the dot product counteracts this

to illustrate why the dot products get large

  • assume that the components of \(q\) and \(k\) are independent random variables
    • with mean 0 and variance 1
      • their dot product, \(q \cdot k = \sum^{d_k}_{i=1}q_ik_i\) has mean 0 and variance \(d_k\)
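Putting the pieces together, a minimal NumPy sketch of scaled dot-product attention (the shapes and data are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaling counteracts large dot products
    weights = softmax(scores)        # each row sums to 1
    return weights @ V               # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 64))   # 3 queries of dimension d_k = 64
K = rng.standard_normal((5, 64))   # 5 keys of dimension d_k
V = rng.standard_normal((5, 64))   # 5 values of dimension d_v = 64
print(attention(Q, K, V).shape)    # (3, 64)
```

Each output row is a convex combination of the value rows, weighted by the query's compatibility with each key.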

Multi-head Attention

  • instead of a single attention function with \(d_{model}\)-dimensional keys, values, and queries
    • it is beneficial to linearly project them \(h\) times
      • with different learned projections to \(d_k\), \(d_k\), and \(d_v\) dimensions, respectively
        • we then perform the attention function on each projection (in parallel)
          • yielding \(d_v\)-dimensional output values
          these are concatenated and once again projected
          • resulting in the final values
  • multihead attention allows the model to jointly attend to information
    • from different representation subspaces at different positions
      • with a single attention, averaging inhibits this

        \(MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O\)

        • where \({head}_i = Attention(QW^Q_i,KW^K_i,VW^V_i)\)
  • where the projections are parameter matrices \(W^Q_i \in \mathbb{R}^{d_{model}\times d_k}, W^K_i \in \mathbb{R}^{d_{model}\times d_k},W^V_i \in \mathbb{R}^{d_{model}\times d_k}\)
    • and \(W^O \in \mathbb{R}^{hd_v \times d_{model}}\)

Practice

Transformer architecture employs \(h = 8\) parallel attention layers, or heads

  • for each of these we use \(d_k = d_v = d_{model} / h = 64\)
  • due to the reduced dimension of each head,
    • the total computation cost is similar to that of single-head attention
      • with full dimensionality
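A sketch of the full multi-head computation with the dimensions above (random weights stand in for the learned projections \(W^Q_i\), \(W^K_i\), \(W^V_i\), \(W^O\)):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head(Q, K, V, WQ, WK, WV, WO):
    # Project h times, attend in parallel, concatenate, project back
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_v = d_model // h  # 64
n = 10  # sequence length
X = rng.standard_normal((n, d_model))
WQ = 0.02 * rng.standard_normal((h, d_model, d_k))
WK = 0.02 * rng.standard_normal((h, d_model, d_k))
WV = 0.02 * rng.standard_normal((h, d_model, d_v))
WO = 0.02 * rng.standard_normal((h * d_v, d_model))
out = multi_head(X, X, X, WQ, WK, WV, WO)  # self-attention: Q = K = V
print(out.shape)  # (10, 512)
```

Since each head works in 64 dimensions rather than 512, the eight heads together cost roughly the same as one full-dimensional attention.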

Application of Attention in Transformers

Transformers use multi-head attention in

  • encoder-decoder attention
    • the queries come from the previous decoder layer
      • and the memory keys and values come from the output of the encoder

this enables every position in the decoder to attend over all positions in the input sequence; it mimics the typical encoder-decoder attention mechanisms

  • in sequence-to-sequence models
  • the encoder contains self-attention layers
    • in a self attention layer all of the keys, values, and queries
      • come from the same place
        • in this case, the output of the previous layer in the encoder
      • each position in the encoder can attend to all positions in the previous layer of the encoder
  • self-attention layers in the decoder
    • allow each position in the decoder to attend to all positions in the decoder
      • up to and including that position
      • we prevent leftward information flow in the decoder to preserve the auto-regressive property
    • this is implemented inside of scaled dot-product attention
      • by masking out (setting to negative infinity) all values in the input
        • of the softmax which correspond to illegal connections
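The masking can be sketched directly on the score matrix (a hypothetical 4-position example; the zero scores stand in for \(QK^T/\sqrt{d_k}\)):

```python
import numpy as np

n = 4
scores = np.zeros((n, n))  # stand-in for the raw attention scores
# Illegal connections: position i attending to any position j > i
illegal = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[illegal] = -np.inf  # exp(-inf) = 0, so softmax assigns zero weight

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

weights = softmax(scores)
print(weights[0])  # position 0 attends only to itself
```

Setting the masked scores to negative infinity before the softmax, rather than zeroing the weights afterwards, keeps each row a proper probability distribution over the legal positions.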

Position-wise Feed Forward Networks

  • in addition to the attention sub-layers
    • each layer in the encoder and decoder contains a fully connected feed-forward network
      • which is applied to each position separately and identically
        • this consists of linear transformations with a ReLU activation function in between

          \(FFN(x)=max(0,xW_1+b_1)W_2+b_2\)

          • linear transformations are the same across different positions
            • they use different parameters from layer to layer

Practice

  • we can describe this as two convolutions with kernel size 1
    • the dimensionality of input and output is \(d_{model} = 512\)
      • and the inner-layer has dimensionality \(d_{ff} = 2048\)
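The position-wise FFN is a direct translation of the formula (random weights with the stated dimensions; illustrative only):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently at each position
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
W1, b1 = 0.02 * rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = 0.02 * rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
x = rng.standard_normal((10, d_model))  # 10 positions
out = ffn(x, W1, b1, W2, b2)
print(out.shape)  # (10, 512): the same transformation at every position
```

Because the matrix multiply broadcasts over the position axis, every position passes through identical weights, which is exactly the kernel-size-1 convolution view.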

Embedding and Softmax

  • like most sequence transduction models we use learned embeddings
    • to convert the input tokens and output tokens to vectors of dimension \(d_{model}\)
  • we also use the learned linear transformation and softmax function
    • to convert the decoder output to predict next-token probabilities
  • in our model, we share the same weight matrix between
    • the two embedding layers and the pre-softmax linear transformation

    • in the embedding layers, we multiply those weights by \(\sqrt{d_{model}}\)

Positional Encoding

  • Transformers contain no recurrence or convolution
    • to make the model make use of the order of the sequence
      • we inject some information about the relative or absolute position
        • of the tokens in the sequence
  • we add positional encodings to the input embeddings at the bottoms of the encoder and decoder stacks
    • positional encodings have the same dimension \(d_{model}\) as the embeddings
      • so the two can be summed

there are many choices of positional encodings, learned and fixed

  • we use sine and cosine functions of different frequencies:

    \(PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}})\)

    \(PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})\)

    • where \(pos\) is the position and \(i\) is the dimension
      • each dimension of the positional encoding corresponds to a sinusoid
    • the wavelengths form a geometric progression from \(2\pi\) to 10000 \(\cdot 2\pi\)
      • this is chosen to allow the model to easily learn to attend
        • by relative positions
          • since for any fixed offset \(k,PE_{pos+k}\)
          • can be represented as a linear function of \(PE_{pos}\)

the authors experimented with learned positional embeddings and found they produced nearly identical results

  • the sinusoidal version may allow the model to extrapolate to sequence lengths longer
    • than the ones encountered during training
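The sinusoidal encoding can be generated directly from the formulas above (`max_len` is an arbitrary choice for illustration):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]      # positions 0..max_len-1
    i = np.arange(0, d_model, 2)[None, :]  # even dimension indices 2i
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get the sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get the cosine
    return pe

pe = positional_encoding(100, 512)
print(pe.shape)  # (100, 512): same dimension as the embeddings, so they can be summed
```

Each column pair is a sinusoid of a fixed frequency, and the frequencies decay geometrically across dimensions, giving the \(2\pi\) to \(10000 \cdot 2\pi\) wavelength progression.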

Why Self-Attention

  • the total computational complexity per layer
  • the amount of computation that can be parallelized
    • as measured by the minimum number of sequential operations required
  • the path length between long-range dependencies in the network

    learning long-range dependencies is a key challenge in sequence transduction tasks
    • one key factor affecting the ability to learn such dependencies is the length
      • of the paths forward and backward signals have to traverse in the network
    • the shorter these paths between any combination of positions in the input and output sequences
      • the easier it is to learn long-range dependencies

they also compare the max path length between any two input/output positions in networks composed of the different layer types

a self-attention layer connects all positions with a constant number of sequentially executed operations

  • whereas a recurrent layer requires \(O(n)\) sequential operations

in terms of complexity, self-attention layers are faster than recurrent layers

  • when the sequence length \(n\) is smaller than the representation dimensionality \(d\)
    • which is most often the case with sentence representations used by state-of-the-art models in machine translation
  • to improve computational performance for tasks involving very long sentences
    • self-attention could be restricted to considering only a neighborhood of size \(r\) in the input sequence centered around the respective output position
      • this increases the max path length to \(O(n/r)\)

a single convolutional layer with kernel width \(k < n\) does not connect all pairs of input and output positions

  • doing so requires \(O(n/k)\) convolutional layers in the case of contiguous kernels
    • or \(O(\log_k(n))\) in the case of dilated convolutions, increasing the length of the longest paths between any two positions in the network

Convolutional layers are generally more expensive than recurrent layers, by a factor of \(k\).

separable convolutions decrease the complexity considerably to \(O(k\cdot n \cdot d + n \cdot d^2)\)

  • even with \(k = n\), the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer

additionally, self-attention could yield more interpretable models