Attention Is All You Need

  • sequence transduction models
    • transduction/transductive learning
      • transducer is a general name for a component that converts energy from one form to another, e.g. sound into an electrical signal, or vice versa
    • in general, transduction is about converting a signal into another form

Transductive learning

  • a concept from statistical learning theory: predicting the values of specific target examples directly from given examples in the domain

    inductive learning - deriving a function from given data

    deductive learning - deriving the values of a given function for points of interest

    transductive learning - deriving the values of an unknown function for points of interest from given data

    interesting framing of supervised learning

    • approximating a mapping function from data and using it to make a prediction

      the model of estimating the value of a function at a given point of interest describes a new concept of inference

      • moving from the particular to the particular

        • transductive inference

        when one would like to get the best result from a restricted amount of information

        transduction is naturally related to a set of algorithms known as instance-based learning

        • k-nearest neighbors is an example of this type of learning
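As a sketch of this instance-based, "particular to particular" style of inference, here is a minimal k-nearest-neighbors classifier (the data and `k` are made up for illustration):

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    # Distance from the query point to every stored training example
    dists = np.linalg.norm(train_X - query, axis=1)
    # Majority vote among the labels of the k nearest neighbors
    nearest = train_y[np.argsort(dists)[:k]]
    return np.bincount(nearest).argmax()

# Toy data: two 1-D clusters labeled 0 and 1
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([0.15])))  # → 0
```

Note that no function is ever fit: the prediction at the point of interest is derived directly from the stored examples, which is the transductive idea.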

Transduction in sequence prediction

  • a transducer is narrowly defined as a model that outputs one time step for each input time step provided
    • this maps to a linguistic usage
      • with finite-state transducers

        treat an RNN as a transducer

        • producing output for each input it reads in

conditioned generation, such as Encoder-Decoder architecture

  • is considered a special case of the RNN transducer

More generally, transduction is used in NLP sequence prediction tasks for translation

  • a bit more relaxed than the strict one-output-per-input mapping of an FST

"many ML tasks can be expressed as the transformation–or transduction– of input sequences into output sequences"

Recurrent models

  • RNNs, LSTMs, and gated RNNs are state-of-the-art approaches to sequence modeling and transduction problems

  • recurrent models typically factor computation along the symbol positions of the input and output sequences

  • aligning the positions to steps in computation time

    • they generate a sequence of hidden states \(h_t\)
      • as a function of the previous hidden state \(h_{t-1}\)
        • and the input for position \(t\)
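The recurrence above can be sketched as follows (a hypothetical vanilla RNN cell with a tanh activation; the weight names are my own):

```python
import numpy as np

def rnn_states(xs, W_h, W_x, b):
    # h_t = tanh(W_h h_{t-1} + W_x x_t + b): each hidden state is a
    # function of the previous hidden state and the input at position t
    h = np.zeros(W_h.shape[0])
    states = []
    for x in xs:  # inherently sequential: step t needs step t-1
        h = np.tanh(W_h @ h + W_x @ x + b)
        states.append(h)
    return states

rng = np.random.default_rng(0)
W_h, W_x, b = rng.standard_normal((4, 4)), rng.standard_normal((4, 3)), np.zeros(4)
states = rnn_states([rng.standard_normal(3) for _ in range(5)], W_h, W_x, b)
print(len(states), states[-1].shape)  # 5 (4,)
```

The loop makes the sequential-computation constraint explicit: the positions of one sequence cannot be processed in parallel, which is the limitation the Transformer removes.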

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models

  • allowing modeling dependencies
    • without regard to their distance in the input or output sequences

such attention mechanisms are used in conjunction with a recurrent network

the Transformer is a model architecture that refrains from recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations

Model Architecture

  • Most competitive neural sequence transduction models have an encoder-decoder structure

Here, the encoder maps an input sequence of symbol representations \((x_1, \dots, x_n)\) to a sequence of continuous representations \(z = (z_1, \dots, z_n)\). Given \(z\), the decoder then generates an output sequence \((y_1, \dots, y_m)\) of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next

The Transformer follows this overall architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder.

Encoder

  • The encoder is composed of a stack of N = 6 identical layers.
    • Each layer has two sub-layers.
      • The first is a multi-head self-attention mechanism,
        • and the second is a simple, positionwise fully connected feed-forward network.
  • We employ a residual connection around each of the two sub-layers, followed by layer normalization
    • That is, the output of each sub-layer is \(LayerNorm(x + Sublayer(x))\), where Sublayer(x) is the function implemented by the sub-layer itself.
  • To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers,
    • produce outputs of dimension \(d_{model} = 512\)
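The sub-layer wiring can be sketched numerically (a simplified layer norm without the learned gain and bias; the stand-in sub-layer is arbitrary):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the last (feature) dimension
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_connection(x, sublayer):
    # LayerNorm(x + Sublayer(x)): residual connection, then normalization
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(0).standard_normal((2, 512))  # d_model = 512
out = sublayer_connection(x, lambda v: 0.1 * v)  # stand-in for attention/FFN
print(out.shape)  # (2, 512)
```

Because every sub-layer preserves the \(d_{model} = 512\) shape, the residual addition `x + sublayer(x)` is always well-defined.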

Decoder

  • The decoder is also composed of a stack of N = 6 identical layers.

    • In addition to the two sub-layers in each encoder layer,
      • the decoder inserts a third sub-layer,
        • which performs multi-head attention over the output of the encoder stack
  • Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.

  • We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions.

    • This masking, combined with the fact that the output embeddings are offset by one position,
      • ensures that the predictions for position i can depend only on the known outputs at positions less than i

Attention

  • An attention function can be described as mapping a query and a set of key-value pairs to an output,
    • where the query, keys, values, and output are all vectors.
    • the output is computed as a weighted sum of the values
      • where the weight assigned to each value
        • is computed by a compatibility function of the query with the corresponding key

Scaled Dot-Product Attention

  • input consists of queries and keys of dimension \(d_k\) and values of dimension \(d_v\)
    • we compute the dot products of the query with all the keys
      • divide each by \(\sqrt{d_k}\)
        • and apply a softmax function to obtain the weights on the values

Practice

compute the attention function on a set of queries simultaneously

  • packed together into a matrix \(Q\)

the keys and values are also packed together

  • in matrices \(K\) and \(V\)

matrix of the output is computed as:

\(\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\)

the two most commonly used attention functions are additive attention and dot-product (multiplicative) attention

  • dot-product attention is identical to the function above, except for the scaling factor of \(\frac{1}{\sqrt{d_k}}\)

    additive attention computes the compatibility function using a feed-forward network with a single hidden layer

    • dot-product attention is much faster in practice, since it can use highly optimized matrix multiplication
      • without the scaling, though, additive attention outperforms it for larger values of \(d_k\)
        • large dot products push the softmax function into regions where it has extremely small gradients
          • the scaling of the dot product counteracts this

to illustrate why the dot products get large

  • assume that the components of \(q\) and \(k\) are independent random variables
    • with mean 0 and variance 1
      • their dot product, \(q \cdot k = \sum^{d_k}_{i=1}q_ik_i\) has mean 0 and variance \(d_k\)
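Putting the pieces together, a minimal NumPy sketch of scaled dot-product attention (the shapes and data are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaling counteracts large dot products
    weights = softmax(scores)        # each row sums to 1
    return weights @ V               # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 64))   # 3 queries of dimension d_k = 64
K = rng.standard_normal((5, 64))   # 5 keys of dimension d_k
V = rng.standard_normal((5, 64))   # 5 values of dimension d_v = 64
print(attention(Q, K, V).shape)    # (3, 64)
```

Each output row is a convex combination of the value rows, weighted by the query's compatibility with each key.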

Multi-head Attention

  • instead of a single attention function with \(d_{model}\)-dimensional keys, values, and queries
    • it is beneficial to linearly project them \(h\) times
      • with different learned projections to \(d_k\), \(d_k\), and \(d_v\) dimensions, respectively
        • we then perform the attention function on each projection (in parallel)
          • yielding \(d_v\)-dimensional output values
          these are concatenated and once again projected
          • resulting in the final values
  • multihead attention allows the model to jointly attend to information
    • from different representation subspaces at different positions
      • with a single attention, averaging inhibits this

        \(MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O\)

        • where \({head}_i = Attention(QW^Q_i,KW^K_i,VW^V_i)\)
  • where the projections are parameter matrices \(W^Q_i \in \mathbb{R}^{d_{model}\times d_k}, W^K_i \in \mathbb{R}^{d_{model}\times d_k},W^V_i \in \mathbb{R}^{d_{model}\times d_k}\)
    • and \(W^O \in \mathbb{R}^{hd_v \times d_{model}}\)

Practice

Transformer architecture employs \(h = 8\) parallel attention layers, or heads

  • for each of these we use \(d_k = d_v = d_{model} / h = 64\)
  • due to the reduced dimension of each head,
    • the total computation cost is similar to that of single-head attention
      • with full dimensionality
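A sketch of the full multi-head computation with the dimensions above (random weights stand in for the learned projections \(W^Q_i\), \(W^K_i\), \(W^V_i\), \(W^O\)):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head(Q, K, V, WQ, WK, WV, WO):
    # Project h times, attend in parallel, concatenate, project back
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_v = d_model // h  # 64
n = 10  # sequence length
X = rng.standard_normal((n, d_model))
WQ = 0.02 * rng.standard_normal((h, d_model, d_k))
WK = 0.02 * rng.standard_normal((h, d_model, d_k))
WV = 0.02 * rng.standard_normal((h, d_model, d_v))
WO = 0.02 * rng.standard_normal((h * d_v, d_model))
out = multi_head(X, X, X, WQ, WK, WV, WO)  # self-attention: Q = K = V
print(out.shape)  # (10, 512)
```

Since each head works in 64 dimensions rather than 512, the eight heads together cost roughly the same as one full-dimensional attention.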

Application of Attention in Transformers

Transformers use multi-head attention in

  • encoder-decoder attention
    • the queries come from the previous decoder layer
      • and the memory keys and values come from the output of the encoder

this enables every position in the decoder to attend over all positions in the input sequence; it mimics the typical encoder-decoder attention mechanisms

  • in sequence-to-sequence models
  • the encoder contains self-attention layers
    • in a self attention layer all of the keys, values, and queries
      • come from the same place
        • in this case, the output of the previous layer in the encoder
      • each position in the encoder can attend to all positions in the previous layer of the encoder
  • self-attention layers in the decoder
    • allow each position in the decoder to attend to all positions in the decoder
      • up to and including that position
      • we prevent leftward information flow in the decoder to preserve the auto-regressive property
    • this is implemented inside of scaled dot-product attention
      • by masking out (setting to negative infinity) all values in the input
        • of the softmax which correspond to illegal connections
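The masking can be sketched directly on the score matrix (a hypothetical 4-position example; the zero scores stand in for \(QK^T/\sqrt{d_k}\)):

```python
import numpy as np

n = 4
scores = np.zeros((n, n))  # stand-in for the raw attention scores
# Illegal connections: position i attending to any position j > i
illegal = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[illegal] = -np.inf  # exp(-inf) = 0, so softmax assigns zero weight

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

weights = softmax(scores)
print(weights[0])  # position 0 attends only to itself
```

Setting the masked scores to negative infinity before the softmax, rather than zeroing the weights afterwards, keeps each row a proper probability distribution over the legal positions.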

Position-wise Feed Forward Networks

  • in addition to the attention sub-layers
    • each layer in the encoder and decoder contains a fully connected feed-forward network
      • which is applied to each position separately and identically
        • this consists of linear transformations with a ReLU activation function in between

          \(FFN(x)=max(0,xW_1+b_1)W_2+b_2\)

          • linear transformations are the same across different positions
            • they use different parameters from layer to layer

Practice

  • we can describe this as two convolutions with kernel size 1
    • the dimensionality of input and output is \(d_{model} = 512\)
      • and the inner-layer has dimensionality \(d_{ff} = 2048\)
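The position-wise FFN is a direct translation of the formula (random weights with the stated dimensions; illustrative only):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently at each position
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
W1, b1 = 0.02 * rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = 0.02 * rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
x = rng.standard_normal((10, d_model))  # 10 positions
out = ffn(x, W1, b1, W2, b2)
print(out.shape)  # (10, 512): the same transformation at every position
```

Because the matrix multiply broadcasts over the position axis, every position passes through identical weights, which is exactly the kernel-size-1 convolution view.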

Embedding and Softmax

  • like most sequence transduction models we use learned embeddings
    • to convert the input tokens and output tokens to vectors of dimension \(d_{model}\)
  • we also use the learned linear transformation and softmax function
    • to convert the decoder output to predict next-token probabilities
  • in our model, we share the same weight matrix between
    • the two embedding layers and the pre-softmax linear transformation

    • in the embedding layers, we multiply those weights by \(\sqrt{d_{model}}\)

Positional Encoding

  • Transformers contain no recurrence or convolution
    • to make the model make use of the order of the sequence
      • we inject some information about the relative or absolute position
        • of the tokens in the sequence
  • we add positional encodings to the input embeddings at the bottoms of the encoder and decoder stacks
    • positional encodings have the same dimension \(d_{model}\) as the embeddings
      • so the two can be summed

there are many choices of positional encodings, learned and fixed

  • we use sine and cosine functions of different frequencies:

    \(PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}})\)

    \(PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})\)

    • where \(pos\) is the position and \(i\) is the dimension
      • each dimension of the positional encoding corresponds to a sinusoid
    • the wavelengths form a geometric progression from \(2\pi\) to 10000 \(\cdot 2\pi\)
      • this is chosen to allow the model to easily learn to attend
        • by relative positions
          • since for any fixed offset \(k,PE_{pos+k}\)
          • can be represented as a linear function of \(PE_{pos}\)

the authors experimented with learned positional embeddings and found they produced nearly identical results

  • the sinusoidal version may allow the model to extrapolate to sequence lengths longer
    • than the ones encountered during training
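The sinusoidal encoding can be generated directly from the formulas above (`max_len` is an arbitrary choice for illustration):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]      # positions 0..max_len-1
    i = np.arange(0, d_model, 2)[None, :]  # even dimension indices 2i
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get the sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get the cosine
    return pe

pe = positional_encoding(100, 512)
print(pe.shape)  # (100, 512): same dimension as the embeddings, so they can be summed
```

Each column pair is a sinusoid of a fixed frequency, and the frequencies decay geometrically across dimensions, giving the \(2\pi\) to \(10000 \cdot 2\pi\) wavelength progression.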

Why Self-Attention

  • the total computational complexity per layer
  • the amount of computation that can be parallelized
    • as measured by the minimum number of sequential operations required
  • the path length between long-range dependencies in the network

    learning long-range dependencies is a key challenge in sequence transduction tasks
    • one key factor affecting the ability to learn such dependencies is the length
      • of the paths forward and backward signals have to traverse in the network
    • the shorter these paths between any combination of positions in the input and output sequences
      • the easier it is to learn long-range dependencies

they also compare the max path length between any two input/output positions in networks composed of the different layer types

a self-attention layer connects all positions with a constant number of sequentially executed operations

  • whereas a recurrent layer requires \(O(n)\) sequential operations

in terms of complexity, self-attention layers are faster than recurrent layers

  • when the sequence length \(n\) is smaller than the representation dimensionality \(d\)
    • which is most often the case with sentence representations used by state-of-the-art models in machine translation
  • to improve computational performance for tasks involving very long sentences
    • self-attention could be restricted to considering only a neighborhood of size \(r\) in the input sequence centered around the respective output position
      • this increases the max path length to \(O(n/r)\)

a single convolutional layer with kernel width \(k < n\) does not connect all pairs of input and output positions

  • doing so requires \(O(n/k)\) convolutional layers in the case of contiguous kernels
    • or \(O(\log_k(n))\) in the case of dilated convolutions, increasing the length of the longest paths between any two positions in the network

Convolutional layers are generally more expensive than recurrent layers, by a factor of \(k\).

separable convolutions decrease the complexity considerably to \(O(k\cdot n \cdot d + n \cdot d^2)\)

  • even with \(k = n\), the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer

additionally, self-attention could yield more interpretable models