Transformer Architecture

Introduction

Figure 1. Transformer Encoder-Decoder Architecture (Vaswani, Shazeer, Parmar et al. 2017). The model has two high-level components: the encoder (left) and the decoder (right). The encoder receives Inputs and passes its processed values to the decoder through a process called cross-attention. The decoder receives prior Outputs and generates probabilities for the next token.

The Transformer architecture was introduced by Vaswani, Shazeer, Parmar et al. (2017) in the seminal paper Attention Is All You Need as a text-to-text model. Variations of the transformer architecture have been the basis for many state-of-the-art language and vision models since its inception in 2017. This page is a high-level summary of the important components and concepts in the transformer architecture and in training transformer-based models. Topic pages are works in progress and are updated to track developments.

Model Components

The transformer architecture, as introduced by Vaswani et al. (2017), has two high-level components: the encoder and the decoder (see Figure 1). The encoder and decoder are each composed of a series of blocks that share the same design. The major sub-components of the transformer blocks are:

Attention
Normalization
Feed-Forward Networks or Multi-Layer Perceptrons
Residual Connections
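
To make the list above concrete, the following is a minimal NumPy sketch of how the four sub-components might fit together in one encoder-style block. It is an illustration under stated assumptions, not a reference implementation: it uses a single attention head, no masking, no learned normalization parameters, and applies normalization after each residual connection (the ordering used by Vaswani et al. 2017; other variants normalize first). The function names, the params dictionary, and the sizes are all hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalization: rescale each token vector to zero mean and unit variance.
    # (Learned scale and shift parameters are omitted for brevity.)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Attention: every token builds a weighted average over all tokens.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # scaled dot-product
    return softmax(scores) @ v

def feed_forward(x, w1, w2):
    # Feed-forward / MLP: applied to each token independently, with ReLU.
    return np.maximum(x @ w1, 0.0) @ w2

def transformer_block(x, params):
    # Residual connections: each sub-layer output is added to its input,
    # then normalized (post-norm ordering, an assumption of this sketch).
    attn = self_attention(x, params["w_q"], params["w_k"], params["w_v"])
    x = layer_norm(x + attn)
    x = layer_norm(x + feed_forward(x, params["w1"], params["w2"]))
    return x

# Hypothetical sizes chosen only for the demonstration.
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32
params = {
    "w_q": rng.normal(size=(d_model, d_model)) * 0.1,
    "w_k": rng.normal(size=(d_model, d_model)) * 0.1,
    "w_v": rng.normal(size=(d_model, d_model)) * 0.1,
    "w1": rng.normal(size=(d_model, d_ff)) * 0.1,
    "w2": rng.normal(size=(d_ff, d_model)) * 0.1,
}
x = rng.normal(size=(seq_len, d_model))  # stand-in for embedded input tokens
y = transformer_block(x, params)
print(y.shape)
```

Because the block maps a (seq_len, d_model) array to an array of the same shape, blocks of this design can be stacked in series, which is what the encoder and decoder do.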

Transformer architectures can also include components for:

Token Embeddings
Positional Encoding
Model Head
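
As a rough illustration of the three components listed above, the sketch below embeds a few token ids, adds positional information, and maps the result to next-token probabilities with a linear model head. Several choices here are assumptions made for the example: the sinusoidal encoding is the fixed scheme from Vaswani et al. (2017), one of several positional encoding options; d_model is assumed even; the transformer blocks that would normally sit between the embeddings and the head are skipped; and all names, sizes, and token ids are hypothetical.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # Fixed sinusoidal positional encoding; assumes d_model is even.
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical vocabulary and model sizes.
rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8
embedding_table = rng.normal(size=(vocab_size, d_model)) * 0.1
w_head = rng.normal(size=(d_model, vocab_size)) * 0.1

token_ids = np.array([3, 1, 4])  # made-up token ids

# Token embeddings: look up one vector per token id.
token_vectors = embedding_table[token_ids]

# Positional encoding: add position information to each token vector.
pe = sinusoidal_positions(len(token_ids), d_model)
hidden = token_vectors + pe

# (Transformer blocks would process `hidden` here; skipped in this sketch.)

# Model head: project to vocabulary size and normalize to probabilities.
probs = softmax(hidden @ w_head)
print(probs.shape)
```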

Training

Training is the process of adapting the model to achieve a task such as text generation or translation. Training transformer-based models typically requires large amounts of data, computer hardware for performing training, and software packages for calculating the error or loss and the loss gradients with respect to model parameters. Different transformer architectures are also suited to different training objectives. Topics include:

Data
Training Objectives
Loss Functions
Optimization
Hardware Acceleration
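
The loss and gradient calculation mentioned in this section can be sketched with a toy example. The code below assumes a cross-entropy loss over next-token targets and plain gradient descent, which are common choices but not the only ones. To keep the gradient derivable by hand, it trains only a linear output layer on fixed random vectors standing in for transformer outputs; in practice a software package with automatic differentiation would compute gradients for every parameter. All names, sizes, targets, and the learning rate are hypothetical.

```python
import numpy as np

def cross_entropy(logits, targets):
    # Returns the mean cross-entropy loss and its gradient w.r.t. the logits.
    n = len(targets)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    loss = -log_probs[np.arange(n), targets].mean()
    grad = np.exp(log_probs)            # softmax probabilities
    grad[np.arange(n), targets] -= 1.0  # subtract the one-hot targets
    return loss, grad / n

# Hypothetical sizes and data.
rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8
hidden = rng.normal(size=(5, d_model))  # stand-in for transformer outputs
targets = np.array([1, 4, 2, 7, 3])     # made-up "next token" ids
w_out = rng.normal(size=(d_model, vocab_size)) * 0.1

learning_rate = 0.1
losses = []
for step in range(100):
    logits = hidden @ w_out
    loss, grad_logits = cross_entropy(logits, targets)
    losses.append(loss)
    # Chain rule: gradient of the loss w.r.t. the output weights.
    grad_w = hidden.T @ grad_logits
    # Optimization: one step of plain gradient descent.
    w_out -= learning_rate * grad_w

print(losses[0] > losses[-1])
```

Real training replaces the fixed vectors with the model's own outputs on batches of data, and typically uses an adaptive optimizer and hardware accelerators rather than this hand-written loop.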