Transformer Architecture
Introduction
The Transformer architecture was introduced by Vaswani, Shazeer, Parmar et al. (2017) in the seminal paper Attention Is All You Need as a sequence-to-sequence (text-to-text) model. Variations of the transformer architecture have been the basis for many state-of-the-art language and vision models since its inception in 2017. This page is a high-level summary of important components and concepts in the transformer architecture and in training transformer-based models. Topic pages are ongoing works in progress that track developments.
Model Components
The transformer architecture, as introduced by Vaswani et al. (2017), has two high-level components: the encoder and the decoder (see Figure 1). The encoder and decoder are each composed of a series of blocks of the same design. The major sub-components of a transformer block are listed below, followed by a minimal code sketch of a single block:
- Attention
- Normalization
- Feed-Forward Networks or Multi-Layer Perceptrons
- Residual Connections
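As an illustration only, the following is a minimal sketch of one encoder block, assuming PyTorch and the post-norm ("Add & Norm") layout of Vaswani et al. (2017). The hyperparameter names and values (d_model, n_heads, d_ff, dropout) are illustrative defaults, not part of this page's source.

```python
# A minimal sketch of one encoder block, assuming PyTorch and the post-norm
# ("Add & Norm") layout of Vaswani et al. (2017). The hyperparameter values
# (d_model, n_heads, d_ff, dropout) are illustrative defaults, not requirements.
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Multi-head self-attention sub-layer
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        # Position-wise feed-forward sub-layer (a two-layer MLP)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Residual connection around attention, then layer normalization
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Residual connection around the feed-forward network, then normalization
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x
```

For example, EncoderBlock()(torch.randn(2, 16, 512)) maps a batch of two 16-token sequences of 512-dimensional vectors to an output of the same shape.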
Transformer architectures can also include components for the following (a combined sketch of these appears after the list):
- Token Embeddings
- Positional Encoding
- Model Head
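As a hedged illustration, the sketch below combines learned token embeddings, the fixed sinusoidal positional encoding described in the original paper, and a linear model head for language modeling. It assumes PyTorch; vocab_size, d_model, and max_len are placeholder values.

```python
# A minimal sketch of learned token embeddings plus the fixed sinusoidal
# positional encoding from Vaswani et al. (2017), and a linear model head
# for language modeling. Assumes PyTorch; vocab_size, d_model, and max_len
# are placeholder values.
import math

import torch
import torch.nn as nn


class TokenAndPositionEmbedding(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, max_len=2048):
        super().__init__()
        # Learned lookup table mapping token ids to d_model-dimensional vectors
        self.tok = nn.Embedding(vocab_size, d_model)
        # Precompute fixed sinusoidal positional encodings
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, token_ids):                  # (batch, seq_len) of token ids
        x = self.tok(token_ids)                    # (batch, seq_len, d_model)
        return x + self.pe[: token_ids.size(1)]    # add positional information


# A model head for language modeling: project hidden states to vocabulary logits
lm_head = nn.Linear(512, 32000)
```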
Training
Training is the process of adapting the model to perform a task such as text generation or translation. Training transformer-based models typically requires large amounts of data, hardware capable of performing the computation, and software for computing the loss and its gradients with respect to the model parameters. Different transformer architectures are also suited to different training objectives. Topics include the items below; a minimal sketch of a single training step follows the list:
- Data
- Training Objectives
- Loss Functions
- Optimization
- Hardware Acceleration
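For concreteness, the following is a minimal sketch of a single training step under a next-token (causal language modeling) objective, assuming PyTorch. The model, batch format, optimizer, and hyperparameters are placeholders rather than a prescribed recipe.

```python
# A minimal sketch of a single training step for a next-token (causal language
# modeling) objective, assuming PyTorch. The model, batch format, optimizer,
# and hyperparameters are placeholders, not a prescribed recipe.
import torch
import torch.nn as nn


def train_step(model, batch, optimizer):
    token_ids = batch["input_ids"]                 # (batch, seq_len) token ids
    # Next-token objective: predict tokens 1..n from tokens 0..n-1
    logits = model(token_ids[:, :-1])              # assumed (batch, seq_len-1, vocab)
    targets = token_ids[:, 1:]
    # Cross-entropy loss between predicted logits and target token ids
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()        # gradients of the loss with respect to model parameters
    optimizer.step()       # parameter update, e.g. with Adam/AdamW
    return loss.item()
```

In practice the optimizer would be constructed once, for example torch.optim.AdamW(model.parameters(), lr=1e-4), and this step would run over batches drawn from a large training dataset, usually on hardware accelerators such as GPUs or TPUs.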