Recurrent Neural Networks/Attention is All You Need

This page collects notes on Vaswani et al. (2017), "Attention is All You Need."

Readings

What is an encoder-decoder recurrent neural network (RNN)?
What improvements does the attention mechanism provide over a plain RNN encoder-decoder?
How does the model architecture in Vaswani et al. 2017 differ from the existing applications of attention in natural language processing?
What is the difference between self-attention and regular attention, and what are the benefits of the former compared to the latter?
What is the difference between multi-head attention and regular attention, and what are the benefits of the former compared to the latter?
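
As a starting point for the last two questions, here is a minimal sketch of scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, as defined in Vaswani et al. 2017. It is an illustration under simplifying assumptions, not the paper's full model: the learned projection matrices (W^Q, W^K, W^V, W^O) are omitted, so self-attention here simply uses the input as queries, keys, and values, and each "head" attends over a slice of the feature dimension. The function names and the toy `tokens` input are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        outputs.append(
            [
                sum(w * v[j] for w, v in zip(weights, values))
                for j in range(len(values[0]))
            ]
        )
    return outputs


def self_attention(x):
    """Self-attention: queries, keys, and values all come from the same sequence.

    Assumption: no learned projections, so Q = K = V = x.
    """
    return attention(x, x, x)


def multi_head_self_attention(x, num_heads):
    """Run self-attention independently on slices of the feature dimension.

    Assumption: heads are plain slices of the input rather than learned
    projections, and the final output projection is omitted.
    """
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        chunk = [row[h * d_head:(h + 1) * d_head] for row in x]
        heads.append(self_attention(chunk))
    # Concatenate the per-head outputs for each position.
    return [sum((head[i] for head in heads), []) for i in range(len(x))]


# Toy sequence of three tokens with four features each (made-up values).
tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
single = self_attention(tokens)
multi = multi_head_self_attention(tokens, num_heads=2)
```

In regular (encoder-decoder) attention the queries come from one sequence and the keys and values from another, which corresponds to calling `attention` with different arguments. In the multi-head variant each head computes its own attention weights, so different heads can focus on different positions.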