The complete guide to implementing a Transformer from scratch
Introduction
As the title suggests, in this article I am going to implement the Transformer architecture from scratch with PyTorch, and yes, literally from scratch. Before we get into it, let me provide a brief overview of the architecture. The Transformer was first introduced in the paper "Attention Is All You Need" by Vaswani et al. in 2017 [1]. This neural network model is designed for seq2seq (Sequence-to-Sequence) tasks, where it accepts a sequence as input and returns another sequence as output, as in machine translation and question answering.
Before the Transformer was introduced, we usually relied on RNN-based models like LSTM or GRU to accomplish seq2seq tasks. These models are indeed capable of capturing context, yet they do so sequentially, one timestep at a time. This makes it challenging to capture long-range dependencies, especially when the important context lies many timesteps before the current one. In contrast, the Transformer can freely attend to any part of the sequence it considers important without being constrained by sequential processing, as illustrated in the short sketch below.
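To make that contrast concrete, here is a minimal sketch (not part of the final implementation; the tensor sizes are purely illustrative). The RNN has to carry information forward through every intermediate hidden state, while a single scaled dot-product attention step lets every position look at every other position directly.

```python
import torch
import torch.nn.functional as F

# Toy sequence: batch of 1, 5 timesteps, embedding size 8 (illustrative sizes only).
x = torch.randn(1, 5, 8)

# RNN-style processing: the hidden state is updated one timestep at a time,
# so information from early timesteps must survive every intermediate update.
rnn = torch.nn.RNN(input_size=8, hidden_size=8, batch_first=True)
rnn_out, _ = rnn(x)                              # shape: (1, 5, 8)

# Attention-style processing: every position scores every other position directly,
# so the last token can attend to the first one in a single step.
scores = x @ x.transpose(-2, -1) / (8 ** 0.5)    # shape: (1, 5, 5)
weights = F.softmax(scores, dim=-1)              # attention weights over all positions
attn_out = weights @ x                           # shape: (1, 5, 8)
```

This is the core idea we will build on throughout the rest of the article: instead of a chain of hidden states, the model computes a full matrix of pairwise attention weights and uses it to mix information across the whole sequence at once.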