Question Classification using Self-Attention Transformer — Part 1.1
Understanding the Multi-Head Self-Attention Transformer network with code in PyTorch
In this part of the blog series, we will try to understand the Encoder-Decoder architecture of the Multi-Head Self-Attention Transformer network with some code in PyTorch. There won’t be any theory involved (a better theoretical treatment can be found here), just the bare bones of the network and how one can write it from scratch in PyTorch.
The Transformer architecture is divided into two parts: the Encoder and the Decoder. Each of these is in turn built from several smaller components. Let’s start with the Encoder.
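To make the split concrete, here is a minimal structural sketch of how the two halves fit together. The Encoder and Decoder classes used here are placeholders for the modules we build in this series, and the argument names are assumptions, not the final code:

```python
import torch.nn as nn

# Rough sketch only: Encoder and Decoder are placeholders for the classes
# developed later in this post/series.
class Transformer(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder  # maps the source sequence to contextual representations
        self.decoder = decoder  # attends over the encoder output to produce the target

    def forward(self, src, trg, src_mask=None, trg_mask=None):
        enc_out = self.encoder(src, src_mask)
        return self.decoder(trg, enc_out, src_mask, trg_mask)
```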
Encoder
The Encoder is simpler than the Decoder. It consists of N EncoderLayers, and each EncoderLayer contains M Self-Attention Heads.
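Before walking through each piece, here is a sketch of how that nesting looks in PyTorch. The EncoderLayer passed in is a placeholder (it will wrap the M-head self-attention and feed-forward sub-layers built below); the stacking of N identical layers is the point being illustrated:

```python
import copy
import torch.nn as nn

# Structural sketch, assuming an EncoderLayer class that takes (x, mask)
# and returns a tensor of the same shape.
class Encoder(nn.Module):
    def __init__(self, encoder_layer, n_layers):
        super().__init__()
        # N identical EncoderLayers, each with its own parameters
        self.layers = nn.ModuleList(
            [copy.deepcopy(encoder_layer) for _ in range(n_layers)]
        )

    def forward(self, x, mask=None):
        # The output of one EncoderLayer feeds the next
        for layer in self.layers:
            x = layer(x, mask)
        return x
```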
Let’s code out every part of the Encoder:
- Encoder Module