Question Classification using Self-Attention Transformer — Part 1.1

Vatsal Saglani
5 min read · Jan 1, 2021

Understanding the Multi-Head Self-Attention Transformer network with code in PyTorch

In this part of the blog series, we will try to understand the Encoder-Decoder architecture of the Multi-Head Self-Attention Transformer network with some code in PyTorch. There won't be any theory involved (a better theoretical treatment can be found here); just the bare bones of the network and how one can write it from scratch in PyTorch.

The Transformer architecture is divided into two parts: the Encoder and the Decoder. Several smaller components combine to form each of them. Let's start with the Encoder.

Encoder

The Encoder is quite a bit simpler than the Decoder. The Encoder contains N EncoderLayers, and each EncoderLayer contains M Self-Attention Heads.

Let’s code out every part of the Encoder:

  • Encoder Module
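
The original post embeds this code as a gist; below is a minimal, self-contained sketch of what such an Encoder module might look like in PyTorch. The class names (`MultiHeadSelfAttention`, `EncoderLayer`, `Encoder`) and hyperparameter names (`embed_dim`, `num_heads`, `ff_dim`, `num_layers`) are illustrative assumptions, not necessarily the exact ones used later in the series.

```python
# A minimal sketch of the Encoder stack described above, assuming
# illustrative hyperparameter names (embed_dim, num_heads, ff_dim, num_layers).
import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product self-attention split across `num_heads` heads."""

    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # One linear projection each for queries, keys, and values
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x, mask=None):
        batch_size, seq_len, embed_dim = x.shape
        # Project and reshape to (batch, heads, seq_len, head_dim)
        q = self.q_proj(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention scores
        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        if mask is not None:
            # Optional padding mask, broadcastable to (batch, heads, seq_len, seq_len)
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = torch.softmax(scores, dim=-1)
        # Weighted sum of values, then merge the heads back together
        out = torch.matmul(attn, v).transpose(1, 2).contiguous().view(batch_size, seq_len, embed_dim)
        return self.out_proj(out)


class EncoderLayer(nn.Module):
    """One EncoderLayer: self-attention + feed-forward, each with a residual connection and LayerNorm."""

    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadSelfAttention(embed_dim, num_heads)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        x = self.norm1(x + self.dropout(self.attention(x, mask)))
        x = self.norm2(x + self.dropout(self.feed_forward(x)))
        return x


class Encoder(nn.Module):
    """Token embeddings + positional embeddings followed by N EncoderLayers."""

    def __init__(self, vocab_size, embed_dim, num_layers, num_heads, ff_dim, max_len=512):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        self.position_embedding = nn.Embedding(max_len, embed_dim)
        self.layers = nn.ModuleList(
            [EncoderLayer(embed_dim, num_heads, ff_dim) for _ in range(num_layers)]
        )

    def forward(self, tokens, mask=None):
        batch_size, seq_len = tokens.shape
        positions = torch.arange(seq_len, device=tokens.device).unsqueeze(0).expand(batch_size, seq_len)
        x = self.token_embedding(tokens) + self.position_embedding(positions)
        for layer in self.layers:
            x = layer(x, mask)
        return x
```

With hypothetical values such as `embed_dim=512`, `num_heads=8`, `ff_dim=2048`, and `num_layers=6`, calling the `Encoder` on a `(batch, seq_len)` tensor of token IDs returns a `(batch, seq_len, embed_dim)` tensor of contextual token representations, which the Decoder (covered next) attends over.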

