Question Classification using Self-Attention Transformer — Part 1

This series of blog posts walks through coding a Self-Attention Transformer from scratch in PyTorch, text classification using that Transformer, and different strategies for solving classification problems that have multiple sets of categories, each containing several classes. The series is divided into four parts.

Introduction to the dataset and steps for data preparation and pre-processing

There are many ways to achieve good accuracy when classifying a sentence or a paragraph into different categories. As the title suggests, in this series of blogs we will discuss one of the most talked-about and widely used model architectures. Instead of using the transformers library by HuggingFace or any other pre-trained models, we will code a Multi-Head Self-Attention Transformer using PyTorch. To make things more fun and somewhat more challenging, the dataset we will be training on has two sets of categories, and we will discuss and implement different approaches to build a classification model that can classify text into both sets of categories, each having several classes.

About the Dataset

The dataset we will be using is a question classification dataset. The two sets of categories describe what type of answer a question requires. You can find the dataset here.

For example, take the question "What are liver enzymes?" This question requires descriptive text, most suitably a definition. So here, the class is descriptive text and the subclass is definition.

How does the data look?

Data Pre-processing

You can do a wget with the link address of the download files to fetch the data. We will be using the tokenizers library to convert the question texts into tokens. Since the Hugging Face tokenizers library is written in Rust, it is faster than any Python implementation, which is why we are leveraging it. You can also try the BytePairEncoding library available here to convert questions to tokens, but it is much slower than the Hugging Face tokenizers.

Once you have downloaded the data, we will clean the sentences and extract our class and subclass labels using the following steps.

In total there are 6 classes.

There are 47 subclasses in total.
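As a rough sketch of these steps (assuming the raw files follow the usual TREC-style format where each line looks like `DESC:def What are liver enzymes ?`, and using a hypothetical file name for the downloaded training split), the cleaning and label extraction could look like this:

```python
import re

def load_questions(path):
    """Read a raw label file and split each line into class, subclass, and question."""
    classes, subclasses, questions = [], [], []
    with open(path, encoding="latin-1") as f:
        for line in f:
            label, question = line.strip().split(" ", 1)
            coarse, _fine = label.split(":")           # e.g. "DESC:def" -> "DESC", "def"
            # basic cleaning: lower-case and keep only letters, digits, and spaces
            question = re.sub(r"[^a-z0-9 ]", " ", question.lower())
            question = re.sub(r"\s+", " ", question).strip()
            classes.append(coarse)
            subclasses.append(label)
            questions.append(question)
    return classes, subclasses, questions

# hypothetical file name for the downloaded training split
classes, subclasses, questions = load_questions("train_5500.label")

# encode the label strings as integers (6 classes and 47 subclasses in this data)
class2id = {c: i for i, c in enumerate(sorted(set(classes)))}
subclass2id = {s: i for i, s in enumerate(sorted(set(subclasses)))}
class_ids = [class2id[c] for c in classes]
subclass_ids = [subclass2id[s] for s in subclasses]
```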

Just as we did for the labels, we will convert the question text into numbers the model can understand. For this we use a vocabulary file obtained by training a BertWordPieceTokenizer on the wikitext data; the vocabulary size is 10k. You can download it from here.
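A minimal sketch of loading that vocabulary with the Hugging Face tokenizers library (the file name vocab.txt is just a placeholder for wherever you saved the downloaded file):

```python
from tokenizers import BertWordPieceTokenizer

# load the 10k word-piece vocabulary trained on wikitext
tokenizer = BertWordPieceTokenizer("vocab.txt", lowercase=True)

encoding = tokenizer.encode("What are liver enzymes?")
print(encoding.tokens)  # word pieces, e.g. ['[CLS]', 'what', 'are', ...]
print(encoding.ids)     # the corresponding integer ids
```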

Let’s start tokenizing. We also keep a separate list that stores the number of tokens in every question. The longest question is 52 tokens long, so we can safely set the maximum sequence length to 100.
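Here is a sketch of that step, reusing the questions list and tokenizer from the snippets above and assuming the [PAD] token has id 0 in the vocabulary:

```python
import numpy as np

MAX_SEQ_LEN = 100

token_ids, lengths = [], []
for q in questions:
    ids = tokenizer.encode(q).ids
    lengths.append(len(ids))               # number of tokens in this question
    ids = ids[:MAX_SEQ_LEN]                # truncate (the longest question here is 52 tokens)
    ids += [0] * (MAX_SEQ_LEN - len(ids))  # pad with 0 ([PAD]) up to MAX_SEQ_LEN
    token_ids.append(ids)

token_ids = np.array(token_ids)            # shape: (num_questions, 100)
print("longest question:", max(lengths), "tokens")
```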

This notebook can be followed to implement all of the above.

Now our data is prepared, tokenized, and available in numerical form. You can move directly to Part 2, where we start with the model architecture and training, or you can first go through Part 1.1, where the Multi-Head Self-Attention Transformer network is built from scratch in PyTorch, and then move on. The choice is yours 😃.

The code for all the parts is available in this GitHub repo.

If this article helped you in any way and you liked it, please show your appreciation by sharing it with your community. If there are any mistakes, feel free to point them out in the comments below.

To know more about me, please click here, and if you find something interesting, just shoot me a mail and we can have a chat over a cup of ☕️.

