This series of blogs will walk through coding a Self-Attention Transformer from scratch in PyTorch, text classification using that Self-Attention Transformer, and different strategies for solving classification problems with multiple categories, each category having some number of classes. The series is divided into four parts:
- Part — 1 (this): Introduction to the Dataset and steps for Data Preparation and pre-processing
- Part — 1.1 (optional): Building and understanding the Multi-Head Self-Attention Transformer network with code in PyTorch
- Part — 2: Simple Classification technique to classify different categories with some number of classes in each category
- Part — 3: One Encoder, N-Decoder strategy to classify different categories with some number of classes in each category
Introduction to Dataset and steps for Data Preparation and pre-processing
There are many ways to achieve good accuracy when classifying a sentence or a paragraph into different categories. As the title suggests, in this series of blogs we will discuss one of the most talked-about and widely employed model architectures. Instead of using the transformers library by HuggingFace or any other pre-trained models, we will code a Multi-Head Self-Attention Transformer using PyTorch. To make things more fun and somewhat complicated, the dataset we will be training on has two sets of categories, and we will discuss and implement different approaches to build a good classification model that can classify text into two different sets of categories, each having several classes.
About the Dataset
The dataset we will be using is a question classification dataset. The two sets of categories provide information about what type of answer would be required for a question asked. You can find the dataset here.
For example, take the question What are liver enzymes? This question requires a descriptive answer, most suitably a definition. So here, the class is descriptive text and the subclass is definition.
How does the data look?
You can do a wget with the link address of the download files to download the data. We will be using the tokenizers library to convert the question texts to tokens. Since the Hugging Face tokenizers library is written in Rust, it is faster than any pure-Python implementation, which is why we are leveraging it. You can also try the BytePairEncoding library available here to convert questions to tokens, but it is much slower than the Hugging Face tokenizers.
Once you download the data, we will clean the sentences and extract our class and subclass labels using the following steps.
- Imports and Data Loading
- Decoding a line from bytes to string
- All questions to list
- Strings to class, Sub-class and Questions
- The constructed list of dictionaries will look something like this
- Converting a list of dictionaries to a dataframe
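The loading and parsing steps above can be sketched as follows. This is a minimal, self-contained sketch: the filename train.label and the exact TREC-style line format (b"DESC:def What are liver enzymes ?") are assumptions made for illustration.

```python
import pandas as pd

# Write two sample lines as a stand-in for the downloaded file; the filename
# "train.label" and the line format are assumptions for this sketch.
sample = b"DESC:def What are liver enzymes ?\nHUM:ind Who was The Pride of the Yankees ?\n"
with open("train.label", "wb") as f:
    f.write(sample)

data = []
with open("train.label", "rb") as f:
    for raw in f:
        line = raw.decode("latin-1").strip()   # decode a line from bytes to string
        label, question = line.split(" ", 1)   # the first token carries both labels
        cls, subcls = label.split(":")         # "DESC:def" -> class "DESC", subclass "def"
        data.append({"class": cls, "subclass": subcls, "question": question})

df = pd.DataFrame(data)                        # list of dictionaries -> dataframe
```

Each dictionary holds one question with its class and subclass, which makes the later conversion to a dataframe a one-liner.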
- Class Names to indexes and vice-versa
In total there are 6 classes
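A minimal sketch of building the two lookup tables. The class list used here is the standard TREC coarse label set, which this dataset appears to use; in the actual pipeline the names would come from the dataframe itself.

```python
# Sketch of the class-name <-> index mappings; the class list here is the
# standard TREC coarse label set (an assumption for this example).
classes = sorted(["ABBR", "DESC", "ENTY", "HUM", "LOC", "NUM"])  # in practice: sorted(df["class"].unique())
class2idx = {name: idx for idx, name in enumerate(classes)}
idx2class = {idx: name for name, idx in class2idx.items()}
```

Sorting before enumerating keeps the mapping deterministic across runs.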
- Save these class2idx and idx2class mappings safely. I generally write these to disk so they can be reloaded later.
- Repeating the above two steps for subclasses
There are 47 subclasses in total
- Mapping the classes and subclasses inside the dataframe to their indexes
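The mapping step above can be sketched with pandas' map; the tiny example frame and the specific index values are assumptions here.

```python
import pandas as pd

# Sketch of replacing label strings with integer ids inside the dataframe;
# the tiny frame and mapping values are assumptions for illustration.
df = pd.DataFrame([
    {"class": "DESC", "subclass": "def", "question": "What are liver enzymes ?"},
    {"class": "HUM", "subclass": "ind", "question": "Who was The Pride of the Yankees ?"},
])
class2idx = {"DESC": 1, "HUM": 3}
subclass2idx = {"def": 0, "ind": 1}

df["class"] = df["class"].map(class2idx)          # class names -> indexes
df["subclass"] = df["subclass"].map(subclass2idx)  # subclass names -> indexes
```

After this, the label columns are plain integers and ready to become training targets.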
- Let’s tokenize the question texts
We will convert the text to computer-understandable numbers, just as we did for the labels. The vocabulary file used here was obtained by training BertWordPieceTokenizer on the wikitext data, and the vocabulary size is 10k. You can download it from here.
Let’s start tokenizing. We separately maintain a list that stores the number of tokens for every question. The longest question has 52 tokens, so we can safely set the maximum sequence length to 100.
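The tokenization step can be sketched like this. The tiny stand-in vocabulary below exists only to make the example self-contained; the blog's pipeline uses the downloaded 10k-token wikitext vocabulary file instead.

```python
from tokenizers import BertWordPieceTokenizer

# Tiny stand-in vocabulary so this sketch runs on its own; in the actual
# pipeline, vocab.txt is the downloaded 10k wikitext vocabulary.
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
         "what", "are", "liver", "enzymes", "?"]
with open("vocab.txt", "w") as f:
    f.write("\n".join(vocab))

tokenizer = BertWordPieceTokenizer("vocab.txt")   # lowercases input by default

questions = ["What are liver enzymes ?"]
token_counts = [len(tokenizer.encode(q).ids) for q in questions]  # lengths before padding

MAX_LEN = 100                                     # comfortably above the longest question (52 tokens)
tokenizer.enable_padding(length=MAX_LEN)
tokenizer.enable_truncation(max_length=MAX_LEN)
outputs = [tokenizer.encode(q).ids for q in questions]
```

Each encoded question comes back as a fixed-length list of token ids, padded with the [PAD] id up to MAX_LEN.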
- Save the outputs list to a pickle
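This last saving step could look like the following; the filename outputs.pkl is an assumption.

```python
import pickle

# Sketch of dumping the tokenized outputs; the filename is an assumption.
outputs = [[2, 5, 6, 7, 8, 9, 3] + [0] * 93]   # e.g. one padded id sequence of length 100

with open("outputs.pkl", "wb") as f:
    pickle.dump(outputs, f)

with open("outputs.pkl", "rb") as f:
    restored = pickle.load(f)   # round-trips the exact list of int lists
```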
This notebook can be followed to implement all of the above.
Now our data is prepared and tokenized and is available in a numerical form. You can directly move to Part — 2 where we start with the model architecture and training or you can go through Part — 1.1 where the Multi-Head Self-Attention Transformer network is built from scratch in PyTorch and then move on. The choice is yours 😃.
The code for all the parts is available in this GitHub repo.
If this article helped you in any way and you liked it, please appreciate it by sharing it among your community. If there are any mistakes, feel free to point them out by commenting down below.
To know more about me please click here and if you find something interesting just shoot me a mail and we could have a chat over a cup of ☕️.