The Concept of Transformers and Training a Transformer Model

A step-by-step guide on how transformer networks work

Ayoola Olafenwa
Towards Data Science



What is Natural Language Processing (NLP)?

Natural Language Processing is the branch of artificial intelligence that deals with giving machines the ability to understand human languages. It makes it possible for machines to read texts, understand their grammatical structure and interpret the contextual meaning of the words used in sentences. It is used in machine translation to translate from one language to another; the most commonly used NLP translator is Google Translate, which can translate documents and websites between 133 different languages. OpenAI GPT-3 is one of the most advanced NLP models created, and it performs a wide variety of language tasks such as text generation, question answering and text summarization. Sentiment analysis is an important branch of Natural Language Processing that is used by organizations to analyze product reviews and differentiate positive reviews from negative ones. Text generation is one of the interesting fields of NLP; it powers the autocomplete features on mobile phones that suggest appropriate words and complete our sentences.

There are different branches of NLP, and I will explain a few of them below.

  • Sentiment Analysis: The analysis of texts and the classification of their sentiment as either positive or negative.
  • Text Generation: The generation of text: we provide a word prompt, and the NLP model autocompletes the sentence.
  • Text Summarization: The use of NLP techniques to summarize long texts into shorter ones.
  • Language Translation: The use of Natural Language models to translate text from one language to another, for example translating English sentences into French.
  • Masked Language Modeling: The prediction of a masked word in a sentence using an NLP model.

What is a Transformer Network?

The transformer is a neural network architecture designed to solve natural language processing tasks. Transformer networks use a mechanism called attention to study and understand the context of words used in sentences and to extract useful information from them. The transformer was introduced in the popular paper Attention Is All You Need by Ashish Vaswani et al.

Types of Transformer Networks

We have three major types of transformer networks: Encoder, Decoder and Sequence2Sequence (encoder-decoder) transformer networks.

Encoder Transformer Network: It is a bidirectional transformer network that takes in text and produces a feature vector representation for each word in the sentence. The encoder uses a self-attention mechanism to understand the context of the words used in a sentence and extract useful information from them.

Below is a diagram representation of how the encoder is able to understand the simple sentence “Coding is amazing”.

[Diagram by author: the encoder maps each word of “Coding is amazing” to a feature vector.]
  • Diagram breakdown: The encoder uses a self-attention mechanism to generate a feature vector, or numerical representation, for each word in the sentence. The word “Coding” is assigned feature vector 1, the word “is” is assigned feature vector 2 and the word “amazing” is assigned feature vector 3. The feature vector of the word “is” represents not only the context of “is” but also information from the feature vectors of the words on both sides of it, “Coding” and “amazing”; hence the name bidirectional network, because it studies the context of the words on both the left and the right. The feature vectors are used to study the relationships that exist among the words, understand the context in which they are used, and interpret the meaning of the sentence. Imagine this is a sentiment analysis task where we want to classify whether this sentence has a positive or negative sentiment. The network understands the sentence by studying the context of each word and how it relates to the other words, and therefore classifies the sentence as positive. It is positive because we are describing “coding” and the adjective we used to describe it is “amazing”.

The encoder network is used to solve classification problems like Masked Language Modeling, predicting a masked word in a sentence, and sentiment analysis, predicting positive and negative sentiments in sentences. Common encoder transformer networks are BERT, ALBERT and DistilBERT.

Decoder Transformer Network (Autoregressive Model): It uses a masked attention mechanism to understand the context of a sentence in order to generate words.

[Diagram by author: the decoder autocompletes the prompt “Marvel Avengers Endgame” one word at a time.]
  • Imagine a simple scenario where we trained a decoder network on a text corpus containing information about various movies, to generate or autocomplete sentences about movies. We pass the incomplete sentence “Marvel Avengers Endgame” into the decoder model, and we want the model to predict the appropriate words to complete it. The decoder is a unidirectional network, and it generates a feature vector representation for each word. What differentiates it from the encoder network is that it is unidirectional, as opposed to the encoder’s bidirectional nature. The decoder studies word representations from a single context, either the right or the left; in this case it studies the context of the words on the left to generate the next word. Based on the previous words it generates the next word “is”, after “is” it generates “a”, after “a” it generates “superhero”, and finally it generates “movie”. The complete sentence is therefore “Marvel Avengers Endgame is a superhero movie”. We can observe how it generates words based on the previous words, hence the name autoregressive: it must look backwards to study the context of the words and extract information from previous words to generate the next ones. Examples of autoregressive networks are GPT-3 and CTRL.

Encoder-Decoder or Sequence2Sequence Transformer Network: It is a combination of encoder and decoder transformer networks. It is used in more complex natural language tasks like translation and text summarization.

Translation illustration of an encoder-decoder network: The encoder network encodes the words in a sentence, generating feature vectors that capture their context and extract useful information from them. The outputs of the encoder are passed to the decoder network, which works on them to generate the appropriate words in the target language. For example, we pass an English sentence into the encoder, it extracts useful information from the English context and passes it to the decoder, and the decoder decodes the encoder outputs and generates the sentence in French.
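As a quick illustration of the idea, the transformers pipeline API can run a pretrained encoder-decoder translation model in a few lines. This is a minimal sketch, not part of the original tutorial, and it assumes the publicly available Helsinki-NLP/opus-mt-en-fr checkpoint:

from transformers import pipeline

# Load a pretrained English-to-French encoder-decoder (Sequence2Sequence) model
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

# The encoder reads the English sentence, the decoder generates the French one
result = translator("Coding is amazing")
print(result[0]["translation_text"])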

Concept of Tokenization

Word Tokenization: It is the conversion of a sentence into individual words. It usually generates a large vocabulary size that is not ideal for training NLP models.

Vocabulary size: It refers to the number of unique words (or tokens) in a text.

text = "Python is my favourite programming language"print(text.split())##Output
['Python', 'is', 'my', 'favourite', 'programming', 'language']

This sample code shows how word tokenization is done: we split a sentence into individual words.

Character Based Tokenization: It is the conversion of the words in a sentence into characters. For example, a sentence like “Hello everyone” will be split into individual characters like this:

[Diagram by author: the sentence “Hello everyone” split into its individual characters.]

It generates a smaller vocabulary size compared to word tokenization, but it is not yet good enough, because the individual characters of a word do not carry the same meaning as the word itself.

Subword Tokenization: It is the best form of tokenization, employed in most natural language processing tasks. Word tokenization handles tokenization by splitting a sentence into individual words, and this approach is not perfect for all conditions. Take the two words “bird” and “birds” in a sentence: one is singular and the other is plural, yet word tokenization will treat them as completely different words, and this is where subword tokenization comes into place. Subword tokenization divides compound words and rare words into subwords. It considers the similarity of words like “bird” and “birds”: instead of treating “birds” as an unrelated word, it recognizes that the “s” at the end of “birds” is a subword of the word “bird”. A word like “meaningful” will be split into the word “meaning” and the subword “ful”; usually a special character is added to the subword, like “##ful”, to indicate that it is not the beginning of a word but a subword of another word. Subword tokenization algorithms are used in transformer networks like BERT and GPT. BERT uses the WordPiece tokenizer as its subword tokenizer.
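A quick way to see subword tokenization in action is to call a pretrained WordPiece tokenizer directly. This is a minimal sketch, not from the original tutorial; it assumes the bert-base-uncased checkpoint, and the exact split depends on the tokenizer's vocabulary:

from transformers import AutoTokenizer

# Load BERT's WordPiece tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A rare word is split into subwords, e.g. ['token', '##ization']
print(tokenizer.tokenize("tokenization"))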

Train a Masked Language Model with Transformers

Our main goal is to train a masked language model using an encoder transformer network, a model that can predict the appropriate word for a masked word in a sentence. In this tutorial we shall use Hugging Face Transformers, a very good library that makes it easy to train a model with transformers.

This part of the tutorial requires a basic knowledge of Python programming language and Pytorch deep learning library.

Install PyTorch
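The exact install command depends on your operating system and CUDA version (check pytorch.org for the one that matches your setup); a basic CPU install with pip looks like this:

pip3 install torch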

Install other packages

pip3 install transformers
pip3 install datasets
pip3 install accelerate

Finetuning a pretrained Masked Language Model for Movie reviews

We shall make use of a pretrained DistilBERT transformer model, a lighter version of BERT, and finetune it on the IMDb dataset (a dataset containing thousands of reviews of different movies) to predict a masked word in a sentence.

Load and Tokenize Dataset

Load IMDB Data

Line 1–10: We imported the module for loading the IMDb dataset, printed out the dataset information to confirm that it loaded, and randomly printed two reviews from the dataset.
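A minimal sketch of this loading step, assuming the Hugging Face datasets library (variable names are illustrative):

from datasets import load_dataset

# Load the IMDb movie review dataset and print its outline
imdb_dataset = load_dataset("imdb")
print(imdb_dataset)

# Randomly print two reviews from the train split
sample = imdb_dataset["train"].shuffle(seed=42).select(range(2))
for row in sample:
    print(f"\n'>> Review: {row['text']}'")

If the dataset is properly loaded, this should be the output: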

DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 25000
})
test: Dataset({
features: ['text', 'label'],
num_rows: 25000
})
unsupervised: Dataset({
features: ['text', 'label'],
num_rows: 50000
})
})
'>> Review: "Against All Flags" is every bit the classic swashbuckler. It has all the elements the adventure fan could hope for and more for in this one, the damsel in distress is, well, not really in distress. As Spitfire Stevens, Maureen O'Hara is at her athletic best, running her foes through in defiance of the social norms of the period. Anthony Quinn rounds out the top three billed actors as the ruthless Captain Roc Brasiliano and proves to be a wily and capable nemesis for Brian Hawke (Flynn). For the classic adventure fan, "Against All Flags" is a must-see. While it may not be in quite the same league as some of Errol Flynn's earlier work (Captain Blood and The Sea Hawk, for instance), it is still a greatly entertaining romp.'
'>> Review: Deathtrap gives you a twist at every turn, every single turn, in fact its biggest problem is that there are so many twists that you never really get oriented in the film, and it often doesn't make any sense, although they do usually catch you by surprise. The story is very good, except for the fact that it has so many twists. The screenplay is very good with great dialogue and characters, but you can't catch all the development because of the twists. The performances particularly by Caine are amazing. The direction is very good, Sidney Lumet can direct. The visual effects are fair, but than again most are actually in a play and are fake. Twists way to much, but still works and is worth watching.'

It prints out the outline of the IMDb dataset: the train, test and unsupervised sections and their number of rows. Each row represents a review in the dataset. The train and test sections have 25,000 reviews each, while the unsupervised section has 50,000 reviews. The last two reviews above were randomly printed from the IMDb dataset.

Tokenize Dataset

Line 1–11: We imported the AutoTokenizer class and loaded the tokenizer from the DistilBERT model, which is a WordPiece subword tokenizer. We created a function to tokenize the IMDb dataset.

Line 13–15: Finally we applied the tokenizer function to the loaded dataset. When mapping the tokenizer function we removed the text and label columns from the tokenized dataset, because they will no longer be needed.
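A sketch of this tokenization step, assuming the distilbert-base-uncased checkpoint (the standard DistilBERT checkpoint on the Hugging Face Hub) and the imdb_dataset variable from the previous sketch:

from transformers import AutoTokenizer

# Load DistilBERT's WordPiece tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    # Convert raw review text into input_ids and attention_mask
    return tokenizer(examples["text"])

# Tokenize the whole dataset and drop the raw text and label columns
tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
print(tokenized_datasets)

Printing the tokenized dataset shows this output: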

DatasetDict({
train: Dataset({
features: ['input_ids', 'attention_mask'],
num_rows: 25000
})
test: Dataset({
features: ['input_ids', 'attention_mask'],
num_rows: 25000
})
unsupervised: Dataset({
features: ['input_ids', 'attention_mask'],
num_rows: 50000
})
})

In each section of the dataset we now have two features, input_ids and attention_mask. The input_ids are the ids generated for the tokenized words. The attention_mask is generated by the tokenizer to indicate which input ids the model should pay attention to and which should be ignored; it consists of 1s and 0s, where 1 marks a useful token and 0 marks a token to be ignored (such as padding).
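A small illustrative example of these two features, reusing the tokenizer from the sketch above on a short sentence:

# Encode a short sentence and inspect the two features described above
encoding = tokenizer("Python is my favourite programming language")
print(encoding["input_ids"])
print(encoding["attention_mask"])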

Concat and Chunk Dataset

In Natural Language Processing we need to set a maximum length for the text sequences to be trained; the maximum sequence length for the pretrained DistilBERT model is 512 tokens.

Line 2–6: We set the chunk size to 128. We used a chunk size of 128 instead of 512 because of GPU utilization. We concatenated all the text sequences in the dataset into a single concatenated dataset.

Line 8–13: We obtained the total length of the concatenated dataset, created a dictionary comprehension to loop over the concatenated sequences, and divided the concatenated text into chunks according to the chunk size of 128. If a very powerful GPU is available, a chunk size of 512 should be used. The concatenated dataset is divided into many chunks of equal size, but the last chunk is usually smaller, so we drop it.

Line 18–22: The dictionary with the chunks is given a new labels column containing a copy of the input ids of the chunked samples. Finally we applied the concat-and-chunk function to the tokenized dataset.
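A sketch of this concatenate-and-chunk step, following the recipe from the Hugging Face course (function and variable names are illustrative):

chunk_size = 128

def group_texts(examples):
    # Concatenate all sequences in the batch into one long sequence per feature
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the last chunk if it is smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split the concatenated sequences into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # The labels column is a copy of the input ids; masking happens later
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(group_texts, batched=True)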

Mask Test Dataset For Evaluation

Line 1–9: We imported DataCollatorForLanguageModeling from transformers, the default utility for creating masked tokens in the dataset. The dataset was downsampled to 10,000 samples, and 10% of them (1,000 samples) were split off as a test dataset to be used for evaluation.

Line 13–24: We defined a data collator and a function for inserting masks randomly in a dataset. The insert random mask function is applied to the test dataset to replace the unmasked columns with masked columns. The masked test dataset will serve as the ground truth labels for evaluating the model during training.
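A sketch of this masking step, following the Hugging Face course recipe; the 15% masking probability is an assumption, while the 10,000/10% split matches the numbers above:

from transformers import DataCollatorForLanguageModeling

# Data collator that randomly masks 15% of the tokens
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Downsample to 10,000 samples and keep 10% (1,000 samples) for evaluation
train_size = 10_000
test_size = int(0.1 * train_size)
downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)

def insert_random_mask(batch):
    # Apply the collator once so the test set has fixed masked tokens
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    masked_inputs = data_collator(features)
    return {"masked_" + k: v.numpy() for k, v in masked_inputs.items()}

eval_dataset = downsampled_dataset["test"].map(
    insert_random_mask,
    batched=True,
    remove_columns=downsampled_dataset["test"].column_names,
)
# Rename the masked columns back to the names the model expects
eval_dataset = eval_dataset.rename_columns(
    {
        "masked_input_ids": "input_ids",
        "masked_attention_mask": "attention_mask",
        "masked_labels": "labels",
    }
)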

Training Procedure

Line 9–23: We set the batch size to 32 and loaded the train and test datasets using PyTorch's built-in DataLoader. We loaded the pretrained DistilBERT model and used the Adam optimizer.

Line 26–28: We called the accelerate library for training; it takes in the pretrained model, the optimizer, and the train and evaluation dataloaders and prepares them for training.

Line 31–37: We set the number of training epochs, obtained the length of the train dataloader and calculated the number of training steps. Finally we set up the learning rate scheduler, which accepts the optimizer, the warm-up steps and the training steps.
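A sketch of this training setup, building on the previous sketches; the batch size of 32 and 30 epochs come from the text, while the AdamW optimizer (the Adam variant commonly used with transformers), the 5e-5 learning rate and the linear scheduler with no warm-up are assumptions:

from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import AutoModelForMaskedLM, default_data_collator, get_scheduler
from accelerate import Accelerator

batch_size = 32
train_dataloader = DataLoader(
    downsampled_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,  # masks tokens on the fly during training
)
eval_dataloader = DataLoader(
    eval_dataset, batch_size=batch_size, collate_fn=default_data_collator
)

# Pretrained DistilBERT with a masked language modeling head
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
optimizer = AdamW(model.parameters(), lr=5e-5)

# Prepare everything for training with accelerate
accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

num_train_epochs = 30
num_training_steps = num_train_epochs * len(train_dataloader)

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)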

Train Code

Line 4–7: We defined a progress bar using the tqdm library to monitor training progress, then set a directory for the output trained models.

Line 9–19: We defined a for loop over the number of epochs. For each epoch we looped through the train dataloader, computed the model outputs, calculated the loss on the outputs, used the accelerate package to perform backpropagation, and stepped the optimizer to minimize the loss. We then stepped the learning rate scheduler, used the optimizer to reset the accumulated gradients to zero, and updated the progress bar. This is repeated until the entire dataset has been trained for the epoch.

Line 22–38: We evaluated the trained model for the epoch on the test dataset and computed the losses on it, similarly to how it was done during training. We calculated the cross-entropy loss for the model, then took the exponential of the loss to obtain the perplexity of the model.
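A sketch of this training and evaluation loop, continuing from the setup sketch above (the output directory name matches the one used later in the article):

import math
import torch
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))
output_dir = "MLP_TrainedModels"

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for batch in eval_dataloader:
        with torch.no_grad():
            outputs = model(**batch)
        losses.append(accelerator.gather(outputs.loss.repeat(batch_size)))

    losses = torch.cat(losses)[: len(eval_dataset)]
    try:
        # Perplexity is the exponential of the mean cross-entropy loss
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")
    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")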

Perplexity is a metric used for evaluating language models. It is the exponential of the cross-entropy loss.

Line 41–45: We used the accelerator to save the trained model, and used the tokenizer to save important files about the model, such as the tokenizer and vocabulary info. The trained model and the configuration files are saved in the output directory MLP_TrainedModels; a sketch of this saving step follows the directory listing below. I trained for 30 epochs and got a perplexity value of 9.19. The output models folder will look like this:

--MLP_TrainedModels
--config.json
--pytorch_model.bin
--special_tokens_map.json
--tokenizer_config.json
--tokenizer.json
--vocab.txt
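The saving step described in Line 41–45 might look like this minimal sketch, continuing from the training loop above:

# Save the trained model and tokenizer files to the output directory
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
if accelerator.is_main_process:
    tokenizer.save_pretrained(output_dir)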

Full Training Code

Test Trained Model

The trained model is stored in MLP_TrainedModels, and we paste that directory in as the model value. We print out a list of sentences generated by the model, with appropriate values predicted for the masked word.
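A sketch of this test step using the fill-mask pipeline with the locally saved model; the prompt is inferred from the outputs shown below:

from transformers import pipeline

# Load the finetuned masked language model from the output directory
mask_filler = pipeline("fill-mask", model="MLP_TrainedModels")

preds = mask_filler("this is an [MASK] movie.")
for pred in preds:
    print(f">>> {pred['sequence']}")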

  • Output
>>> this is an excellent movie.
>>> this is an amazing movie.
>>> this is an awesome movie.
>>> this is an entertaining movie.

We can see the predictions from the model for the masked word, which are excellent, amazing, awesome and entertaining. The predictions fit perfectly in completing the sentence.

We have successfully trained a masked language model with an encoder transformer network that can find the correct word to replace a masked word in a sentence.

I have pushed the Masked Language Model I trained to the Hugging Face Hub, and it is available for testing. Check the Masked Language Model on the Hugging Face repository.

REST API Code for Testing the Masked Language Model

This is the Inference API Python code for testing the masked language model directly from Hugging Face.
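A minimal sketch of such an Inference API call; the model id and API token are placeholders that should be replaced with your own repository name and Hugging Face access token:

import requests

API_URL = "https://api-inference.huggingface.co/models/<username>/<model-repo>"
headers = {"Authorization": "Bearer <YOUR_HF_API_TOKEN>"}

def query(payload):
    # Send the masked sentence to the hosted model and return its predictions
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({"inputs": "washington dc is the [MASK] of usa."})
print(output[0]["sequence"])  # top prediction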

Output

washington dc is the capital of usa.

It produces the correct output, "washington dc is the capital of usa".

Load the Masked Language Model with Transformers

You can easily load the language model with transformers using code like this.
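A minimal sketch of loading the model with the fill-mask pipeline; the repository name is a placeholder, and the example sentence is an assumption chosen only to illustrate predicting a masked word:

from transformers import pipeline

# Load the model directly from the Hugging Face Hub (replace with your repo id)
mask_filler = pipeline("fill-mask", model="<username>/<model-repo>")

preds = mask_filler("Reading [MASK] my favourite hobby.")
print(preds[0]["token_str"])  # prints the predicted masked word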

Output

is

It prints out the predicted masked word “is”.

Colab Training

I created a Google Colab notebook with steps for creating a Hugging Face account, training the Masked Language Model and uploading the model to the Hugging Face repository. Check the notebook.

Check the GitHub repository for this tutorial.

Conclusion

We discussed in detail in this article the basics of Natural Language Processing, how transformers work, the different types of transformer networks, the process of training a Masked Language Model with transformers, and we successfully trained a transformer model that can predict a masked word in a sentence.

References

https://huggingface.co/course/chapter1/2?fw=pt

https://github.com/huggingface/transformers

Reach me via:

Email: olafenwaayoola@gmail.com

Linkedin: https://www.linkedin.com/in/ayoola-olafenwa-003b901a9/

Twitter: @AyoolaOlafenwa

