In NLP, the transformer architecture has been a revolutionary development that greatly enhanced models' ability to understand and generate text.
In this tutorial, we are going to dig deep into BERT, a well-known transformer-based model, and provide a hands-on example of fine-tuning the base BERT model for sentiment analysis.
BERT, introduced by researchers at Google in 2018, is a powerful language model built on the transformer architecture. Pushing past earlier architectures such as LSTMs and GRUs, which were either unidirectional or sequentially bidirectional, BERT considers context from both the left and the right simultaneously. This is made possible by the innovative “attention mechanism,” which allows the model to weigh the importance of every word in a sentence when generating representations.
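To make the attention mechanism concrete, here is a minimal single-head, scaled dot-product attention sketch in NumPy. This is purely illustrative (BERT uses multiple heads, learned projections, and masking), but it shows the core idea: each token's output is a weighted average of all tokens' values, with weights derived from query-key similarity.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: each output row is a weighted average of
    the rows of V, weighted by query-key similarity (softmax scores)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the keys
    return weights @ V                                   # weighted sum of values

# Toy example: 3 tokens, embedding dimension 4, self-attention (Q = K = V)
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (3, 4): one contextualized vector per token
```

Because the softmax weights for each token sum to one, every output vector is a convex combination of the value vectors, i.e. a context-aware blend of the whole sentence.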
The BERT model is pre-trained on the following two NLP tasks:
- Masked Language Model (MLM)
- Next Sentence Prediction (NSP)
and is generally used as the base model for various downstream NLP tasks, such as sentiment analysis, which we will cover in this tutorial.
The power of BERT comes from its two-step process:
- Pre-training is the phase where BERT is trained on large amounts of unlabeled text. It learns to predict masked words in a sentence (the MLM task) and to predict whether one sentence follows another (the NSP task). The output of this stage is a pre-trained NLP model with a general-purpose “understanding” of the language.
- Fine-tuning is where the pre-trained BERT model is further trained on a specific task. The model is initialized with the pre-trained parameters, and the entire model is trained on a downstream task, allowing BERT to fine-tune its understanding of language to the specifics of the task at hand.
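To give a feel for the MLM pre-training objective, here is a simplified sketch of the masking step: roughly 15% of input tokens are selected at random and replaced with a [MASK] token, and the model is trained to recover the originals. (The real implementation also replaces some selected tokens with random tokens or leaves them unchanged; this version only masks.)

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=42):
    """Return (masked_tokens, labels): labels holds the original token at
    each masked position and None elsewhere (ignored in the MLM loss)."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)   # hide the token from the model
            labels.append(tok)          # the model must predict this
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

tokens = "the movie was surprisingly good".split()
masked, labels = mask_tokens(tokens)
print(masked)  # ['the', '[MASK]', 'was', 'surprisingly', 'good']
```

During pre-training, the loss is computed only at the masked positions, which forces the model to use the surrounding context from both directions to fill in the blank.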
The complete code is available as a Jupyter Notebook on GitHub.
In this hands-on exercise, we will train a sentiment analysis model on the IMDB movie reviews dataset (license: Apache 2.0), in which each review is labeled as positive or negative. We will load the model using Hugging Face’s transformers library.
Let’s start by loading all the required libraries:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve, auc
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
# Variables to set the number of epochs and samples
num_epochs = 10
num_samples = 100 # set this to -1 to use all data
First, we need to load the dataset and the model tokenizer.
# Step 1: Load dataset and model tokenizer
dataset = load_dataset('imdb')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
Next, we’ll create a plot to see the distribution of the positive and negative classes.
# Data Exploration
train_df = pd.DataFrame(dataset["train"])
sns.countplot(x='label', data=train_df)
plt.title('Class distribution of the training set')
plt.show()
Next, we preprocess our dataset by tokenizing the texts. We use BERT’s tokenizer, which will convert the text into tokens that correspond to BERT’s vocabulary.
# Step 2: Preprocess the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
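Conceptually, padding="max_length" and truncation=True ensure every example ends up exactly the model's maximum length (512 for BERT), with an attention mask marking which positions are real tokens. A plain-Python sketch of that per-example transformation (the token ids here are made up for illustration; 0 stands in for the pad id used by bert-base-uncased):

```python
def pad_or_truncate(token_ids, max_length=512, pad_id=0):
    """Mimic padding='max_length', truncation=True for one example."""
    ids = token_ids[:max_length]          # cut off inputs that are too long
    attention_mask = [1] * len(ids)       # real tokens get mask value 1
    pad_len = max_length - len(ids)
    return ids + [pad_id] * pad_len, attention_mask + [0] * pad_len

ids, mask = pad_or_truncate([101, 2023, 3185, 102], max_length=8)
print(ids)   # [101, 2023, 3185, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

The attention mask tells the model to ignore the padded positions, so batches of differently sized reviews can be processed as one fixed-shape tensor.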
After that, we prepare our training and evaluation datasets. Remember, if you want to use all the data, you can set the num_samples variable to -1.
if num_samples == -1:
    small_train_dataset = tokenized_datasets["train"].shuffle(seed=42)
    small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42)
else:
    small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(num_samples))
    small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(num_samples))
Then, we load the pre-trained BERT model. We’ll use the
AutoModelForSequenceClassification class, which loads BERT with a classification head on top.
For this tutorial, we use the ‘bert-base-uncased’ version of BERT, which is trained on lower-cased English text.
# Step 3: Load pre-trained model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
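Conceptually, the classification head that AutoModelForSequenceClassification adds is a small linear layer mapping BERT's pooled sentence representation (768 dimensions for bert-base) to num_labels logits. A simplified NumPy sketch of that idea, not the actual transformers implementation:

```python
import numpy as np

def classification_head(pooled_output, W, b):
    """Map a (hidden_size,) pooled sentence vector to num_labels logits."""
    return pooled_output @ W + b

hidden_size, num_labels = 768, 2          # bert-base hidden size, binary task
rng = np.random.default_rng(0)
W = rng.normal(size=(hidden_size, num_labels)) * 0.02  # freshly initialized
b = np.zeros(num_labels)

pooled = rng.normal(size=hidden_size)     # stand-in for BERT's pooled output
logits = classification_head(pooled, W, b)
print(logits.shape)  # (2,): one logit per sentiment class
```

This head starts from random weights, which is why the library warns that some weights are newly initialized: the fine-tuning step below is what trains them.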
Now, we’re ready to define our training arguments and create a
Trainer instance to train our model.
# Step 4: Define training arguments
training_args = TrainingArguments("test_trainer", evaluation_strategy="epoch", no_cuda=True, num_train_epochs=num_epochs)
# Step 5: Create Trainer instance and train
trainer = Trainer(
    model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset
)
trainer.train()
Having trained our model, let’s evaluate it. We’ll calculate the confusion matrix and the ROC curve to understand how well our model performs.
# Step 6: Evaluation
predictions = trainer.predict(small_eval_dataset)
# Confusion matrix
cm = confusion_matrix(small_eval_dataset['label'], predictions.predictions.argmax(-1))
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()
# ROC Curve
fpr, tpr, _ = roc_curve(small_eval_dataset['label'], predictions.predictions[:, 1])
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(1.618 * 5, 5))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc='lower right')
plt.show()
The confusion matrix gives a detailed breakdown of how our predictions measure up to the actual labels, while the ROC curve shows us the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) at various threshold settings.
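To make those definitions concrete, here is how sensitivity and the false positive rate fall out of a confusion matrix, using toy labels rather than our model's actual predictions:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 1]   # toy ground-truth labels
y_pred = [0, 0, 1, 0, 1, 1, 0, 1]   # toy predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)        # true positive rate
specificity = tn / (tn + fp)
fpr = 1 - specificity               # false positive rate
print(sensitivity, fpr)  # 0.75 0.25
```

The ROC curve simply traces (fpr, sensitivity) pairs as the decision threshold on the model's scores varies, and the AUC summarizes that trade-off in a single number.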
Finally, to see our model in action, let’s use it to infer the sentiment of a sample text.
# Step 7: Inference on a new sample
sample_text = "This is a fantastic movie. I really enjoyed it."
sample_inputs = tokenizer(sample_text, padding="max_length", truncation=True, max_length=512, return_tensors="pt")
# Move inputs to the same device as the model (if a GPU is available)
sample_inputs = {k: v.to(model.device) for k, v in sample_inputs.items()}

# Make prediction
predictions = model(**sample_inputs)
predicted_class = predictions.logits.argmax(-1).item()

if predicted_class == 1:
    print("Positive sentiment")
else:
    print("Negative sentiment")
By walking through an example of sentiment analysis on IMDb movie reviews, I hope you’ve gained a clear understanding of how to apply BERT to real-world NLP problems. The Python code I’ve included here can be adjusted and extended to tackle different tasks and datasets, paving the way for even more sophisticated and accurate language models.