Introduction to NLP models

This course introduces readers to state-of-the-art NLP models. It was created by WikiClub students at IIIT Hyderabad.

Natural Language Processing (NLP) models are a vital part of artificial intelligence that enable computers to interact with human language. These models use machine learning and deep learning techniques to understand, interpret, and generate human language text. NLP models have a wide range of applications, including language translation, sentiment analysis, chatbots, and search engines.

In this tutorial, we're going to explore different NLP models and how they're used in various real-life language tasks. We'll look at their basic ideas, how they're structured, where they come in handy, and how you can work with them in languages like Python. We'll also touch on similar models in the NLP field.

Models:

  1. BERT
  2. BART
  3. T5
  4. RoBERTa


Data Cleaning: Natural language is free-form text, which means it is highly unstructured. As a result, cleaning and preparing the data to extract features is an essential step in NLP before building any model.

  1. Removing stopwords: Some words occur very frequently in human communication but carry little meaning and add no value to the model. There may also be words that are irrelevant for the business case at hand. These words should be removed from the corpus, which can be done using NLTK, as sketched after this list.
  2. Lower case: Converting all the letters to lower case to maintain uniformity across the data.
  3. Lemmatization: This is a text pre-processing technique used in natural language processing (NLP) models that reduces a word to its base form (lemma) so that different inflections are treated alike. The word "talk," for example, may occur as "talking," "talks," or "talked," all of which reduce to the lemma "talk."

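The short sketch below illustrates these three steps with NLTK (one possible choice; other libraries such as spaCy work equally well). It assumes the punkt, stopwords, and wordnet resources have been downloaded.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time download of the required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

text = "The players were talking about the matches they had talked about before."

# Lower case: convert all letters to lower case
text = text.lower()

# Removing stopwords: drop common words that add little meaning
stop_words = set(stopwords.words('english'))
tokens = [t for t in word_tokenize(text) if t.isalpha() and t not in stop_words]

# Lemmatization: reduce each word to its base form, e.g. 'talking' -> 'talk'
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t, pos='v') for t in tokens]

print(lemmas)
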

Tokenization: NLP models begin by breaking down text into smaller units called tokens. Tokens are typically words or subwords. This process is essential for processing and understanding text effectively.
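
As a small illustration, the sketch below tokenizes a sentence with the Hugging Face Transformers tokenizer for bert-base-uncased (a WordPiece subword tokenizer; the choice of checkpoint is just an example).

from transformers import AutoTokenizer

# Load a pretrained subword (WordPiece) tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentence = "Tokenization splits text into smaller units."
print(tokenizer.tokenize(sentence))
# Rare or long words are typically split into subwords, e.g. 'tokenization' -> 'token', '##ization'
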
Word Embeddings: NLP models often represent words as vectors in a high-dimensional space. Word embeddings capture semantic relationships between words, enabling the model to understand the meaning and context of words in a sentence.
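
As a toy sketch (the embedding table below is untrained and the sizes are made up), each word maps to a dense vector, and trained embeddings place related words close together in this vector space.

import torch
import torch.nn as nn

# Toy vocabulary and an untrained embedding table; real models learn these vectors from data
vocab = {"king": 0, "queen": 1, "apple": 2}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=16)

king = embedding(torch.tensor(vocab["king"]))
queen = embedding(torch.tensor(vocab["queen"]))

# With trained embeddings, related words would show high cosine similarity
print(torch.nn.functional.cosine_similarity(king, queen, dim=0).item())
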
Recurrent Neural Networks (RNNs): RNNs are a type of neural network that is well-suited for processing sequential data, such as text. They process input data sequentially and maintain a hidden state, which can capture contextual information.
Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU): These are specialized types of RNNs that address the vanishing gradient problem, allowing models to capture long-range dependencies in text. LSTMs and GRUs are commonly used in NLP for tasks like language modeling and machine translation.
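
A minimal PyTorch sketch of running token embeddings through an LSTM is shown below; the vocabulary and layer sizes are illustrative only.

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 10000, 128, 256

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

# A batch of 2 "sentences", each a sequence of 5 token ids
token_ids = torch.randint(0, vocab_size, (2, 5))

embedded = embedding(token_ids)        # shape (2, 5, 128)
outputs, (h_n, c_n) = lstm(embedded)   # outputs: one hidden state per token, shape (2, 5, 256)

# h_n holds the final hidden state, a summary of each sequence
print(outputs.shape, h_n.shape)
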
Convolutional Neural Networks (CNNs): While CNNs are often associated with image processing, they can also be adapted for NLP tasks. In text, they are used to extract features from n-grams or for sentiment analysis.
Transformer Models: Transformer models have revolutionized NLP. They use attention mechanisms to process input data in parallel, rather than sequentially. This has enabled them to achieve state-of-the-art results in many NLP tasks. Key examples of transformer models include BERT, GPT, and T5.
Pre-trained Models: Many NLP models are pre-trained on vast amounts of text data and then fine-tuned for specific tasks. This transfer learning approach has significantly improved the efficiency and effectiveness of NLP models.
Attention Mechanisms: Attention mechanisms enable models to focus on specific parts of the input sequence, which is particularly useful for understanding context and relationships within text.
Language Models: Language models learn to predict the next word in a sequence, which allows them to capture the probabilistic nature of language. Language models can be used for tasks like auto-completion and text generation.
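
For instance, a pretrained causal language model can continue a prompt one predicted token at a time. The sketch below assumes the publicly available gpt2 checkpoint, accessed through the Hugging Face pipeline API.

from transformers import pipeline

# A causal language model repeatedly predicts the next token to continue the prompt
generator = pipeline("text-generation", model="gpt2")

prompt = "Natural language processing models can"
result = generator(prompt, max_new_tokens=10, num_return_sequences=1)
print(result[0]["generated_text"])
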

Different downstream NLP tasks in general

  • Text Classification: Assigning categories or labels to text, like spam detection in emails.
  • Named Entity Recognition (NER): Identifying and categorizing entities such as names of people, places, and organizations in text.
  • Sentiment Analysis: Determining the emotional tone or sentiment of a piece of text, such as positive, negative, or neutral.
  • Machine Translation: Translating text from one language to another, like Google Translate.
  • Summarization: Condensing a long piece of text into a shorter, coherent summary.
  • Question Answering: Automatically providing answers to questions posed in natural language, like chatbots.
  • Language Generation: Creating human-like text, often used in chatbots, content generation, and more.

BERT

Overview

In the fast-evolving landscape of natural language processing (NLP), the BERT model has emerged as a game-changer. BERT, short for Bidirectional Encoder Representations from Transformers, has revolutionized the way machines understand human language. This section provides an in-depth exploration of BERT, from its inception to its applications and implementation.

Need for BERT

Before the emergence of BERT, traditional natural language processing (NLP) models faced notable limitations. These models, predominantly based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs), struggled to understand human language effectively. They lacked contextual understanding, treating words as isolated entities and ignoring nuances in language. Additionally, they often used fixed-length input sequences, which made handling variable-length text a challenge. These models required domain-specific knowledge and extensive data for training, and they had difficulty resolving the multiple meanings of polysemous words in different contexts. The deficiencies of these pre-BERT models underscored the pressing need for a transformative breakthrough in NLP.

BERT, or Bidirectional Encoder Representations from Transformers, was introduced by Google AI researchers in 2018 to address the shortcomings of traditional NLP models. It ushered in a new era of NLP by offering solutions to these limitations. BERT's key innovation lies in its ability to understand contextual nuances by reading text in both directions during pretraining. This contextual understanding significantly enhanced its performance on a wide range of NLP tasks. Moreover, BERT could handle variable-length text, reduce data requirements, and effectively resolve polysemy, all while leveraging the power of transfer learning. By bridging the gap between pretraining and fine-tuning, BERT made NLP more adaptable, efficient, and accessible, setting a benchmark for modern natural language understanding and applications.

Architecture

BERT's architecture is built upon the foundation of the Transformer model, which was introduced by Vaswani et al. in their 2017 paper. It leverages a bidirectional approach to understand language, capturing contextual information from both left-to-right and right-to-left directions. This bidirectional capability, coupled with the Transformer's self-attention mechanism, is what makes BERT an NLP powerhouse. At the core of BERT's architecture lies the Transformer model, which consists of an encoder-decoder structure. In BERT's case, only the encoder is used, because BERT is designed for language understanding tasks rather than language generation.

Multi-head Self Attention

The self-attention mechanism allows BERT to weigh the importance of each word in a sentence relative to others, enabling it to grasp contextual relationships effectively. BERT uses multi-head self-attention, where multiple sets of attention weights are learned, each capturing different aspects of the input data. This enhances the model's ability to focus on different parts of the input text simultaneously.
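
To make this concrete, here is a minimal sketch of scaled dot-product attention, the operation inside each attention head; the tensor sizes are illustrative, not BERT's actual dimensions. Multi-head attention runs several such heads in parallel on different learned projections of the input.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # similarity of each query to each key
    weights = F.softmax(scores, dim=-1)            # attention weights sum to 1 for each query
    return weights @ V, weights

# Illustrative sizes: a sentence of 4 tokens, 8-dimensional head vectors
Q = K = V = torch.randn(4, 8)
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # (4, 8) and (4, 4)
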

Positional Encoding

To account for the order of words in a sentence, positional encodings are added to the word embeddings. This ensures that BERT can differentiate between words with the same content but in different positions within a sentence.

Stacked Encoders

BERT employs multiple stacked encoder layers, typically 12 or 24 in the case of BERT-base and BERT-large, respectively. Each encoder layer processes the input sequentially, refining the contextual information at each step.

Pretraining & Fine-tuning

One of BERT's groundbreaking features is its bidirectional pretraining. Instead of training solely from left to right as in traditional models, BERT reads text in both directions. This means that it can capture the full context of a word by considering all the words that precede and follow it in a sentence. The bidirectional approach is vital for understanding the rich context and nuances of human language.

BERT's architecture consists of two main stages: pretraining and fine-tuning. During pretraining, BERT is pretrained on a massive corpus of text, such as the BooksCorpus and English Wikipedia. It learns to predict missing words in sentences, a task known as masked language modeling. This process equips BERT with a deep understanding of language and a broad vocabulary.
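
As a quick illustration of masked language modeling, a pretrained BERT checkpoint can fill in a masked token out of the box. The sketch below uses the Hugging Face fill-mask pipeline with bert-base-uncased.

from transformers import pipeline

# The pretrained model predicts the most likely words for the [MASK] position
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 4))
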

In the fine-tuning stage, BERT is adapted for specific NLP tasks, such as sentiment analysis or named entity recognition. The model's pretrained knowledge is fine-tuned by training on task-specific data, allowing it to excel in a wide array of NLP applications.

Applications

BERT has found extensive applications across the field of natural language processing (NLP). It excels in tasks like sentiment analysis, named entity recognition, and question answering, where it comprehensively understands and contextualizes language, offering superior accuracy. BERT's versatile pretrained representations have transformed machine translation, improving fluency and context relevance in translation services. In search engines, BERT has reshaped how results are ranked, ensuring more precise and contextually relevant responses to user queries. It has also been employed in chatbots, voice assistants, and text summarization, enhancing human-computer interactions and content generation. The ability to adapt BERT to various NLP tasks through fine-tuning has made it a foundational technology for a diverse range of applications, from healthcare and finance to social media analysis and customer support.

Implementation

In this section, we provide an example of how to use a BERT model from the Hugging Face Transformers library for a simple zero-shot text classification task, i.e. assigning labels the model has not been explicitly fine-tuned on.

In the following Python code, we load the bert-base-uncased model and its tokenizer. Because the plain pretrained model has no classification head trained for our labels, we use a simple embedding-similarity approach: we encode both the input text and each candidate label with BERT, mean-pool the token embeddings into sentence vectors, and compute the cosine similarity between the text vector and each label vector. A softmax over these similarities gives a probability for each candidate label, and the label with the highest probability is taken as the prediction. (This is only a rough illustration; in practice, dedicated zero-shot classification pipelines built on NLI-fine-tuned models usually work better.)

from transformers import BertModel, BertTokenizer
import torch

# Load a plain BERT encoder and its tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
model.eval()

# Example input text and possible class labels
text = "Apple is going to release a new product"
candidate_labels = ["business", "technology", "sports"]

def embed(sentences):
    # Tokenize, run the encoder, and mean-pool the token embeddings into one vector per sentence
    tokens = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')
    with torch.no_grad():
        output = model(**tokens)
    mask = tokens['attention_mask'].unsqueeze(-1).float()
    summed = (output.last_hidden_state * mask).sum(dim=1)
    return summed / mask.sum(dim=1)

# Embed the input text and the candidate labels with the same encoder
text_embedding = embed([text])              # shape: (1, hidden_size)
label_embeddings = embed(candidate_labels)  # shape: (3, hidden_size)

# Cosine similarity between the text embedding and each label embedding
similarities = torch.nn.functional.cosine_similarity(text_embedding, label_embeddings, dim=1)

# Softmax over the similarities gives a pseudo-probability for each candidate label
probabilities = torch.nn.functional.softmax(similarities, dim=0)

# Find the label with the highest probability
max_prob_index = torch.argmax(probabilities).item()
predicted_label = candidate_labels[max_prob_index]

# Print the results
print(f"Predicted Label: {predicted_label}")
print("Class Probabilities:")
for label, prob in zip(candidate_labels, probabilities):
    print(f"{label}: {prob.item():.4f}")


Variants

BERT comes in several variants that differ in the number of parameters and the size of the model. BERT-base, with 110 million parameters, is a popular choice, while BERT-large, with 340 million parameters, offers even more formidable capabilities. Smaller variants like BERT-mini are also available, providing flexibility for different computational resources and tasks. There are also task-specific variants, such as HateBERT (BERT further pretrained for abusive language detection) and BERTweet (a BERT-style model trained on a Twitter corpus), which may prove more beneficial than the original BERT model for certain tasks.

Performance

BERT achieved state-of-the-art performance in a multitude of natural language processing (NLP) tasks. It set new benchmarks in tasks like question answering, named entity recognition, sentiment analysis, and machine translation. BERT's contextual understanding of language, enabled by bidirectional pretraining, allowed it to capture intricate nuances and dependencies in text, significantly outperforming previous NLP models. Its versatility and adaptability, through fine-tuning on task-specific data, made it the go-to model for a wide range of applications, from healthcare and finance to search engines and chatbots. BERT's exceptional results on various benchmark datasets and real-world applications underscore its transformative impact on NLP, solidifying its position as a foundational technology in the field.

Limitations

BERT (Bidirectional Encoder Representations from Transformers) has made remarkable strides in natural language processing, but it also has notable limitations. One key drawback is its substantial computational demands. BERT's architecture is deep and consists of a massive number of parameters, making training and fine-tuning computationally intensive. This can pose challenges for individuals and organizations with limited access to high-performance computing resources. The need for powerful GPUs or TPUs and substantial memory can be a barrier to entry for many, restricting the broader adoption of BERT in resource-constrained environments.

Another limitation of BERT is its lack of domain specificity. While BERT excels in understanding general language patterns, it might not perform optimally in domain-specific or highly specialized contexts. Fine-tuning BERT on task-specific data can mitigate this to some extent, but it may still fall short of models trained specifically for those domains. In applications where precise domain knowledge is crucial, using BERT as a general-purpose model might not yield the desired level of accuracy and contextual understanding. Researchers and practitioners are actively working on ways to make BERT more efficient, domain-specific, and accessible to a broader audience.


BART

Overview

BART, which stands for Bidirectional and Auto-Regressive Transformers, is a cutting-edge natural language processing (NLP) model introduced in 2019 by Facebook AI. It combines a bidirectional (BERT-like) encoder with an autoregressive (GPT-like) decoder, and it has received a lot of attention for its capacity to handle both text comprehension and text generation tasks. BART blends bidirectional pretraining with auto-regressive fine-tuning to create a versatile NLP model that performs well across a variety of language tasks.

Need for BART

The need for BART was driven by the limitations of existing NLP approaches, such as unidirectional RNNs and earlier techniques like HMMs, CRFs, and rule-based models, which struggled to capture contextual information efficiently. BART aims to address this by integrating the best of both worlds: bidirectional pretraining for text comprehension and auto-regressive fine-tuning for text generation.

Capturing Bidirectional context

BART was created to overcome the shortcomings of unidirectional models such as autoregressive language models (e.g., GPT) and unidirectional RNNs, which process text in only one direction. Because they do not consider future words, unidirectional models may struggle to capture the whole context of a word or phrase. BART's bidirectional encoder is intended to alleviate this limitation by taking both left and right context into account at the same time, which makes it more effective at comprehending the meaning of words in context.

Versatility in NLP Tasks

BART was created as a versatile NLP model. Because of its capacity to pretrain bidirectionally and fine-tune autoregressively, it is well suited for a variety of tasks such as text summarization, machine translation, text generation, and document classification. This adaptability is beneficial since it enables a single model to excel across numerous NLP domains.

Advancing Summarization and Translation

BART's architecture is ideal for abstractive text summarization and machine translation. By learning bidirectional context during pretraining, it can effectively capture and reformulate the key information in a source text, which is essential for producing accurate and coherent summaries or translations.

Architecture

BART's architecture is based on the Transformer model, similar to models like BERT and GPT. Here are key components of BART's architecture:

Multi-head Self-Attention

BART employs multi-head self-attention to weigh the significance of different words in the input text, which enables it to capture dependencies and relationships between words. By attending to many positions in the input text at the same time, it can better comprehend the contextual links between words.

Positional Encoding

BART employs positional encodings to account for the sequential nature of language. These encodings convey information about the order of words in a sentence, which is necessary for distinguishing between words in different positions and recognizing their roles in context.

Stacked Encoders

BART uses a stack of encoder layers that process the input text across multiple levels, extracting hierarchical features and contextual information.

Bidirectional Pretraining and Autoregressive Fine-Tuning

The dual-stage training method is the major innovation in BART's design. It begins with bidirectional pretraining, in which the model learns to understand text. The model is then fine-tuned autoregressively for specific tasks, learning to generate text token by token. This combination of bidirectional and autoregressive techniques is what makes BART highly adaptive and versatile.

Pretraining & Fine-tuning

Pretraining is the first stage in the training of models such as BART. The model is exposed to a huge corpus of text data during this phase and learns to capture general patterns, grammar, and semantics of language. The following are the main components of BART pretraining:

Bidirectional Learning: BART's pretraining is bidirectional, so it considers the context of a word or phrase from both the left (preceding words) and the right (following words). In contrast to unidirectional models, which process text in only one direction, this allows BART to build a richer representation of the language's structure and meaning.

Masked Language Modelling: During pretraining, some words in the input text are randomly masked, and BART learns to predict the missing words from the surrounding context, a technique known as masked language modelling. This task encourages the model to acquire contextual relationships between words.

Depth and Capacity: During pretraining, BART typically employs deep neural networks with a large number of parameters. This enables the model to capture complex and hierarchical linguistic patterns. The model's depth and capacity are critical to its ability to learn and represent a wide range of language nuances.

The second phase of BART training is fine-tuning, in which the pretrained model is tailored for specific natural language processing tasks. The model is fine-tuned for various applications by updating its parameters based on task-specific data. The following are the most important parts of fine-tuning:

Task-Specific Objectives: The model is exposed to labelled data for the target task during fine-tuning. The goals for fine-tuning vary depending on the application. The goal of text summarization, for example, may be to provide succinct and coherent summaries, whereas the goal of machine translation may be to translate text from one language to another.

Auto-Regressive Generation: BART's fine-tuning procedure frequently includes autoregressive generation, in which the model generates text one token at a time, predicting each next token from the tokens generated so far. This approach is beneficial for tasks such as text generation, summarization, and machine translation.

Transfer Learning: Fine-tuning makes use of the knowledge and representations acquired during pretraining. The model's pretrained weights and embeddings provide a solid foundation, and this knowledge transfer allows the model to adapt quickly to the target task with less data and training time than training from scratch.

Task-Specific Data: The model is refined using task-specific datasets including examples relevant to the intended application. For example, if text summarization is being fine-tuned, the dataset may consist of pairs of articles and associated human-generated summaries.

Applications

BART has been used in a variety of NLP tasks, including but not limited to text summarization, machine translation, text generation, and document classification. Because of its bidirectional and auto-regressive features, it is an excellent candidate for a wide range of natural language understanding and generation tasks.


Implementation

BART is commonly implemented using deep learning frameworks like PyTorch or TensorFlow, typically through libraries such as Hugging Face Transformers, which provide pre-trained models and fine-tuning scripts that make it easier to use BART for specific NLP tasks.

In this example, we'll load a BART model and tokenizer, and then use the model to generate a summary for a given input text.

from transformers import BartForConditionalGeneration, BartTokenizer
import torch

# Load a BART model and tokenizer
model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Example input text for summarization
input_text = """
In a groundbreaking discovery, scientists have found evidence of water on Mars. 
This discovery opens up new possibilities for future space exploration.
"""

# Tokenize the input text
input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=1024, truncation=True)

# Generate a summary
with torch.no_grad():
    summary_ids = model.generate(input_ids, max_length=150, min_length=30, length_penalty=2.0, num_beams=4, early_stopping=True)

# Decode the generated summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Print the generated summary
print("Generated Summary:")
print(summary)

  • We load the BART model and tokenizer, specifically the "facebook/bart-large-cnn" model.
  • We provide an input text that we want to summarize.
  • We tokenize the input text and encode it for the model.
  • We use the model to generate a summary of the input text. You can adjust parameters like max_length, min_length, length_penalty, and num_beams to control the quality and length of the summary.
  • Finally, we decode the generated summary and print it.

This code demonstrates how to utilize the BART model for text summarization, allowing you to generate concise and coherent summaries of longer pieces of text.

Variants

Several variants of the BART model have emerged to satisfy specific use cases and needs. These include domain-specific models fine-tuned for particular industries or applications, as well as smaller, more efficient versions for deployment on resource-constrained devices. Researchers have also investigated ways of boosting the model's performance on particular NLP tasks, yielding task-specific adaptations. These variants have broadened BART's applicability and demonstrated its adaptability in solving a variety of language-related problems. Notable variants include mBART (multilingual BART), BART-Sum, and BART-large.

Performance

BART has consistently performed strongly on natural language processing benchmarks, routinely producing state-of-the-art results in tasks like text summarization, machine translation, and document classification. Its ability to combine bidirectional pretraining with autoregressive fine-tuning contributes to this performance, making it a strong choice for tasks requiring both text understanding and text generation. The model has accelerated advances in NLP and played an important role in pushing the boundaries of what is possible in language-related tasks.

Limitations

BART has limitations, despite its tremendous potential. The model's large size and computing requirements may make deployment difficult in resource-constrained contexts. Fine-tuning BART for individual tasks can be time-consuming and may require a large amount of task-specific training data. Furthermore, the autoregressive nature of BART's decoding can make text generation relatively slow compared with non-autoregressive approaches. While the model excels at many NLP tasks, it may not be the best option in every situation, and practitioners should examine the trade-offs and requirements of their unique use cases before choosing a model.