4. Tokenize the Input: Use the tokenizer to convert your input text into a format that’s suitable for the model. This might involve breaking text down into subwords, encoding it as integers, and adding special tokens.
input_text = "Hello, Hugging Face!"
encoded_input = tokenizer.encode(input_text, return_tensors='pt')
5. Load the Model: Once you’ve chosen a model and tokenized your input, load the pre-trained model.
from transformers import BertForMaskedLM
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
6. Use the Model: With your tokenized input and loaded model, you can now perform inference. The exact method and outputs will vary depending on the model and its design.
output = model(encoded_input)
7. Decoding (if necessary): For certain tasks, you’ll need to decode the model’s output back into human-readable text. This is common for tasks like text generation or sequence-to-sequence tasks.
predicted_token_ids = output.logits.argmax(dim=-1)
predicted_text = tokenizer.decode(predicted_token_ids[0])
8. Fine-tuning (Optional): If you’re not just doing inference but also wish to fine-tune the model on your own dataset, you’ll need to set up a training loop, define a loss function, and update the model’s weights using an optimizer. Hugging Face’s Trainer class simplifies this process.
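As a rough sketch of the manual route (assuming a hypothetical PyTorch DataLoader named train_dataloader whose batches contain input_ids, attention_mask, and labels):
from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=5e-5)
model.train()
for batch in train_dataloader:  # hypothetical DataLoader over your tokenized dataset
    outputs = model(**batch)  # most Transformers models return a loss when labels are supplied
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()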
9. Save & Load Fine-Tuned Model (Optional): After fine-tuning, you can save the model and tokenizer for later use.
# Save model and tokenizer
model.save_pretrained("./my_model_directory/")
tokenizer.save_pretrained("./my_model_directory/")
# Load them back
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("./my_model_directory/")
tokenizer = AutoTokenizer.from_pretrained("./my_model_directory/")
10. Using Pipelines (for simplicity): For many standard tasks (e.g., sentiment analysis, named entity recognition), Hugging Face provides the pipeline utility, which abstracts away much of the above process into a simpler API.
from transformers import pipeline
classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')
result = classifier("I love Hugging Face!")
This is a general overview of the workflow when using models from Hugging Face’s Transformers library. The exact steps and code might differ depending on the specific model and task.
Let’s try topic classification:
Task:
Classify news articles into one of three topics: “Sports”, “Politics”, or “Technology”.
# Topic Classification using Hugging Face's Transformers
# Installation:
# Make sure you've installed the required libraries.
# pip install transformers torch
# Import necessary libraries and modules
from transformers import BertTokenizer, BertForSequenceClassification, pipeline
# 1. Load Model & Tokenizer:
# Using a pre-trained BERT model. For real-world usage, you'd ideally fine-tune this on your specific dataset.
model_name = "bert-base-uncased"
# We specify num_labels=3 since we have three topics: "Sports", "Politics", and "Technology".
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=3)
tokenizer = BertTokenizer.from_pretrained(model_name)
# 2. Create a Classification Pipeline:
topic_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
# 3. Predict:
text = "The latest GPU's have caused a surge in PC gaming popularity."
result = topic_classifier(text)
print(result)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[{'label': 'LABEL_2', 'score': 0.43456608057022095}]
Given the lack of fine-tuning, the output will likely not be meaningful. In practice, you would train the model on labeled data corresponding to the three categories.
Fine-tuning:
To truly make this useful, you’d need to fine-tune the model on a dataset of labeled news articles. This would involve:
- Preprocessing your dataset to tokenize and format the news articles correctly.
- Setting up a training loop or using Hugging Face’s Trainer to fine-tune the model on your specific dataset (see the sketch below).
- Evaluating the model’s performance on a separate test set.
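A minimal Trainer sketch, assuming a hypothetical datasets.Dataset named train_dataset with "text" and "label" columns (0 = Sports, 1 = Politics, 2 = Technology) and the model and tokenizer loaded above:
from transformers import Trainer, TrainingArguments
def tokenize_fn(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")
# train_dataset is a placeholder for your own labeled dataset
tokenized_train = train_dataset.map(tokenize_fn, batched=True)
training_args = TrainingArguments(output_dir="./topic_model", num_train_epochs=3, per_device_train_batch_size=16)
trainer = Trainer(model=model, args=training_args, train_dataset=tokenized_train)
trainer.train()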
- Text summarization involves generating a concise summary retaining the most salient information from a long text document.
- Hugging Face provides pretrained summarization models like BART, T5, Pegasus and mT5 in its model hub.
- These models are pretrained on large datasets to generate summaries of input texts.
- The summarization pipeline in Hugging Face makes it easy to utilize these models out-of-the-box.
- It handles preprocessing the input, passing it to the model, and returning the generated summary.
- Users can fine-tune the summarization models on custom datasets using the Trainer API for better performance.
- Overall, Hugging Face provides easy access to cutting edge summarization models for research and applications.
from transformers import pipeline
summarizer = pipeline("summarization")
text = """"The Tower of London, officially Her Majesty's Royal Palace and Fortress of the Tower of London, is a historic castle on the north bank of the River Thames in central London. It lies within the London Borough of Tower Hamlets, which is separated from the eastern edge of the square mile of the City of London by the open space known as Tower Hill. It was founded towards the end of 1066 as part of the Norman Conquest of England. The White Tower, which gives the entire castle its name, was built by William the Conqueror in 1078 and was a resented symbol of oppression, inflicted upon London by the new ruling elite. The castle was used as a prison from 1100 until 1952, although that was not its primary purpose. A grand palace early in its history, it served as a royal residence. As a whole, the Tower is a complex of several buildings set within two concentric rings of defensive walls and a moat."""
summary = summarizer(text, max_length=130, min_length=30, do_sample=False)
print(summary[0]['summary_text'])
No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Tower of London was founded towards the end of 1066 as part of the Norman
Conquest of England . The White Tower, which gives the entire castle its
name, was built by William the Conqueror in 1078 . The castle was used as
a prison from 1100 until 1952, although that was not its primary purpose .
Here’s a basic example using the BartForConditionalGeneration model.
from transformers import BartForConditionalGeneration, BartTokenizer
# Load the model and tokenizer
model_name = "facebook/bart-large-cnn"
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name)
# Provide a sample text that you want to summarize
text = """
The Hubble Space Telescope has made some of the most dramatic discoveries in the history of astronomy.
From its vantage point 370 miles above Earth, Hubble has beamed back images of distant galaxies,
nebulae, and star clusters, shedding light on nearly every aspect of the universe.
"""
# Encode the text and generate the summarized ids
# Note: the "summarize: " task prefix is a T5 convention; BART does not need it
inputs = tokenizer.encode(text, return_tensors="pt", max_length=1024, truncation=True)
summary_ids = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
# Decode the ids to get the summarized text
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
The Hubble Space Telescope has made some of the most dramatic discoveries
in the history of astronomy. From its vantage point 370 miles above Earth,
Hubble has beamed back images of distant galaxies, nebulae, and star clusters.
We can set a few hyperparameters for the generation, like max_length, min_length, length_penalty, and num_beams, to influence the length and quality of the summary.
Text Summarization Datasets:
Hugging Face’s datasets library provides a collection of datasets that can be readily used for various tasks, including text summarization. Here are some popular datasets for text summarization training; a loading example follows the list:
1. CNN/Daily Mail:
- This is a common dataset used for extractive and abstractive summarization. It contains news articles and their respective summaries.
- Usage:
from datasets import load_dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")
2. XSum:
- The Extreme Summarization (XSum) dataset contains BBC articles accompanied by single-sentence summaries.
3. Gigaword:
- This dataset contains a large number of articles and their respective headlines from various news agencies. It’s typically used for abstractive summarization.
4. MultiNews:
- MultiNews contains news articles and their summaries, which are created by combining multiple articles on the same topic.
5. SAMSum:
- The SAMSum dataset consists of dialogue-based data, providing conversations and their respective summaries.
6. BillSum:
- BillSum contains text from US Congressional and California state bills with human-written summaries.
7. BigPatent:
- As the name suggests, this dataset contains patent documents and their respective abstracts.
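Any of these can be loaded the same way as CNN/Daily Mail above. For instance, XSum (the field names used below, document and summary, are as published on the Hub, but worth verifying against your datasets version):
from datasets import load_dataset
dataset = load_dataset("xsum")
sample = dataset["train"][0]
print(sample["document"][:200])  # the article text
print(sample["summary"])  # the single-sentence reference summary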
Pre-trained Models:
- Hugging Face hosts numerous state-of-the-art translation models in various languages and language pairs.
- Examples include MarianMT, T5, and BERT-based models tailored for translation tasks.
Ease of Use with Pipelines:
- Hugging Face’s pipeline API offers an easy way to perform translation without delving deep into model details. For instance:
translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
Fine-tuning:
- The Transformers library enables users to fine-tune existing translation models on custom datasets, enhancing performance for domain-specific applications.
Datasets for Translation:
- Hugging Face’s datasets library includes numerous datasets suitable for machine translation tasks, such as WMT and Opus. For example:
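A WMT benchmark can be loaded by name and language pair (configuration names can vary across datasets versions, so treat this as a sketch):
from datasets import load_dataset
# Each example holds a "translation" dict keyed by language code
wmt = load_dataset("wmt14", "de-en")
print(wmt["train"][0]["translation"])  # e.g. {'de': '...', 'en': '...'}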
Example:
- Using Default model:
from transformers import pipeline
# Initialize the translation pipeline
translator = pipeline("translation_en_to_fr")
# Provide a sample text that you want to translate
text = "Hello, how are you?"
# Translate the text
translation_output = translator(text)
# Extract the translated text
translated_text = translation_output[0]['translation_text']
print(translated_text)
Bonjour, comment êtes-vous?
2. Use an advanced model for translation:
pip install sentencepiece
from transformers import MarianMTModel, MarianTokenizer
# Define the source language and target language
src_lang = 'en'
tgt_lang = 'de'
# Load the MarianMT model and tokenizer for English to German translation
model_name = "Helsinki-NLP/opus-mt-en-de"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)
# Provide a sample text that you want to translate
text = "Hello, how are you?"
# Tokenize the text and translate
tokenized_text = tokenizer.encode(text, return_tensors="pt")
translated_tokens = model.generate(tokenized_text)
translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
print(translated_text)
Hallo, wie geht's?
- Question answering involves predicting an answer to a question in text format based on context.
- Hugging Face provides pretrained QA models like BERT, ALBERT, and DistilBERT that can be fine-tuned for question answering.
- The models are trained on the SQuAD dataset and can answer questions based on a reference text.
- The question-answering pipeline handles passing question-context inputs to the QA model and extracting the predicted answer.
- Users only need to provide the question and context to the pipeline to get the extracted answer text.
- Models can also predict “no answer” if the context does not contain the answer.
- For unanswerable questions, the pipeline returns an empty string instead of an incorrect answer.
- The BERT-base model reaches roughly 80–90% F1 on SQuAD v1.1, which is close to human performance.
- Users can fine-tune with Trainer API on custom datasets to improve domain-specific performance.
- Example using default model:
from transformers import pipeline
context = r"""The Tower of London, officially Her Majesty's Royal Palace and Fortress of the Tower of London, is a historic castle located on the north bank of the River Thames in central London. It lies within the London Borough of Tower Hamlets, separated from the eastern edge of the square mile of the City of London by the open space known as Tower Hill."""
qa_pipeline = pipeline("question-answering")
question = "Where is the Tower of London located?"
res = qa_pipeline({"question": question, "context": context})
print(res["answer"])
on the north bank of the River Thames in central London
2. Another example using advanced model:
from transformers import BertTokenizer, BertForQuestionAnswering
import torch
# Load tokenizer and model
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)
# Context and Question
context = ("In its early years, the digital data processing industry was dominated by the IBM 701, "
"then eventually the IBM 704, IBM 709, IBM 7040, 7044, IBM 7090 and IBM 7094.")
question = "Which company dominated the digital data processing industry in its early years?"
# Tokenize input
inputs = tokenizer.encode_plus(question, context, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]
# Get answer
output = model(**inputs)
answer_start = torch.argmax(output.start_logits)
answer_end = torch.argmax(output.end_logits) + 1
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
print(answer)
ibm 701
Some important Datasets used for Question-Answering:
- SQuAD (Stanford Question Answering Dataset)
- SQuAD 1.1: Contains 100,000+ question-answer pairs based on 500+ Wikipedia articles.
- SQuAD 2.0: Extends SQuAD 1.1 with questions that do not have an answer in the provided passage, requiring the model to determine when no answer is available.
- NewsQA: A challenging machine comprehension dataset of over 100,000 human-generated question-answer pairs.
- CoQA (Conversational Question Answering Challenge): Contains 127,000+ questions with answers, collected from 16,000+ conversations.
- QuAC (Question Answering in Context): A dataset for modeling, understanding, and participating in information seeking dialog.
- MS MARCO: A large-scale dataset for reading comprehension and question answering. It focuses on real-world questions.
- Natural Questions: Developed by Google AI Language, it uses naturally occurring questions to extract answers from Wikipedia articles.
- RACE: A reading comprehension dataset collected from English examinations in China, which is designed for evaluating machine reading comprehension.
- HotpotQA: A dataset with questions that require finding and reasoning over multiple evidence documents to answer.
- DROP (Discrete Reasoning Over the content of Paragraphs): A reading comprehension benchmark where answering questions requires performing discrete operations over the content of paragraphs.
- DuReader: A large-scale, open-domain Chinese reading comprehension dataset.
- BoolQ: Consists of 15942 yes/no questions about short passages from Wikipedia.
- BioASQ: A challenge on large-scale biomedical semantic indexing and question answering.
- TriviaQA: Contains 650K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents.
To fetch any dataset using the datasets library, you can use:
from datasets import load_dataset
# For example, to load the SQuAD 2.0 dataset:
dataset = load_dataset("squad_v2")
- Text generation involves automatically generating coherent text from a given prompt or topic.
- Hugging Face provides access to models like GPT-2, GPT-Neo, BART, T5 that can generate text.
- GPT-2 and GPT-Neo are auto-regressive language models trained to predict the next word in a sequence.
- T5 and BART are encoder-decoder models that can be tuned for conditional text generation.
- The TextGenerationPipeline handles prompting the model and generating text.
- Models are pretrained on huge text corpora like WebText, BooksCorpus, etc.
- Users can fine-tune models on custom datasets using Trainer API.
- Generation can be tweaked via parameters like max length, repetition penalty, etc.
- Allows generating long-form text like stories, articles, content for websites.
- Text generation has applications in conversational bots, creative writing aid, content creation, etc.
- Default Example:
from transformers import pipeline
text_generator = pipeline("text-generation", model="gpt2")
prompt = "In the kingdom of artificial intelligence,"
print(text_generator(prompt, max_length=50)[0]["generated_text"])
In the kingdom of artificial intelligence, the ability to be intelligent is the only real power in the realm of matter.
2. Use an advanced model:
from transformers import GPTNeoForCausalLM, GPT2Tokenizer
model_name = "EleutherAI/gpt-neo-2.7B"
model = GPTNeoForCausalLM.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
prompt = "In a world where AI and humans coexist,"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
# Generate text using beam search
output = model.generate(
    input_ids,
    max_length=150,
    num_return_sequences=3,
    no_repeat_ngram_size=2,
    temperature=0.7,  # note: temperature only takes effect when do_sample=True
    num_beams=5,  # using beam search with 5 beams
    early_stopping=True
)
for i, text in enumerate(tokenizer.batch_decode(output)):
    print(f"Generated {i + 1}: {text}")
In this article, we will take a look at how AI can be used to solve some of the most pressing problems in our world today, and how it can help us make the world a better place. We will also explore how we can harness the power of AI to make our lives easier and improve the quality of life for everyone on the planet.
AI and the Internet of Things (IoT) have the potential to revolutionize the way we live, work and interact with each other. It is estimated that by 2020, there will be more than 1.5 billion Internet-connected
Datasets:
You can check the available text-generation datasets on the Hugging Face Hub.
- Sentence similarity refers to quantifying how similar two input sentences are semantically.
- It has applications in search, FAQ chatbots, duplicate detection, plagiarism checking etc.
- Hugging Face provides access to pretrained encoders like sentence-transformers/all-MiniLM-L6-v2 model.
- This model encodes input sentences into fixed-length vectors using a Siamese network architecture.
- The vectors are compared using cosine similarity to determine closeness between sentences.
- Cosine scores range from -1 to 1, with values near 1 indicating near-identical meaning. A threshold can separate semantic duplicates.
- The model is trained on Natural Language Inference (NLI) datasets like SNLI, MultiNLI.
- Fine-tuning on domain-specific data can improve performance for niche applications.
- The pipeline handles encoding sentences and computing similarity scores automatically.
- Overall, it enables building semantic search, duplicate detection, and document clustering solutions easily (see the sketch after this list).
- Vector comparisons are faster and more robust compared to rules-based semantic matching.
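Here is a minimal semantic-search sketch along those lines, using the util.semantic_search helper from sentence-transformers (the corpus and query are made up for illustration):
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
corpus = ["The sky is blue.", "I enjoy hiking.", "Cats sleep a lot."]  # toy corpus
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode("What color is the sky?", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
print(hits[0])  # ranked list of {'corpus_id': ..., 'score': ...} dicts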
Check out my article on vector databases for more information.
- Basic example:
from sentence_transformers import SentenceTransformer, util
import torch
# Load a pre-trained sentence-transformer model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
# Example sentences
sentence1 = "The sky is blue."
sentence2 = "Blue is the color of the sky."
# Convert sentences to embeddings
embedding1 = model.encode(sentence1, convert_to_tensor=True)
embedding2 = model.encode(sentence2, convert_to_tensor=True)
# Compute cosine similarity between embeddings
cosine_sim = util.pytorch_cos_sim(embedding1, embedding2)
print(f"Cosine Similarity: {cosine_sim.item()}")
Cosine Similarity: 0.90008145570755
2. Another example using advanced model:
from transformers import BertTokenizer, BertModel
import torch
from torch.nn.functional import cosine_similarity
# Load pre-trained model and tokenizer
model_name = "bert-base-uncased"
model = BertModel.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)
# Define function to convert sentence to embedding
def get_embedding(sentence, model, tokenizer):
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, padding=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    # Average the token embeddings from the last hidden layer to get the sentence embedding
    sentence_embedding = torch.mean(outputs.last_hidden_state[0], dim=0)
    return sentence_embedding
# Example sentences
sentence1 = "The sky is blue."
sentence2 = "Blue is the color of the sky."
# Get embeddings
embedding1 = get_embedding(sentence1, model, tokenizer)
embedding2 = get_embedding(sentence2, model, tokenizer)
# Compute cosine similarity
similarity = cosine_similarity(embedding1.unsqueeze(0), embedding2.unsqueeze(0))
print(f"Cosine Similarity: {similarity.item()}")
Cosine Similarity: 0.7346132397651672
Note the lower score: vanilla BERT embeddings are not trained for semantic similarity, which is why dedicated sentence-transformer models usually score paraphrases like these higher. Check the datasets used for sentence similarity.
- Zero-shot classification involves predicting classes that were not seen during model training.
- Useful when new classes appear at inference time that were not available during training.
- Works better for some classes than others, depending on how well the candidate label describes them.
- Useful for datasets with continuously growing or shifting classes.
- Avoids retraining model from scratch each time new classes are added.
- Overall, enables adapting models to new concepts on the fly without explicit training.
- Default Model:
from transformers import pipeline
# Initialize the zero-shot classification pipeline
classifier = pipeline("zero-shot-classification")
# Define the sequence to classify and potential labels
sequence = "I love hiking in the mountains."
candidate_labels = ["entertainment", "sports", "nature activity"]
# Classify the sequence
result = classifier(sequence, candidate_labels)
# Display result
print("Sequence:", sequence)
print("Predicted label:", result["labels"][0])
print("Confidence scores:", result["scores"])
No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Sequence: I love hiking in the mountains.
Predicted label: nature activity
Confidence scores: [0.9644157886505127, 0.01824183762073517, 0.017342381179332733]
2. Using an Advanced Model:
from transformers import BartForSequenceClassification, BartTokenizer
from transformers import pipeline
# Load the BART model and tokenizer
model_name = "facebook/bart-large-mnli"
model = BartForSequenceClassification.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name)
# Initialize the zero-shot classification pipeline using BART
classifier = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)
# Define the sequence to classify and potential labels
sequence = "I love hiking in the mountains."
candidate_labels = ["entertainment", "sports", "nature activity"]
# Classify the sequence
result = classifier(sequence, candidate_labels)
# Display result
print("Sequence:", sequence)
print("Predicted label:", result["labels"][0])
print("Confidence scores:", result["scores"])
Sequence: I love hiking in the mountains.
Predicted label: nature activity
Confidence scores: [0.9644157886505127, 0.01824183762073517, 0.017342381179332733]
In this example, I specifically use the BART model pre-trained on the MultiNLI (MNLI) dataset, which is suited for zero-shot classification.
- NER involves identifying and classifying named entities like people, organizations, and locations in text.
- Useful for extracting structured information from unstructured documents.
- Hugging Face provides pre-trained NER models like BERT, RoBERTa, XLM-RoBERTa.
- These models label words/spans in a text into pre-defined entity categories.
- Common entities annotated are PERSON, ORG, LOCATION, DATE, TIME, MONEY, etc.
- Models are trained on datasets like CoNLL-2003, OntoNotes, WNUT-17.
- The TokenClassificationPipeline handles feeding text to the model and extracting entity labels.
- NER models can be fine-tuned on custom data using Trainer API.
- These models achieve F1 scores of over 90% on common benchmark datasets.
- They are significantly more accurate than older CRF-based statistical NER systems.
- Enables information extraction from text for knowledge bases, chatbots, search etc.
- Basic example with the default model:
from transformers import pipeline
ner = pipeline("ner")
text = "My name is Sarah and I live in London, UK."
ner_results = ner(text)
for entity in ner_results:
    print(entity["word"], entity["entity"])
Sarah I-PER
London I-LOC
UK I-LOC
2. Using an advanced model:
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
# Define the model and tokenizer
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Create a NER pipeline
nlp_ner = pipeline("ner", model=model, tokenizer=tokenizer)
# Provide a sample text
text = "Elon Musk is the CEO of SpaceX and Tesla."
# Get NER results
ner_results = nlp_ner(text)
# Print results
for entity in ner_results:
    print(f"Entity: {entity['word']}, Label: {entity['entity']}, Score: {entity['score']:.4f}")
Entity: El, Label: I-PER, Score: 0.9996
Entity: ##on, Label: I-PER, Score: 0.9990
Entity: Mu, Label: I-PER, Score: 0.9993
Entity: ##sk, Label: I-PER, Score: 0.9985
Entity: Space, Label: I-ORG, Score: 0.9992
Entity: ##X, Label: I-ORG, Score: 0.9986
Entity: Te, Label: I-ORG, Score: 0.9964
Entity: ##sla, Label: I-ORG, Score: 0.9953
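The output above is split into WordPiece fragments. Recent versions of transformers let the token-classification pipeline merge fragments back into whole entities via the aggregation_strategy argument (older releases used grouped_entities=True); note that the result key then changes from entity to entity_group:
nlp_ner_grouped = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
for entity in nlp_ner_grouped(text):
    print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.4f}")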
from transformers import pipeline
# Sentiment Analysis
sentiment_pipeline = pipeline("sentiment-analysis")
# Text Classification
classifier = pipeline("text-classification")
# Token Classification (e.g., Named Entity Recognition)
ner_pipeline = pipeline("ner")
# Question Answering
qa_pipeline = pipeline("question-answering")
# Masked Language Modeling
fill_mask = pipeline("fill-mask")
# Summarization
summarizer = pipeline("summarization")
# Translation (e.g., English to French)
translator = pipeline("translation_en_to_fr")
# Feature Extraction
feature_extraction = pipeline("feature-extraction")
# Text Generation
generator = pipeline("text-generation")
# Zero-shot Classification
zero_shot_classifier = pipeline("zero-shot-classification")
# Conversation
conversational_pipeline = pipeline("conversational")
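Most of these pipelines are called directly on strings, as shown throughout this article. The conversational pipeline is the exception: in the transformers versions that include it, it expects a Conversation object:
from transformers import Conversation
conversation = Conversation("Any movie recommendations for tonight?")
result = conversational_pipeline(conversation)
print(result.generated_responses[-1])  # the model's latest reply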