As a corporate Data Scientist, you have to stay up to date with current tools and technologies to deliver AI products efficiently. If you're targeting a role in Natural Language Processing (NLP) in particular, Hugging Face becomes invaluable.
Hugging Face, an open-source provider of NLP tools, is one of the platforms spearheading the adoption of NLP in industry: it hosts over 10,000 models that can be fine-tuned.
This article introduces you to Hugging Face and its premier library, Transformers, to help you embark on your NLP journey.
Hugging Face is an open-source platform that provides tools and resources for NLP, Machine Learning, and Data Science. It’s best known for its library, Transformers, which offers thousands of pre-trained models to perform tasks like text classification, information extraction, summarization, translation, and more.
Datasets
Hugging Face offers the Datasets library, which simplifies access to and preprocessing of a vast range of NLP datasets.
Each dataset has a Dataset Card: structured documentation that provides essential details about the dataset. It includes metadata such as the description, source, version, size, licensing information, and any known biases in the data. Dataset Cards also outline the structure of the dataset, explaining each field and its possible values, and describe data collection and processing, annotation processes, and data splits.
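For instance, here is a minimal sketch (not from the original article) of reading a dataset's card-level metadata with the Datasets library before downloading any data:
from datasets import load_dataset_builder

# Illustrative sketch: inspect a dataset's metadata, mirroring its Dataset Card,
# without downloading the data itself.
builder = load_dataset_builder("ade_corpus_v2", "Ade_corpus_v2_drug_ade_relation")
print(builder.info.description)  # free-text description from the card
print(builder.info.features)     # fields and their types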
Models
Hugging Face also has a Model Hub, a collaborative platform where AI researchers and developers can share and collaborate on models.
Each model has a Model Card, which is similar to a README file. It describes the model and how to use it.
You can also take a look at the files associated with a specific model.
Inference API
Furthermore, Hugging Face offers an Inference API that allows developers to integrate Hugging Face models into applications without dealing with the nuances of setting up model infrastructure.
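As an illustration, here is a minimal sketch of calling the hosted Inference API with the requests library; the model id and token below are placeholders, not something from the original article:
import requests

# Hedged sketch of a call to the hosted Inference API.
API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"
headers = {"Authorization": "Bearer YOUR_HF_API_TOKEN"}  # placeholder token

response = requests.post(API_URL, headers=headers, json={"inputs": "Hugging Face makes NLP easier."})
print(response.json())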
First, install all the relevant libraries:
pip install datasets transformers seqeval
The next step is to load our dataset. ADE Corpus V2 bundles several configurations, each serving a different purpose. We are specifically interested in detecting adverse events in text, so we will load this configuration:
from datasets import load_dataset

datasets = load_dataset("ade_corpus_v2", "Ade_corpus_v2_drug_ade_relation")
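Printing the loaded object and a single record is a quick way to confirm that the splits and fields match the Dataset Card (an illustrative check, not part of the original code):
print(datasets)              # available splits and number of rows
print(datasets["train"][0])  # one example with its raw text and annotations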
Now, we will load the tokenizer for SciBERT:
task = "ner" # Should be one of "ner", "pos" or "chunk"
model_checkpoint = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
The variable `task` is defined as "ner", signifying a Named Entity Recognition task, which identifies named entities in text. The `model_checkpoint` variable is assigned the identifier "allenai/scibert_scivocab_uncased", which points to the SciBERT model with a scientific vocabulary and uncased text, provided by the Allen Institute for AI on the Hugging Face Model Hub. The third line initializes a tokenizer, a key element in any NLP task, responsible for converting input text into tokens, the format that the model can process. The `AutoTokenizer.from_pretrained()` function automatically detects and loads the corresponding tokenizer class and its pre-trained weights for the specified model checkpoint.
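To see what this produces, here is a quick illustration (not from the original article) of tokenizing a sentence from this domain:
# Illustrative example of the tokenizer's output.
tokens = tokenizer.tokenize("The patient developed a rash after taking the drug.")
print(tokens)                # WordPiece tokens from SciBERT's vocabulary
encoded = tokenizer("The patient developed a rash after taking the drug.")
print(encoded["input_ids"])  # token ids, including the [CLS] and [SEP] special tokens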
In order to start training, we have to set up the training arguments.
from transformers import TrainingArguments

batch_size = 16  # assumed value; the original snippet uses batch_size without defining it

model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy="epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.05,
    logging_steps=1,
)
This block of code is setting up the training parameters for a machine learning model using Hugging Face’s Transformers library.
- `model_name = model_checkpoint.split("/")[-1]`: extracts the name of the model from the `model_checkpoint` string.
- `args = TrainingArguments(...)`: sets up the training arguments for the model:
  - `f"{model_name}-finetuned-{task}"`: the output directory where the results (model, configuration, tokenizer) will be saved. Its name is generated by combining the `model_name` with the string "finetuned" and the task.
  - `evaluation_strategy = "epoch"`: determines how often evaluation should be performed. When set to "epoch", evaluation runs at the end of each training epoch.
  - `per_device_train_batch_size=batch_size` and `per_device_eval_batch_size=batch_size`: the batch sizes for training and evaluation, i.e. the number of samples to work through before the model's internal parameters are updated, and the number of samples the model evaluates in one pass, respectively.
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)
A data collator is a function used during training to form a batch by combining multiple data samples. In this context, `DataCollatorForTokenClassification` is a specific type of data collator provided by Hugging Face's Transformers library that is suitable for tasks like Named Entity Recognition (NER), where each token in a sentence needs to be classified. It takes a tokenizer as an argument, which is used to convert input text into a format the model can understand. The data collator handles the necessary formatting of model inputs, such as padding variable-length sequences so that all sequences in a batch have the same length, along with other arrangements specific to token-level tasks.
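The following sketch (illustrative only, with made-up token ids and labels) shows what the collator does to a small batch of uneven examples:
# Illustrative batch: two examples of different lengths.
features = [
    {"input_ids": [101, 7592, 102], "labels": [0, 1, 0]},
    {"input_ids": [101, 7592, 2088, 999, 102], "labels": [0, 1, 2, 0, 0]},
]
batch = data_collator(features)
print(batch["input_ids"].shape)  # both examples padded to the longest length in the batch
print(batch["labels"])           # padded label positions are filled with -100 so the loss ignores them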
Next, we will load our metric:
from datasets import load_metric

metric = load_metric("seqeval")
This line is loading a metric used to evaluate the model's performance. `load_metric` is a function from Hugging Face's `datasets` library that loads a metric by name. In this case, the "seqeval" metric is being loaded, which is commonly used for sequence labeling tasks. `seqeval` calculates precision, recall, and F1-score for tasks like NER or POS tagging, taking into account the sequences of labels rather than just individual labels.
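The Trainer below also expects a `compute_metrics` function, which the article's snippets do not show. Here is a minimal sketch, assuming `label_list` is the list of entity tags built during preprocessing in the full notebook (it is not defined in this article):
import numpy as np

def compute_metrics(eval_preds):
    # eval_preds contains the raw logits and the true label ids.
    predictions, labels = eval_preds
    predictions = np.argmax(predictions, axis=2)

    # Drop positions labeled -100 (special tokens and padding), which should be ignored.
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }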
In order to start training, we will initialize our Trainer. The Trainer also needs the model itself, which the full notebook loads as a token-classification head on top of SciBERT; a minimal sketch of that step is shown below, followed by the trainer setup.
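This sketch again assumes `label_list` is the tag set built during preprocessing (for this corpus, something like O, B-DRUG, I-DRUG, B-EFFECT, I-EFFECT):
from transformers import AutoModelForTokenClassification

# Load SciBERT with a freshly initialized token-classification head.
model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint, num_labels=len(label_list)
)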
from transformers import Trainer

# `labeled_dataset` is the tokenized and label-aligned version of the raw dataset;
# that preprocessing step is part of the full notebook linked at the end of this article.
trainer = Trainer(
    model,
    args,
    train_dataset=labeled_dataset["train"],
    eval_dataset=labeled_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
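Once training finishes, the same Trainer can report the seqeval metrics on the held-out split and persist the fine-tuned model (a short, optional follow-up):
trainer.evaluate()    # precision / recall / F1 computed by compute_metrics on the eval set
trainer.save_model()  # writes the model (and tokenizer) to the output directory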
Hugging Face offers a highly streamlined implementation of model prediction pipelines through its built-in `pipeline` function. This method can be employed to swiftly apply Named Entity Recognition (NER) models to data, as illustrated below:
from transformers import pipeline

effect_ner_model = pipeline(task="ner", model=model, tokenizer=tokenizer, device=0)
# something from our validation set
effect_ner_model(labeled_dataset["test"][4]["text"])
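The pipeline can also be run on any raw sentence; the example below is illustrative, and the exact entity labels depend on the tag set used during fine-tuning:
effect_ner_model("The patient developed a severe rash after starting the medication.")
# Returns a list of dicts, one per predicted token, with keys such as
# 'entity', 'score', 'word', 'start' and 'end'.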
Hugging Face’s Transformers library provides a simple, powerful toolset for working with state-of-the-art NLP models. Whether you’re a beginner or an experienced AI practitioner, Hugging Face opens up an accessible pathway to incorporate NLP into your projects. Enjoy exploring the possibilities of language and AI with Hugging Face!
Please remember to have your Python environment set up and be mindful of necessary computational resources when running these deep learning models.
Justin S. Lee has kindly consented to the use of their code for this article. You can find the full code here: https://github.com/jsylee/personal-projects/blob/master/Hugging%20Face%20ADR%20Fine-Tuning/SciBERT%20ADR%20Fine-Tuning.ipynb
You can find me on Instagram: @IDissectData