![](https://crypto4nerd.com/wp-content/uploads/2023/07/0F4s1Ub0JGnrt1LZw-1024x683.jpeg)
Install the Hugging Face package
!pip install transformers
In Google Colab, run the command above to install the Hugging Face transformers library.
I love to split ’em up
Like a sadistic divorce lawyer would say: “You would be better off separate!”
We have to separate all the words in a sentence into tokens. For that, we first import an object called the tokenizer; the tokenizer class contains all the methods related to creating tokens.
Tokenizing means converting the words in a sentence into unique numbers, because models only understand numbers. We could create our own tokenizer if we were building the model from scratch, but since we are using a pre-trained model, we take its corresponding tokenizer so that the words map to the same IDs in our dataset as in the pre-trained model’s vocabulary.
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
The transformer we are going to use is DistilBERT, a lightweight counterpart of BERT, so we download the corresponding tokenizer.
The first line imports the tokenizer class, and the second creates an instance loaded with pre-trained values. ‘distilbert-base-uncased’ means the model is case-insensitive: all text is lowercased before tokenizing.
Create the dataset
Import the dataset and create a DataFrame using pandas (don’t forget to import pandas):
import pandas as pd

# The sarcasm headlines file is newline-delimited JSON, so lines=True may be needed
df = pd.read_json("/content/drive/MyDrive/Sarcasm_Headlines_Dataset.json", lines=True)
df = df[['headline','is_sarcastic']]
Now we will create a TensorFlow dataset. Instead of using NumPy arrays, it is better to use a tf.data.Dataset, as it provides a host of useful functionality (shuffling, batching, prefetching, and more).
import tensorflow as tf
import numpy as np

# Boolean mask: roughly 80% of the rows go to the training set
is_train = np.random.uniform(size=len(df)) < 0.8

train_raw = (
    tf.data.Dataset.from_tensor_slices((
        dict(tokenizer(list(df['headline'][is_train]), padding=True, truncation=True)),
        np.array(df['is_sarcastic'])[is_train]
    ))
    .shuffle(len(df))
    .batch(64, drop_remainder=True)
    .prefetch(1)  # prefetch returns a new dataset, so it must be part of the chain
)

test_raw = (
    tf.data.Dataset.from_tensor_slices((
        dict(tokenizer(list(df['headline'][~is_train]), padding=True, truncation=True)),
        np.array(df['is_sarcastic'])[~is_train]
    ))
    .shuffle(len(df))
    .batch(64, drop_remainder=True)
    .prefetch(1)
)
The data is divided into two datasets: train_raw for training and test_raw for evaluation. Both have the same structure, with roughly 80% of the data in train_raw and the rest in test_raw.
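The (features-dict, labels) pattern used above can be seen on a tiny synthetic batch. The toy arrays below are made up purely for illustration:

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for tokenizer output and labels: 4 examples, length-3 sequences
features = {
    "input_ids": np.array([[101, 7592, 102], [101, 2088, 102],
                           [101, 2742, 102], [101, 3231, 102]]),
    "attention_mask": np.ones((4, 3), dtype=np.int32),
}
labels = np.array([0, 1, 1, 0])

ds = (tf.data.Dataset.from_tensor_slices((features, labels))
      .shuffle(4)
      .batch(2, drop_remainder=True)
      .prefetch(1))

# Each element is a (dict of tensors, label tensor) pair
for x, y in ds:
    print(x["input_ids"].shape, y.shape)  # (2, 3) and (2,) per batch
```

Because the input is a dict, Keras feeds each key (input_ids, attention_mask) to the matching model input by name.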
The line that I think needs the most explanation is this one —
from_tensor_slices((dict(tokenizer(list(df['headline'][is_train]), padding = True, truncation = True)), np.array(df['is_sarcastic'])[is_train]))
from_tensor_slices is the method used to create the dataset. It receives a tuple of the form (input, output),
where the input is —
dict(tokenizer(list(df['headline'][is_train]), padding = True, truncation = True))
I will be explaining this line inside out —
First, the headlines are selected from the DataFrame at the indexes where is_train is True. These are then converted to a Python list with the list() method.
The list of headlines is then passed into the tokenizer we initialized; the padding = True and truncation = True parameters make all the token sequences the same length.
Finally, the tokenizer output is converted to a dictionary. This step is really important: the tokenizer returns a BatchEncoding object, which is a subclass of dict, but Keras does not recognize it, so the model won’t run unless it is converted to a plain dictionary.
The output is —
np.array(df['is_sarcastic'])[is_train]
Here the is_sarcastic column of the DataFrame is selected, which contains 1 if the headline is sarcastic and 0 if not. The column is converted to a NumPy array, and from this array only the indexes where is_train is True are selected.
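This boolean-mask selection works the same way on any NumPy array. A minimal illustration with made-up values:

```python
import numpy as np

labels = np.array([1, 0, 0, 1, 1])
is_train = np.array([True, True, False, True, False])

# Indexing with a boolean array keeps only the positions where the mask is True
train_labels = labels[is_train]
test_labels = labels[~is_train]   # ~ flips the mask, selecting the rest

print(train_labels)  # [1 0 1]
print(test_labels)   # [0 1]
```

The same mask (and its negation) is what splits both the headlines and the labels consistently into train and test sets.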
Load the Guns and fire
Look at the third import line: we are importing the transformer TFDistilBertForSequenceClassification, an architecture for sequence classification. It supports multi-class classification, but here we will use it for binary classification.
The optimizer used here is Adam, though you can use your own custom optimizer too. The loss function used is sparse categorical cross-entropy; you could also use binary cross-entropy loss if you configure the model with a single output logit.
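One detail worth noting: the model outputs raw logits, so the loss must be built with from_logits = True, which applies the softmax internally. A small check with made-up logits:

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Made-up raw logits for 2 examples over 2 classes, and their true labels
logits = tf.constant([[2.0, -1.0], [0.5, 1.5]])
labels = tf.constant([0, 1])

# from_logits=True softmaxes the logits before computing cross-entropy
print(float(loss_fn(labels, logits)))  # ~0.18 for these values
```

If you forgot from_logits = True, the loss would treat raw logits as probabilities and training would quietly go wrong.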
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from transformers import TFDistilBertForSequenceClassification

num_epochs = 3
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

model.compile(
    optimizer=Adam(5e-5),
    metrics=["accuracy"],
    loss=SparseCategoricalCrossentropy(from_logits=True)
)
model.fit(
    train_raw,
    validation_data=test_raw,
    epochs=num_epochs
)