In the vast and evolving world of Natural Language Processing (NLP), the ability to effectively preprocess text data is fundamental. This introductory guide is designed to walk you through the steps of cleaning and preparing text data, and then using it to build a simple yet powerful Logistic Regression classifier. Whether you’re a student, a budding data scientist, or just curious about NLP, this guide is your starting point to understand the core concepts and implement them in Python.
Before diving into the world of text preprocessing and Logistic Regression, you need to set up your Python environment. If you haven’t already, download and install Python from python.org. We’ll also need a few libraries, namely numpy, pandas, scikit-learn, and nltk, which are essential for data manipulation, machine learning, and natural language processing.
Install these libraries using pip (Python’s package installer) by running the following command in your terminal or command prompt:
```
pip install numpy pandas scikit-learn nltk
```
Text preprocessing is the process of cleaning and preparing text data for use in machine learning models. It’s a crucial step because the quality of data fed into the models determines the accuracy of the results. Key steps in text preprocessing include:
- Cleaning Text: Involves removing unnecessary characters like punctuation, digits, or special symbols.
- Stemming: Reducing words to their base or root form. For example, “running” becomes “run”.
- Lemmatization: Similar to stemming, but it takes the word’s context and part of speech into account, so “better” (as an adjective) becomes “good” (see the short demo after this list).
- Term Frequency-Inverse Document Frequency (TF-IDF): A statistical measure of how important a word is to a document within a corpus.
- Bag of Words (BoW): A representation of text that describes the occurrence of words within a document, ignoring word order.
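To make the difference between stemming and lemmatization concrete, here is a minimal sketch using NLTK. One detail worth knowing: the WordNet lemmatizer only maps “better” to “good” when it is told the word is an adjective (pos="a"); with its default setting (noun) it leaves the word unchanged.
```python
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# The lemmatizer needs the WordNet data
nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))                  # run
print(lemmatizer.lemmatize("better", pos="a"))  # good (treated as an adjective)
print(lemmatizer.lemmatize("better"))           # better (default part of speech is noun)
```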
Let’s start with a small example dataset of four labeled sentences:
```python
import pandas as pd

# Sample dataset
data = {
    'text': ['This is the first document.',
             'This document is the second document.',
             'And this is the third one.',
             'Is this the first document?'],
    'label': [1, 0, 0, 1]
}
df = pd.DataFrame(data)
```
Now, let’s clean and preprocess this text data:
First, we’ll remove punctuation and convert the text to lowercase:
```python
import string

# Function to clean text: lowercase it and strip punctuation
def clean_text(text):
    text = text.lower()
    text = ''.join([char for char in text if char not in string.punctuation])
    return text

# Applying the cleaning function to our dataset
df['cleaned_text'] = df['text'].apply(clean_text)
```
For stemming and lemmatization, we’ll use the nltk library:
```python
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK data
nltk.download('wordnet')

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to apply stemming
def stem_text(text):
    return ' '.join([stemmer.stem(word) for word in text.split()])

# Function to apply lemmatization
def lemmatize_text(text):
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

# Applying stemming and lemmatization
df['stemmed_text'] = df['cleaned_text'].apply(stem_text)
df['lemmatized_text'] = df['cleaned_text'].apply(lemmatize_text)
```
Next, let’s look at TF-IDF and Bag of Words in action: two powerful techniques for converting text into a numeric format that can be used for machine learning.
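As a quick, minimal sketch (assuming the df built above, with its lemmatized_text column, is in scope), scikit-learn’s CountVectorizer produces the Bag of Words counts and TfidfVectorizer produces the TF-IDF weights for the same corpus. On scikit-learn versions older than 1.0, get_feature_names_out() is named get_feature_names().
```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = df['lemmatized_text']

# Bag of Words: raw word counts per document
bow = CountVectorizer()
bow_matrix = bow.fit_transform(corpus)
print(bow.get_feature_names_out())  # the learned vocabulary
print(bow_matrix.toarray())         # one row of counts per document

# TF-IDF: counts re-weighted so words that appear in every document carry less weight
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(corpus)
print(tfidf_matrix.toarray())
```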
Logistic Regression is a statistical method used for binary classification, which means it’s used to categorize data into one of two groups. In the context of text classification, it helps us to determine whether a piece of text belongs to a category or not (such as spam or not spam, positive or negative sentiment).
It’s a suitable choice for beginners for several reasons:
- Simplicity: Logistic Regression is straightforward to understand and implement.
- Efficiency: It performs well with smaller datasets and fewer features.
- Interpretability: The output of Logistic Regression can be interpreted as probabilities (see the short sketch after this list).
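Under the hood, Logistic Regression computes a weighted sum of the input features and passes it through the sigmoid (logistic) function, which squashes any real number into a probability between 0 and 1. Here is a minimal sketch of that mapping; the weights and feature values are made up purely for illustration.
```python
import numpy as np

def sigmoid(z):
    # Maps any real-valued score to a probability in (0, 1)
    return 1 / (1 + np.exp(-z))

# Hypothetical learned weights, feature values, and bias
weights = np.array([0.8, -1.2, 0.5])
features = np.array([1.0, 0.3, 2.0])
bias = -0.1

z = np.dot(weights, features) + bias  # the linear score
print(sigmoid(z))                     # probability of the positive class; predict 1 if >= 0.5
```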
Now, let’s use our preprocessed text data to build a Logistic Regression model.
First, we split our dataset into training and testing sets (with only four sentences this split is purely illustrative, but it mirrors what you would do with a real corpus):
```python
from sklearn.model_selection import train_test_split

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df['lemmatized_text'], df['label'], test_size=0.2, random_state=42)
```
We’ll use the TF-IDF representation for our text data and train our Logistic Regression model:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Creating a pipeline with a TF-IDF vectorizer followed by Logistic Regression
model = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LogisticRegression())])

# Training the model
model.fit(X_train, y_train)
```
This pipeline first converts the text data into its TF-IDF representation and then applies Logistic Regression.
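Because the vectorizer and classifier are fitted together, you can also peek inside the trained pipeline. This small sketch (assuming scikit-learn 1.0 or newer for get_feature_names_out) prints each word in the learned vocabulary alongside the coefficient the classifier assigned to it; positive weights push a document toward label 1, negative weights toward label 0.
```python
# Inspecting the fitted pipeline (run after model.fit)
vocab = model.named_steps['tfidf'].get_feature_names_out()
coefs = model.named_steps['clf'].coef_[0]

# Pair each word with its learned weight, strongest positive first
for word, weight in sorted(zip(vocab, coefs), key=lambda pair: pair[1], reverse=True):
    print(f"{word}: {weight:.3f}")
```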
Evaluating the performance of our model is crucial. We use metrics like accuracy, precision, recall, and F1-score:
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Predicting on the test set
y_pred = model.predict(X_test)

# Evaluating the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1-Score:", f1_score(y_test, y_pred))
```
Each of these metrics gives a different view of performance: accuracy is the fraction of predictions that are correct, precision is the share of predicted positives that are truly positive, recall is the share of actual positives the model finds, and the F1-score is the harmonic mean of precision and recall.
Finally, let’s see how our model performs on new, unseen data. The new text must go through the same preprocessing steps as the training data before it reaches the model:
```python
# New data sample
new_data = ["This is a new document to classify."]

# Apply the same preprocessing used for training: clean, then lemmatize
new_data_prepared = [lemmatize_text(clean_text(doc)) for doc in new_data]

# Making predictions
print("Prediction:", model.predict(new_data_prepared))
```
This will output the category that our model predicts for the new text.
In this guide, we’ve covered the basics of text preprocessing, introduced Logistic Regression, and walked through building, evaluating, and using a text classification model. The world of NLP is vast, and there’s much more to explore. I encourage you to experiment with different preprocessing techniques, tweak the model parameters, and see how they affect your model’s performance.