![](https://crypto4nerd.com/wp-content/uploads/2024/04/076N6CHrg81IPqi01-1024x256.png)
Importing Dependencies:
To kickstart our sales prediction journey, we first need to import the necessary Python libraries and modules that will facilitate our data analysis and modeling tasks. Let’s break down each of these imports:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn import metrics
- NumPy (np): NumPy is a fundamental package for scientific computing in Python. It provides support for arrays, matrices, and mathematical functions, which are essential for numerical operations in machine learning.
- Pandas (pd): Pandas is a powerful library for data manipulation and analysis. It offers data structures like DataFrame and Series, which enable us to handle structured data efficiently.
- Matplotlib and Seaborn: These libraries are used for data visualization in Python. Matplotlib provides a MATLAB-like interface, while Seaborn offers a high-level interface for drawing attractive and informative statistical graphics.
- LabelEncoder: LabelEncoder is a preprocessing utility in scikit-learn that encodes categorical labels with numeric values. It’s particularly useful for transforming categorical variables into a format suitable for machine learning algorithms.
- train_test_split: This function from scikit-learn is used to split datasets into training and testing sets. It’s essential for evaluating the performance of machine learning models on unseen data.
- XGBRegressor: XGBoost is an optimized gradient-boosting library widely used for regression and classification tasks. XGBRegressor is its scikit-learn-compatible estimator for regression problems.
- Metrics: Scikit-learn provides various metrics for evaluating the performance of machine learning models. We’ll use these metrics, such as mean squared error (MSE) and R-squared score, to assess the accuracy of our sales prediction model.
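As a quick illustration of the last two preprocessing imports, here is a minimal sketch on invented toy data showing what LabelEncoder and train_test_split actually produce:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Hypothetical toy data: one categorical feature and a numeric target
sizes = np.array(['Small', 'Medium', 'High', 'Small', 'Medium', 'Small'])
sales = np.array([120.0, 340.0, 510.0, 150.0, 300.0, 130.0])

# LabelEncoder maps each category to an integer (classes sorted alphabetically)
encoder = LabelEncoder()
encoded = encoder.fit_transform(sizes)
print(list(encoder.classes_))  # ['High', 'Medium', 'Small']
print(encoded.tolist())        # [2, 1, 0, 2, 1, 2]

# train_test_split holds out a fraction of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    encoded.reshape(-1, 1), sales, test_size=0.33, random_state=42
)
print(len(X_train), len(X_test))  # 4 2
```

Note that LabelEncoder is intended for target labels; for input features, one-hot encoding is often the safer choice, but the simple integer mapping is what this tutorial uses.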
Section 2: Load the Dataset
In this section, we’ll load the dataset into a Pandas DataFrame. The dataset contains historical sales data that we’ll use to train our machine learning model.
# loading the data from csv file to Pandas DataFrame
big_mart_data = pd.read_csv('/content/Train.csv')
Here’s a breakdown of what’s happening in this code snippet:
- pd.read_csv(): Pandas provides the read_csv() function, which reads data from a CSV file into a DataFrame. By default it expects comma-separated values; other delimiters can be specified with the sep parameter.
- ‘/content/Train.csv’: This is the file path of the CSV file containing our dataset. Make sure to provide the correct path to wherever your dataset is located. In this example, the dataset is named ‘Train.csv’ and sits in the ‘/content/’ directory (the default working directory in Google Colab).
- big_mart_data: We assign the DataFrame returned by read_csv() to the variable big_mart_data. This DataFrame holds the entire dataset, including the features (independent variables) and the target variable (sales).
By loading the dataset into a Pandas DataFrame, we can easily explore the data, perform data preprocessing tasks, and train our machine learning model.
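If Train.csv is not at hand, the same call can be exercised on a small in-memory CSV; the rows below are hypothetical stand-ins for the real BigMart records:

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for '/content/Train.csv'
csv_text = """Item_Identifier,Item_Weight,Item_Outlet_Sales
FDA15,9.3,3735.14
DRC01,5.92,443.42
FDN15,17.5,2097.27
"""

# read_csv accepts a file path or any file-like object
big_mart_data = pd.read_csv(io.StringIO(csv_text))
print(big_mart_data.shape)           # (3, 3)
print(list(big_mart_data.columns))   # ['Item_Identifier', 'Item_Weight', 'Item_Outlet_Sales']
```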
Section 3: Data Preprocessing
In this section, we’ll perform data preprocessing tasks to ensure that our dataset is clean and ready for model training. Let’s go through each step:
# first 5 rows of the dataframe
print(big_mart_data.head())

# number of data points & number of features
print("Shape of the dataset:", big_mart_data.shape)
# getting some information about the dataset
print(big_mart_data.info())
# checking for missing values
print("Missing values in the dataset:")
print(big_mart_data.isnull().sum())
Here’s a detailed breakdown of the code snippet:
- big_mart_data.head(): Displays the first 5 rows of the DataFrame. It provides a quick glimpse of the dataset, including the column names and the values in the first few rows.
- big_mart_data.shape: Returns the shape of the dataset as a (rows, columns) tuple, i.e. the number of data points and the number of features. Knowing the dimensions of the dataset is essential before any analysis or modeling.
- big_mart_data.info(): Provides a concise summary of the DataFrame, including the data type of each column and the number of non-null values. It’s useful for understanding the overall structure of the dataset and spotting data-type mismatches or missing values.
- big_mart_data.isnull().sum(): Counts the missing values (NaN) in each column. Calling isnull() on the DataFrame produces a Boolean DataFrame, and sum() then totals the True values per column.
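A tiny hypothetical DataFrame with one deliberate gap per column makes the output of these inspection calls concrete:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each column (values are invented)
df = pd.DataFrame({
    'Item_Weight': [9.3, np.nan, 17.5, 8.9],
    'Outlet_Size': ['Medium', 'Small', None, 'High'],
})

print(df.shape)  # (4, 2)

# isnull() marks each cell True/False; sum() counts the Trues per column
print(df.isnull().sum())
# Item_Weight    1
# Outlet_Size    1
```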
Handling Missing Values
In this section, we’ll address missing values in our dataset by imputing them with appropriate values. We’ll use the mean for numerical columns and the mode for categorical columns. Let’s break down each step:
# mean value of "Item_Weight" column
mean_item_weight = big_mart_data['Item_Weight'].mean()

# filling the missing values in "Item_Weight" column with the mean value
big_mart_data['Item_Weight'].fillna(mean_item_weight, inplace=True)
# mode of "Outlet_Size" column
mode_outlet_size = big_mart_data['Outlet_Size'].mode()
# filling the missing values in "Outlet_Size" column with Mode
mode_of_outlet_size = big_mart_data.pivot_table(values='Outlet_Size', columns='Outlet_Type', aggfunc=(lambda x: x.mode()[0]))
print("Mode of Outlet Size:")
print(mode_of_outlet_size)
miss_values = big_mart_data['Outlet_Size'].isnull()
print("Missing values in Outlet Size:")
print(miss_values)
# fill each missing Outlet_Size with the mode for that row's Outlet_Type
# (.iloc[0] extracts the scalar value from the pivot-table lookup)
big_mart_data.loc[miss_values, 'Outlet_Size'] = big_mart_data.loc[miss_values, 'Outlet_Type'].apply(lambda x: mode_of_outlet_size[x].iloc[0])
# checking for missing values
print("Updated missing values in the dataset:")
print(big_mart_data.isnull().sum())
Let’s go through each step:
- mean_item_weight = big_mart_data['Item_Weight'].mean(): Calculates the mean of the “Item_Weight” column using the mean() method.
- big_mart_data['Item_Weight'].fillna(mean_item_weight, inplace=True): Fills the missing values in the “Item_Weight” column with the calculated mean. The fillna() method replaces missing values with a specified value, and inplace=True applies the change to the DataFrame directly. (In recent pandas versions the assignment form big_mart_data['Item_Weight'] = big_mart_data['Item_Weight'].fillna(mean_item_weight) is preferred, since chained inplace calls trigger warnings.)
- mode_outlet_size = big_mart_data['Outlet_Size'].mode(): Calculates the overall mode of the “Outlet_Size” column using the mode() method. This gives the single most frequent value for reference; the imputation below uses a per-outlet-type mode instead.
- mode_of_outlet_size = big_mart_data.pivot_table(values='Outlet_Size', columns='Outlet_Type', aggfunc=lambda x: x.mode()[0]): Builds a pivot table containing the mode of “Outlet_Size” for each unique “Outlet_Type”, so that missing “Outlet_Size” values can be filled based on the corresponding “Outlet_Type”.
- big_mart_data.loc[miss_values, 'Outlet_Size'] = big_mart_data.loc[miss_values, 'Outlet_Type'].apply(...): Looks up the per-outlet-type mode for each row with a missing “Outlet_Size” and fills it in.
- print("Updated missing values in the dataset:"): Prints the updated count of missing values after imputation, confirming that the gaps have been filled.
By handling missing values appropriately, we ensure that our dataset is clean and ready for further analysis and modeling.
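As a sanity check, the pivot-table imputation above can be reproduced end-to-end on a small hypothetical frame; note the .iloc[0] in the lookup, which pulls the scalar mode out of the single-row pivot table:

```python
import numpy as np
import pandas as pd

# Invented rows: each Outlet_Type has a clear modal Outlet_Size plus a gap
df = pd.DataFrame({
    'Outlet_Type': ['Grocery Store', 'Grocery Store', 'Supermarket Type1',
                    'Supermarket Type1', 'Supermarket Type1'],
    'Outlet_Size': ['Small', np.nan, 'Medium', 'Medium', np.nan],
})

# Mode of Outlet_Size per Outlet_Type (columns = outlet types, one row)
mode_of_outlet_size = df.pivot_table(
    values='Outlet_Size', columns='Outlet_Type',
    aggfunc=lambda x: x.mode()[0]
)

# Fill each missing Outlet_Size from the mode of that row's Outlet_Type
miss = df['Outlet_Size'].isnull()
df.loc[miss, 'Outlet_Size'] = df.loc[miss, 'Outlet_Type'].apply(
    lambda x: mode_of_outlet_size[x].iloc[0]
)
print(df['Outlet_Size'].tolist())
# ['Small', 'Small', 'Medium', 'Medium', 'Medium']
```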
Data Analysis
Summary statistics of numerical columns:
print("Summary statistics of numerical columns:")
print(big_mart_data.describe())
- This code snippet prints the summary statistics of numerical columns in the dataset, including count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum. It provides an overview of the distribution and central tendency of numerical features.
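On a tiny toy series, the numbers describe() reports are easy to verify by hand:

```python
import pandas as pd

# Hypothetical prices; small enough to check the statistics mentally
s = pd.Series([10.0, 20.0, 30.0, 40.0], name='Item_MRP')
stats = s.describe()

print(stats['count'])              # 4.0
print(stats['mean'])               # 25.0
print(stats['50%'])                # 25.0 (the median)
print(stats['min'], stats['max'])  # 10.0 40.0
```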
Item_Weight distribution:
plt.figure(figsize=(6,6))
sns.histplot(big_mart_data['Item_Weight'], kde=True, stat='density')  # distplot() is deprecated in recent Seaborn
plt.title('Item Weight Distribution')
plt.xlabel('Item Weight')
plt.ylabel('Density')
plt.show()
- This code snippet creates a figure with the specified size and plots the distribution of the ‘Item_Weight’ column as a density histogram with an overlaid KDE curve. (Seaborn’s older distplot() is deprecated and removed in recent releases; histplot() with kde=True and stat='density' is the modern equivalent.) The title, x-axis label, and y-axis label describe the plot, and plt.show() displays it.
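These plotting calls assume an interactive display such as a notebook. When running as a plain script or in CI, a sketch like the following renders the same kind of distribution plot headlessly and writes it to a file (the weights here are randomly generated stand-ins, and plain Matplotlib is used to keep dependencies minimal):

```python
import os
import matplotlib
matplotlib.use('Agg')  # headless backend: render without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(loc=12.8, scale=4.6, size=500)  # hypothetical item weights

plt.figure(figsize=(6, 6))
plt.hist(weights, bins=30, density=True)  # density histogram, like the plot above
plt.title('Item Weight Distribution')
plt.xlabel('Item Weight')
plt.ylabel('Density')
plt.savefig('item_weight_distribution.png')
plt.close()

print(os.path.getsize('item_weight_distribution.png') > 0)  # True
```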
Item Visibility distribution:
plt.figure(figsize=(6,6))
sns.histplot(big_mart_data['Item_Visibility'], kde=True, stat='density')
plt.title('Item Visibility Distribution')
plt.xlabel('Item Visibility')
plt.ylabel('Density')
plt.show()
- Similar to the previous snippet, this code plots the distribution of the ‘Item_Visibility’ column and provides appropriate titles and labels for the plot.
Item MRP distribution:
plt.figure(figsize=(6,6))
sns.histplot(big_mart_data['Item_MRP'], kde=True, stat='density')
plt.title('Item MRP Distribution')
plt.xlabel('Item MRP')
plt.ylabel('Density')
plt.show()
- This code snippet plots the distribution of the ‘Item_MRP’ column and sets the title, x-label, and y-label accordingly.
Item_Outlet_Sales distribution:
plt.figure(figsize=(6,6))
sns.histplot(big_mart_data['Item_Outlet_Sales'], kde=True, stat='density')
plt.title('Item Outlet Sales Distribution')
plt.xlabel('Item Outlet Sales')
plt.ylabel('Density')
plt.show()
- Similar to the previous snippets, this code plots the distribution of the ‘Item_Outlet_Sales’ column with an appropriate title and labels.
Outlet_Establishment_Year column:
plt.figure(figsize=(8,6))
sns.countplot(x='Outlet_Establishment_Year', data=big_mart_data)
plt.title('Establishment Year of Outlets')
plt.xlabel('Outlet Establishment Year')
plt.ylabel('Count')
plt.show()
- This code snippet creates a count plot to visualize the distribution of outlet establishment years. It uses Seaborn’s countplot() function and sets the title, x-axis label, and y-axis label for easier interpretation.
Categorical Features
Let’s break down the code snippet for visualizing a categorical feature:
# Item_Fat_Content column
plt.figure(figsize=(6,6))
sns.countplot(x='Item_Fat_Content', data=big_mart_data)
plt.title('Item Fat Content Distribution')
plt.xlabel('Item Fat Content')
plt.ylabel('Count')
plt.show()
Explanation:
- plt.figure(figsize=(6,6)): Creates a new 6×6-inch figure using Matplotlib’s figure() function, setting up the canvas for the plot.
- sns.countplot(x='Item_Fat_Content', data=big_mart_data): Plots the count of each category in the ‘Item_Fat_Content’ column using Seaborn’s countplot() function, specifying the column for the x-axis and the DataFrame holding the data.
- plt.title('Item Fat Content Distribution'): Sets the title of the plot.
- plt.xlabel('Item Fat Content') and plt.ylabel('Count'): Set the x-axis and y-axis labels.
- plt.show(): Renders the plot; it must be called for the figure to appear.
This code snippet creates a count plot to visualize the distribution of categories in the ‘Item_Fat_Content’ column.
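The real ‘Item_Fat_Content’ column is known to contain inconsistent spellings (e.g. ‘Low Fat’, ‘low fat’, ‘LF’, ‘reg’), which show up as separate bars in this count plot. A count plot is essentially a bar chart of value_counts(); this hypothetical sketch shows the counts before and after unifying the spellings:

```python
import pandas as pd

# Hypothetical messy labels like those in the real Item_Fat_Content column
fat = pd.Series(['Low Fat', 'Regular', 'low fat', 'LF', 'Low Fat', 'reg'],
                name='Item_Fat_Content')
print(fat.value_counts().to_dict())
# {'Low Fat': 2, 'Regular': 1, 'low fat': 1, 'LF': 1, 'reg': 1}

# Unifying the spellings merges the duplicate categories before plotting
fat = fat.replace({'low fat': 'Low Fat', 'LF': 'Low Fat', 'reg': 'Regular'})
counts = fat.value_counts()
print(counts.to_dict())  # {'Low Fat': 4, 'Regular': 2}
```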
Categorical Features: Item Type Distribution
Let’s analyze the code snippet for visualizing the distribution of the ‘Item_Type’ column:
# Item_Type column
plt.figure(figsize=(30,6))
sns.countplot(x='Item_Type', data=big_mart_data)
plt.title('Item Type Distribution')
plt.xlabel('Item Type')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()
Explanation:
- plt.figure(figsize=(30,6)): Creates a wider 30×6-inch figure using Matplotlib’s figure() function; the extra width accommodates the long x-axis labels without overlap.
- sns.countplot(x='Item_Type', data=big_mart_data): Plots the count of each category in the ‘Item_Type’ column, specifying the column for the x-axis and the DataFrame holding the data.
- plt.title('Item Type Distribution'): Sets the title of the plot.
- plt.xlabel('Item Type') and plt.ylabel('Count'): Set the x-axis and y-axis labels.
- plt.xticks(rotation=90): Rotates the x-axis tick labels by 90 degrees to prevent overlap and improve readability.
- plt.show(): Renders the plot; it must be called for the figure to appear.
This code snippet creates a count plot to visualize the distribution of different item types in the ‘Item_Type’ column.