![](https://crypto4nerd.com/wp-content/uploads/2023/01/0Ji3Yvb2oIKe8Hr3U.png)
This project was originally published on Kaggle as well as GitHub.
Customer segmentation helps businesses categorize their customers and target marketing at the most suitable groups, so they spend less money, time, and effort. Categorizing customers is therefore the task of this project. Defining customer profiles is important: it allows a business to prepare campaigns and deliver them to the right customers. I tried to determine the correct customer profiles by using clustering algorithms.
What Is Customer Segmentation?
Customer segmentation is a marketing strategy in which select groups of consumers are identified so that certain products or product lines can be presented to them in a way that appeals to their interests.
Types of Customer segmentation
There are four major types of customer segmentation. These are explained below —
- Geographic segmentation: This type of segmentation is based on geographical location, such as region, city, or country. This type of segmentation can be useful for identifying regional differences in customer preferences and needs.
- Demographic segmentation: This type of segmentation is based on demographic characteristics such as age, gender, income, education level, and occupation. This type of segmentation is useful for identifying specific groups of customers with similar needs and characteristics.
- Psychographic segmentation: This type of segmentation is based on customers’ personality, values, and lifestyle. This type of segmentation can help identify customers with similar interests and lifestyles, which can be useful for creating targeted marketing campaigns.
- Behavioral segmentation: This type of segmentation is based on customers’ behaviors, such as purchasing habits, loyalty, and usage patterns. This type of segmentation can help identify customers who are most likely to make repeat purchases or who are most likely to be loyal to a brand.
Customer segmentation recognizes that not all customers have the same interests, purchasing power, or consumer needs. It is important because, instead of catering to all prospective clients broadly, it strives to make a company’s marketing endeavors more strategic and refined. By developing specific plans for specific products with target audiences in mind, a company can increase its chances of generating sales and be more efficient with its resources.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering
import warnings
warnings.filterwarnings(action='ignore')
import os
print(os.listdir("../input"))
Here, we import several libraries commonly used for data analysis and visualization.
- NumPy and pandas are used for data manipulation and handling.
- Matplotlib and seaborn are used for creating plots and visualizations.
- Scikit-learn’s KMeans class is used for k-means clustering, a type of unsupervised learning.
- SciPy’s linkage and dendrogram functions are used for hierarchical clustering, another type of unsupervised learning.
- Scikit-learn’s AgglomerativeClustering class is also used for hierarchical clustering.
Additionally, we suppress warnings and import the os library to access the input directory one level above the current working directory. The final line uses the os library to print out the contents of that directory.
data = pd.read_csv('../input/clustering-data/Mall_Customers.csv')
data.head()
We use the pandas library to read a CSV file called “Mall_Customers.csv” from the “input/clustering-data” directory located one level above the current working directory, and store the resulting DataFrame in a variable called “data.” The “head()” method then displays the first five rows of the data.
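If you want to try the snippets in this article without the Kaggle input directory, a tiny hypothetical stand-in with the same columns as Mall_Customers.csv can be built by hand (the column names match the dataset used here; the values are made up):

```python
import pandas as pd

# Hypothetical mini-frame mirroring the Mall_Customers.csv schema
sample = pd.DataFrame({
    'CustomerID': [1, 2, 3],
    'Gender': ['Male', 'Female', 'Female'],
    'Age': [19, 21, 20],
    'Annual Income (k$)': [15, 15, 16],
    'Spending Score (1-100)': [39, 81, 6],
})
print(sample.head())
```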
def feature_count(data):
    total_missing = data.isnull().sum().sort_values(ascending=False)
    percent_missing = (data.isnull().sum() /
                       data.shape[0] * 100).sort_values(ascending=False)
    missing = pd.concat([total_missing, percent_missing], axis=1,
                        keys=['Total', 'Percent'])
    total_data = data.count()
    tt = pd.DataFrame(total_data)
    tt.columns = ['Total']
    uniques = []
    for col in data.columns:
        unique = data[col].nunique()
        uniques.append(unique)
    tt['Uniques'] = uniques
    unique = tt
    return pd.concat([missing, unique], axis=1, keys=['Missing', 'UNIQUE'])
The code is a function called “feature_count” that takes in a single parameter “data” which is a pandas dataframe. The function performs several operations on the dataframe:
- It calculates the total number of missing values in each column of the dataframe and sorts them in descending order using the “isnull()” and “sum()” methods. It also calculates the percentage of missing values in each column by dividing the total number of missing values by the total number of rows in the dataframe and sorts them in descending order. These values are stored in variables called “total_missing” and “percent_missing” respectively.
- It then creates a new variable called “missing” which is a dataframe that concatenates “total_missing” and “percent_missing” side by side using the “concat()” method. The columns are labeled “Total” and “Percent” respectively.
- It then calculates the total number of non-missing values in each column of the dataframe using the “count()” method and stores it in a variable called “total_data”. It then creates a new dataframe called “tt” which has a single column “Total” and the values from “total_data”.
- It then creates an empty list called “uniques” and loops through each column in the dataframe using a for loop. For each column, it calculates the number of unique values in the column using the “nunique()” method and appends it to the “uniques” list. It then adds a new column called “Uniques” to the “tt” dataframe with the values from the “uniques” list.
- It then creates a new variable called “unique” which is equal to the “tt” dataframe.
- Finally, it returns a new dataframe which concatenates the “missing” and “unique” dataframes side by side using the “concat()” method. The columns are labeled “Missing” and “UNIQUE” respectively.
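To see the shape of the summary this produces, here is the same logic condensed and applied to a hypothetical toy frame with one missing value (the columns “a” and “b” are made up for illustration):

```python
import numpy as np
import pandas as pd

def feature_count(data):
    # totals and percentages of missing values per column
    total_missing = data.isnull().sum().sort_values(ascending=False)
    percent_missing = (data.isnull().sum() / data.shape[0] * 100).sort_values(ascending=False)
    missing = pd.concat([total_missing, percent_missing], axis=1, keys=['Total', 'Percent'])
    # non-missing counts and unique-value counts per column
    tt = pd.DataFrame(data.count())
    tt.columns = ['Total']
    tt['Uniques'] = [data[col].nunique() for col in data.columns]
    return pd.concat([missing, tt], axis=1, keys=['Missing', 'UNIQUE'])

toy = pd.DataFrame({'a': [1, 2, np.nan, 4], 'b': ['x', 'x', 'y', 'y']})
summary = feature_count(toy)
print(summary)
```

Column “a” shows 1 missing value (25%) and 3 uniques; column “b” shows 0 missing and 2 uniques, under MultiIndex headers “Missing” and “UNIQUE”.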
feature_count(data)
Calling the custom function feature_count
data.info()
The code calls the info() method on the data object. info() is a built-in pandas method used to get a concise summary of the dataframe, including the number of rows, the number of columns, the column data types, and the memory usage of the dataframe. This method can be used to quickly check the basic structure and contents of a dataframe without needing to manually print or inspect each individual element.
print(pd.isnull(data).sum().sum())
data.describe().T
The first line of code uses the pandas library’s “isnull()” function to check for null (missing) values in the “data” variable. The “.sum()” function is then called twice: the first call totals the null values in each column, and the second adds those column totals into a single overall count. This number is then printed using the “print()” function.
The second line of code is using the pandas library’s “describe()” function to generate some basic statistics about the data in the “data” variable. The “.T” at the end of the line is transposing the output, meaning it will change the rows and columns of the output. This is useful for displaying the statistics in a more readable format. The output of this command will provide information such as the mean, standard deviation, minimum, maximum, and quartiles of each column in the data.
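A minimal sketch of both calls on a hypothetical two-column frame with two missing cells:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [4.0, 5.0, np.nan]})

total_nulls = pd.isnull(df).sum().sum()   # per-column sums, then a grand total
stats = df.describe().T                   # transposed: one row per column
print(total_nulls)
print(stats)
```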
data.corr()
Showing the correlations between the columns of the data. (In recent pandas versions, data.corr() raises an error on non-numeric columns, so you may need to pass numeric_only=True or drop the “Gender” column first.)
plt.figure(figsize=(10,7))
sns.heatmap(data.corr(), annot=True,cmap='viridis')
plt.show()
This code is creating a heatmap visualization of the correlation values between different variables in the data set. The code is using the library matplotlib (plt) and seaborn (sns) to create the visualization.
The first line, “plt.figure(figsize=(10,7))”, is setting the size of the visualization to be 10 inches wide and 7 inches tall.
The second line, “sns.heatmap(data.corr(), annot=True,cmap=’viridis’)”, is creating the heatmap using the seaborn library. The “data.corr()” part of the code is calculating the correlation values between the variables in the data set. The “annot=True” part of the code is displaying the correlation values in the cells of the heatmap. The “cmap=’viridis’” part of the code is setting the color scheme of the heatmap to be the viridis color map.
The final line, “plt.show()”, is displaying the visualization.
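A quick sketch of what data.corr() computes, on a hypothetical numeric frame in which two columns are perfectly linearly related:

```python
import pandas as pd

# 'score' is exactly 3 * 'income' - 6 here, so their correlation is 1.0
df = pd.DataFrame({'income': [15, 16, 17, 18],
                   'score':  [39, 42, 45, 48],
                   'age':    [19, 21, 20, 23]})
corr = df.corr()
print(corr)
```

The result is a square matrix (one row and column per numeric variable) of pairwise Pearson correlations, which is exactly what the heatmap colors encode.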
sns.set_style(style='whitegrid')
plt.figure(figsize=(8,5))
sns.countplot(x='Gender', data=data)
plt.show()
This code uses the Seaborn library to create a bar plot of the count of observations in a dataset by the variable “Gender”.
- The first line, “sns.set_style(style=’whitegrid’)” sets the style of the plot to have a white background with gridlines.
- The second line, “plt.figure(figsize=(8,5))” creates a new figure with a width of 8 inches and a height of 5 inches.
- The third line uses “sns.countplot” to create a bar plot of the number of observations in the “data” dataset for each value of the “Gender” variable.
- The last line, “plt.show()” displays the plot.
This code is used to visualize the distribution of the gender of the data which is available in the dataset.
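The numbers behind such a count plot can be obtained directly with value_counts(); a sketch on a hypothetical Gender column:

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male', 'Female']})
counts = df['Gender'].value_counts()   # the numbers a countplot draws as bars
print(counts)
```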
labels = ['Male','Female']
sizes = [data.query('Gender == "Male"').Gender.count(),
data.query('Gender == "Female"').Gender.count()]
#colors
colors = ['#ffdaB9','#66b3ff']
#explsion
explode = (0.05,0.05)
plt.figure(figsize=(8,8))
my_circle=plt.Circle( (0,0), 0.7, color='white')
plt.pie(sizes, colors = colors, labels=labels, autopct='%1.1f%%',
startangle=90, pctdistance=0.85,explode=explode)
p=plt.gcf()
plt.axis('equal')
p.gca().add_artist(my_circle)
plt.show()
This code creates a donut-style pie chart to display the gender breakdown of the data set. The labels for the chart are set to “Male” and “Female”, and the sizes of the slices are determined by counting the number of males and females in the data set using the query function. The slice colors are set to a pale peach and a light blue. The explode tuple is set to (0.05, 0.05), which adds a slight separation between the two slices. The chart is created using the plt.pie function, with parameters such as autopct, startangle, and pctdistance controlling the percentage labels. The axes are made equal so the pie renders as a circle, and a white circle is added in the middle to turn the pie into a donut. Finally, the chart is displayed using the plt.show() function.
plt.figure(figsize=(20,10))
sns.countplot(x='Age', data=data)
plt.xlabel("Age")
plt.ylabel("Person Count")
plt.show()
This code is creating a graph using the Python library matplotlib (plt) and seaborn (sns). The first line is creating a new figure with a specific size (20 inches wide and 10 inches tall). The next line is creating a countplot using the data provided and the age column. The x-axis is labeled “Age” and the y-axis is labeled “Person Count”. Finally, the code is displaying the graph using the plt.show() function. This graph is showing the count of people in different age groups.
plt.figure(figsize=(20,7))
gender = ['Male', 'Female']
for i in gender:
    plt.scatter(x='Age', y='Annual Income (k$)',
                data=data[data['Gender']==i], s=200, alpha=0.5,
                label=i)
plt.legend()
plt.xlabel("Age")
plt.ylabel("Annual Income (k$)")
plt.title("Annual Income according to Age")
plt.show()
This code is creating a scatter plot to visualize the relationship between age and annual income for two groups: Male and Female.
The first line, plt.figure(figsize=(20,7)), sets the size of the plot to 20×7 inches.
The next two lines, gender = [‘Male’, ‘Female’] and the for loop, iterate through the two groups (Male and Female) and create a scatter plot for each group with the following parameters:
- x-axis is set to ‘Age’
- y-axis is set to ‘Annual Income (k$)’
- data used is filtered by the current group (‘data[data[‘Gender’]==i]’)
- marker size is set to 200
- marker transparency is set to 0.5
- each group is labeled with their respective group name (‘label = i’)
The plt.legend() line creates a legend for the plot, using the labels specified in the for loop.
The next two lines, plt.xlabel(“Age”) and plt.ylabel(“Annual Income (k$)”), add labels for the x and y axis.
The plt.title(“Annual Income according to Age”) line adds a title for the plot.
Finally, plt.show() displays the plot.
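The filtering step data[data['Gender']==i] is a plain boolean mask; a minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Male'],
                   'Age': [19, 21, 35]})

mask = df['Gender'] == 'Male'   # boolean Series, True where the row matches
males = df[mask]                # keeps only the matching rows
print(males)
```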
plt.figure(figsize=(20,7))
gender = ['Male', 'Female']
for i in gender:
    plt.scatter(x='Age', y='Spending Score (1-100)',
                data=data[data['Gender']==i], s=200, alpha=0.5,
                label=i)
plt.legend()
plt.xlabel("Age")
plt.ylabel("Spending Score (1-100)")
plt.title("Spending Score according to Age")
plt.show()
This code is creating a scatter plot that displays the spending score (1–100) of individuals according to their age, separated by gender.
- The first line creates a new figure with a specified size (20×7 inches).
- The next line creates a list of the two genders, “Male” and “Female”.
- The for loop then iterates through the list of genders, and for each gender, it creates a scatter plot using the data from the variable “data” where the gender matches the current iteration. The x-axis is set to “Age” and the y-axis is set to “Spending Score (1–100)”. The size of the dots is set to 200, the transparency (alpha) is set to 0.5, and the label is set to the current gender iteration.
- The next line creates a legend for the plot using the labels set in the for loop.
- The following two lines label the x and y axis of the plot respectively.
- The final two lines set the title of the plot to “Spending Score according to Age” and display the plot.
So the overall purpose of the code is to show the relationship between the spending score of a person and their age and how it differs between Male and Female.
plt.figure(figsize=(20,7))
gender = ['Male', 'Female']
for i in gender:
    plt.scatter(x='Annual Income (k$)', y='Spending Score (1-100)',
                data=data[data['Gender']==i], s=200, alpha=0.5,
                label=i)
plt.legend()
plt.xlabel("Annual Income (k$)")
plt.ylabel("Spending Score (1-100)")
plt.title("Spending Score according to Annual Income")
plt.show()
This code uses the matplotlib library (plt) to create a scatter plot that shows the relationship between annual income and spending score for two groups: Male and Female.
First, it creates a new figure with a specified size (20 inches wide, 7 inches tall).
Next, it creates a list of the two groups (Male and Female) and uses a for loop to iterate through each group. For each group, it uses the plt.scatter function to plot the data on the x-axis as annual income and the y-axis as spending score. It filters the data to only include the data for the current group being iterated through (i.e. data where the Gender column equals “Male” or “Female”). The size of the points on the scatter plot is set to 200 and the alpha (transparency) is set to 0.5. It also sets the label for each group on the legend.
After the for loop, it adds a legend to the plot, labels the x-axis as “Annual Income (k$)” and the y-axis as “Spending Score (1–100)”, and sets the title of the plot as “Spending Score according to Annual Income”.
Finally, it uses plt.show() to display the plot.
KMeans Clustering
#define k value
wcss = []
data_model = data.drop(['Gender','CustomerID'],axis=1)
for k in range(1,15):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(data_model)
    wcss.append(kmeans.inertia_)  # the best value is the elbow value; here it's 5
plt.figure(figsize=(15,5))
plt.plot(range(1,15),wcss)
plt.xlabel("number of k (cluster) value")
plt.ylabel("wcss")
plt.show()
The code is using the KMeans algorithm to determine the optimal number of clusters (k value) for a given dataset (data). The dataset has columns for Gender and CustomerID, which are dropped using the .drop() method with the axis=1 parameter indicating that the columns are being dropped.
The variable wcss is created and initialized as an empty list. The for loop iterates through the values 1 to 14 (range(1,15) excludes the endpoint) for the number of clusters (k value). For each iteration, a KMeans object is created with the current iteration’s k value passed as the n_clusters parameter. The fit method is then called on the KMeans object to fit the model to the data. The inertia_ attribute of the KMeans object is then appended to the wcss list.
After the loop completes, the wcss list contains the within-cluster sum of squares (WCSS) for each iteration’s k value. The WCSS is a measure of the variance within each cluster and is used to evaluate the quality of a clustering.
A plot is then created using the matplotlib library, with the x-axis representing the k value and the y-axis representing the WCSS. The elbow value is the point on the plot where the WCSS begins to decrease at a slower rate, indicating that additional clusters are not providing significant additional value. In this case, the elbow value is at k=5. The plot is then displayed using plt.show().
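The elbow computation can be sketched in a self-contained way on synthetic data (make_blobs stands in for the mall dataset here; n_init and random_state are set explicitly for reproducibility):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for data_model: 200 points drawn around 5 centers
X, _ = make_blobs(n_samples=200, centers=5, random_state=42)

wcss = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)

# WCSS shrinks as k grows; the "elbow" is where the drop flattens out
print(wcss)
```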
#create model
kmeans = KMeans(n_clusters=5)
data_predict = kmeans.fit_predict(data_model)

plt.figure(figsize=(15,10))
plt.scatter( x = 'Annual Income (k$)' ,y = 'Spending Score (1-100)',
data = data_model , c = data_predict , s = 200 )
plt.xlabel("Annual Income (k$)")
plt.ylabel("Spending Score (1-100)")
plt.show()
The code is creating a KMeans model with 5 clusters and using the fit_predict method to fit the data_model to the model and predict the clusters for each data point. It then creates a scatter plot of the data_model with the x-axis as “Annual Income (k$)” and the y-axis as “Spending Score (1–100)” and uses the data_predict variable as the colors for the data points. The figure size is set to 15×10 and the scatter points have a size of 200. The x and y axis labels are set to “Annual Income (k$)” and “Spending Score (1–100)” respectively and the plot is displayed using the plt.show() function.
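A self-contained sketch of fit_predict on synthetic data (make_blobs stands in for data_model):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for data_model: 150 points around 5 well-separated centers
X, _ = make_blobs(n_samples=150, centers=5, cluster_std=0.8, random_state=0)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # one integer cluster id per row
```

These integer labels are what the scatter plot above passes to the c parameter to color each point by its cluster.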
Hierarchical clustering
merg = linkage(data_model,method="ward")
plt.figure(figsize=(25,10))
dendrogram(merg,leaf_rotation = 90)
plt.xlabel("data points")
plt.ylabel("euclidean distance")
plt.show()
This code is using the linkage() and dendrogram() functions from the scipy library to perform hierarchical clustering on a dataset (stored in the variable “data_model”) and visualize the resulting dendrogram.
The linkage() function is used to calculate the linkage matrix, which encodes the hierarchical clustering information. The method used is “ward”, which minimizes the variance of the distances between the clusters being merged.
The dendrogram() function is then used to visualize the linkage matrix as a dendrogram. The leaf_rotation argument is set to 90, which rotates the leaves of the dendrogram by 90 degrees.
The plt.figure() function is used to set the size of the dendrogram plot, and plt.xlabel() and plt.ylabel() are used to add labels to the x and y axes of the plot. Finally, plt.show() is used to display the plot on the screen.
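A minimal sketch of what linkage() actually returns: for n observations it produces an (n-1)×4 matrix, one row per merge. Here on hypothetical random points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))   # 10 hypothetical 2-D points

merg = linkage(X, method="ward")
# merg has 9 rows (one per merge) and 4 columns:
# [cluster id 1, cluster id 2, merge distance, size of the new cluster]
print(merg.shape)
```

The dendrogram function draws this matrix: row order gives the merge sequence, and the third column gives the heights of the links.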
#create model
hiyerartical_cluster = AgglomerativeClustering(n_clusters=5,
                                               affinity="euclidean",
                                               linkage="ward")
data_predict = hiyerartical_cluster.fit_predict(data_model)
plt.figure(figsize=(15,10))
plt.scatter( x = 'Annual Income (k$)' ,y = 'Spending Score (1-100)' ,
data = data_model , c = data_predict , s = 200 )
plt.show()
This code is creating an instance of the AgglomerativeClustering class from the sklearn library with the following parameters:
- n_clusters = 5: this sets the number of clusters that the algorithm will create.
- affinity = “euclidean”: this sets the distance metric that the algorithm will use to calculate the similarity between data points, in this case the Euclidean distance. (In recent scikit-learn versions this parameter has been renamed to “metric”, and ward linkage only supports the Euclidean distance, which is the default, so it can be omitted.)
- linkage = “ward”: this sets the linkage criterion that the algorithm will use to merge clusters. In this case, it is using Ward linkage, which minimizes the variance within the merged clusters.
The next line, “data_predict = hiyerartical_cluster.fit_predict(data_model)”, is fitting the clustering model to the “data_model” data and predicting the cluster assignments for each data point.
The following lines are creating a scatter plot of the data with the x-axis being “Annual Income (k$)” and the y-axis being “Spending Score (1–100)”. The points on the plot will be colored based on the cluster assignment predicted by the algorithm (c = data_predict) and will have a size of 200 (s = 200). The plt.show() command is displaying the plot.
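A self-contained sketch of the same call on synthetic data (the affinity parameter is omitted here, since ward linkage uses the Euclidean distance by default in current scikit-learn):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=5, cluster_std=0.7, random_state=1)

model = AgglomerativeClustering(n_clusters=5, linkage="ward")
labels = model.fit_predict(X)   # one cluster id (0-4) per row
```

Unlike KMeans, agglomerative clustering has no predict method for new points; fit_predict simply cuts the merge tree at 5 clusters and returns the resulting labels.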