
When creating a dendrogram in hierarchical clustering, the choice of distance method (linkage criterion) significantly influences the structure and shape of the resulting dendrogram. These distance methods determine how clusters are merged at each step of the hierarchy. Two common methods are “ward” and “average,” and here’s a brief explanation of the difference between them:
1. Ward Method
The Ward method merges, at each step, the pair of clusters whose merger produces the smallest increase in the total within-cluster variance (the sum of squared deviations from the cluster centroids).
The Ward method often results in compact, balanced, and roughly spherical clusters. It tends to merge clusters that are similar not only in their nearest neighbors but also in their internal structure, promoting the formation of cohesive, tight clusters.
2. Average Method
The Average method uses the average pairwise distance between all data points in the two clusters being considered for merging. It can lead to dendrograms with clusters of varying shapes and sizes. It tends to merge clusters based on their overall similarity, but it does not necessarily create clusters with uniform internal structures, which can result in more elongated or irregularly shaped clusters compared to the Ward method. (A short code sketch comparing the two methods follows below.)
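To see the difference in practice, here is a minimal sketch that builds the linkage matrix with both methods on a small synthetic dataset (the data, variable names, and numbers here are purely illustrative, not from this article's example):
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(10, 2))  #10 points, 2 features, purely synthetic

for m in ["ward", "average"]:
    Z = linkage(X_toy, method=m)
    #each row of Z is [cluster_i, cluster_j, merge_distance, size_of_new_cluster]
    print(m, "-> height of the final merge:", round(Z[-1, 2], 3))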
Deciding on the right number of clusters is akin to finding the perfect balance between granularity and coherence in data grouping. Too few clusters may oversimplify the data, while too many clusters can result in confusion and redundancy.
In the intricate world of clustering, one of the most formidable challenges is determining the optimal number of clusters. Fortunately, the silhouette score emerges as a guiding beacon, helping us make informed decisions about clustering results.
The silhouette score offers a quantitative measure to gauge the quality of clustering results. It evaluates how well each data point fits within its assigned cluster, taking into account both cohesion (similarity to other points in the same cluster) and separation (dissimilarity from points in other clusters).
The silhouette score takes values in the range [-1, 1]:
- 0.71–1.00: A strong structure has been found.
- 0.51–0.70: A reasonable structure has been found.
- 0.26–0.50: The structure is weak and could be artificial.
- < 0.25: No substantial structure has been found.
So if the silhouette score is negative, your clustering is doing a poor job: on average, points are closer to other clusters than to their own.
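As a quick illustration on a made-up toy dataset (the data and labels below are mine, purely for demonstration), the score can be computed with scikit-learn:
import numpy as np
from sklearn.metrics import silhouette_score, silhouette_samples

X_toy = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],   #cluster 0
                  [8.0, 8.1], [7.9, 8.0], [8.1, 7.9]])  #cluster 1
labels = np.array([0, 0, 0, 1, 1, 1])

print(silhouette_score(X_toy, labels))    #average score over all points (close to 1 here)
print(silhouette_samples(X_toy, labels))  #per-point scores, useful for spotting badly placed points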
Now let’s do some coding!
The USArrests dataset contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. The percentage of the population living in urban areas is also given.
################################
# IMPORT
################################
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from yellowbrick.cluster import KElbowVisualizer
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
df = pd.read_csv("datasets/USArrests.csv", index_col=0)
################################
# FUNCTIONS
################################
def outlier_thresholds(dataframe, col, q1=.05, q3=.95, decimal=3):
    quartile1 = dataframe[col].quantile(q1)
    quartile3 = dataframe[col].quantile(q3)
    iqr = quartile3 - quartile1
    low_limit = round(quartile1 - (iqr * 1.5), decimal)
    up_limit = round(quartile3 + (iqr * 1.5), decimal)
    return low_limit, up_limit
def check_outlier(dataframe, col_name, q1=0.05, q3=0.95, lower_limit=None, upper_limit=None, print_num=False):
    """
    Check for outliers in a specified column of a DataFrame.

    Parameters:
        dataframe (DataFrame): The DataFrame containing the data to check for outliers.
        col_name (str): The name of the column in the DataFrame to analyze for outliers.
        q1 (float, optional): The lower quartile percentile (default is 0.05).
        q3 (float, optional): The upper quartile percentile (default is 0.95).
        lower_limit (float, optional): The lower limit threshold for identifying outliers. If not provided,
                                       it will be calculated based on the specified percentiles.
        upper_limit (float, optional): The upper limit threshold for identifying outliers. If not provided,
                                       it will be calculated based on the specified percentiles.
        print_num (bool, optional): If True and outliers are found, the function returns a tuple
                                    (True, outliers_count, (lower_limit, upper_limit)). Default is False.

    Returns:
        bool or tuple: If print_num is False (default), returns True if outliers are detected in the specified
                       column; otherwise, returns False. If print_num is True and outliers are found, returns a
                       tuple containing True, the number of outliers, and the lower and upper limit thresholds.

    Example:
        # Check for outliers in the 'Age' column of 'my_dataframe'
        is_outlier = check_outlier(my_dataframe, 'Age', q1=0.1, q3=0.9)

    This function checks for outliers in the specified column of a DataFrame based on quartile percentiles.
    You can customize the percentiles and specify lower and upper limit thresholds. If outliers are found,
    you can choose to return the number of outliers along with the threshold values.
    """
    if (lower_limit is not None) and (upper_limit is not None):
        low_limit = lower_limit
        up_limit = upper_limit
    else:
        low_limit, up_limit = outlier_thresholds(dataframe, col_name, q1, q3)
    if dataframe[(dataframe[col_name] > up_limit) | (dataframe[col_name] < low_limit)].any(axis=None):
        if print_num:
            return True, dataframe[(dataframe[col_name] > up_limit) | (dataframe[col_name] < low_limit)].shape[0], (low_limit, up_limit)
        return True
    else:
        return False
def plot_dgram(dataframe, method="ward", y: float = 0, truncate_mode=None, p: int = 30, figsize: tuple = (15, 5), interactive: bool = False):
    """
    Generate and display a dendrogram plot for hierarchical clustering.

    Parameters:
        dataframe (DataFrame): The input data to be used for hierarchical clustering.
        method (str, optional): The linkage method for hierarchical clustering. Default is "ward".
        y (float, optional): The threshold distance at which to draw a horizontal line on the dendrogram.
                             Default is 0, indicating no threshold line.
        truncate_mode (str, optional): If not None, specifies the method used to truncate the dendrogram.
                                       Options are "lastp", "level", or None (default).
        p (int, optional): The number of clusters to be shown when truncate_mode="lastp". Default is 30.
        figsize (tuple, optional): The size of the plot figure. Default is (15, 5).
        interactive (bool, optional): If True, enables interactive mode for plotting (Jupyter Notebook).
                                      Default is False.

    Returns:
        None

    Example:
        plot_dgram(my_dataframe, method="ward", y=0.5, truncate_mode="lastp", p=20, figsize=(12, 6), interactive=True)

    This function generates a dendrogram plot based on hierarchical clustering of the input data. You can customize
    the linkage method, threshold distance, truncation mode, and other plot settings. If using Jupyter Notebook
    and setting interactive=True, the plot will be interactive; otherwise, it will be displayed inline.
    """
    #You can delete the 'if' block below if you don't use Jupyter Notebook.
    if interactive:
        %matplotlib widget
    hc = linkage(dataframe, method=method)
    plt.figure(figsize=figsize)
    plt.title(f"Dendrogram (method: {method})", fontsize=20)
    plt.ylabel("Distance", fontsize=15)
    plt.xlabel("Data Point", fontsize=15)
    dendrogram(hc, truncate_mode=truncate_mode, p=p, show_contracted=True)
    plt.axhline(y=y, color='b', linestyle='--')
    plt.show(block=True)
    #You can delete the line below if you don't use Jupyter Notebook.
    %matplotlib inline
################################
# EDA
################################
print(df.head())
'''
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
'''
#No missing values
print(df.isnull().sum())
'''
Murder 0
Assault 0
UrbanPop 0
Rape 0
dtype: int64
'''
#No categorical variable
print(df.info())
'''
Index: 50 entries, Alabama to Wyoming
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Murder 50 non-null float64
1 Assault 50 non-null int64
2 UrbanPop 50 non-null int64
3 Rape 50 non-null float64
dtypes: float64(2), int64(2)
memory usage: 2.0+ KB
'''
#When we apply IQR analysis with the 20% and 80% percentiles,
#it seems there are no outliers.
for col in df.columns:
print(check_outlier(df,col,q1=0.20, q3=0.80, print_num=True))
'''
False
False
False
False
'''
################################
# Hierarchical Clustering
################################
#NOTE: You should ALWAYS apply SCALING before clustering,
#since clustering algorithms work by calculating distances.
#So all features should be on the same scale!
sc = MinMaxScaler()
df = pd.read_csv("datasets/USArrests.csv", index_col=0)
df = sc.fit_transform(df)
plot_dgram(df, method="average") # IMAGE IS BELOW (dendrogram_average.png)
Now, let’s use the ‘ward’ linkage method and plot the dendrogram again:
sc = MinMaxScaler()
df = pd.read_csv("datasets/USArrests.csv", index_col=0)
df = sc.fit_transform(df)
plot_dgram(df, method="ward") # IMAGE IS BELOW (dendrogram_ward.png)
Can we decide the number of clusters by looking at the dendrogram? For example, let’s look at the dendrogram plotted with the ‘ward’ method above. First of all, the height of the dendrogram is approximately 3.6, because the topmost (1st) horizontal line is at that distance. If the distance between two consecutive horizontal lines is more than approximately 10% of the height of the dendrogram, a dashed line can be drawn between those two horizontal lines, and the number of clusters is given by how many vertical lines the dashed line intersects.
For example, for the plot above, 10% of the height is 3.6/10 = 0.36. There is a distance of 3.6 - 2.0 = 1.6 between the 1st horizontal line and the 2nd horizontal line, so a dashed line can be drawn between them, since 1.6 > 0.36:
sc = MinMaxScaler()
df = pd.read_csv("datasets/USArrests.csv", index_col=0)
df = sc.fit_transform(df)
plot_dgram(df, method="ward", y = 3.0) # image is below
Since the dashed line above intersects 2 vertical lines, the number of clusters can be 2.
Alternatively, if we calculate the distance between the 3rd horizontal line and the 4th horizontal line, we find approximately 1.6–0.9 = 0.7. Since 0.7 > 0.36, a dashed line can be drawn between the 3rd horizontal line and the 4th horizontal line:
sc = MinMaxScaler()
df = pd.read_csv("datasets/USArrests.csv", index_col=0)
df = sc.fit_transform(df)
plot_dgram(df, method="ward", y = 1.2) # image is below
In this case, the dashed line intersects 4 vertical lines. Therefore, the number of clusters can be selected as 4.
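By the way, you do not have to read the cluster labels off the plot by eye: SciPy’s fcluster can cut the same linkage matrix at a chosen height. A minimal sketch, reusing the scaled df from above and the cut heights we just picked:
from scipy.cluster.hierarchy import linkage, fcluster

#Assumes df is the MinMax-scaled array from the cell above;
#the thresholds 3.0 and 1.2 are the heights of the dashed lines we drew.
hc_ward = linkage(df, method="ward")
labels_2 = fcluster(hc_ward, t=3.0, criterion="distance")  #cut at height 3.0
labels_4 = fcluster(hc_ward, t=1.2, criterion="distance")  #cut at height 1.2
print(len(set(labels_2)), len(set(labels_4)))  #expected: 2 and 4 clusters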
Now let’s try these numbers of clusters and check the Silhouette Scores!
#Hierarchical Clustering(n=2)
cluster = AgglomerativeClustering(n_clusters=2, linkage="ward")
df = pd.read_csv("datasets/USArrests.csv", index_col=0)
df = sc.fit_transform(df)
clusters_hcluster = cluster.fit_predict(df)
#clusters_hcluster is not a DataFrame, it is a NumPy array.
#it contains the cluster label of each row in the dataset.
print(clusters_hcluster[:5]) # [1 1 1 0 1]
#now let's calculate silhouette score.
df = pd.read_csv("datasets/USArrests.csv", index_col=0)
silhouette_avg = silhouette_score(df, clusters_hcluster)
print("Hierarchical Clustering(n=2) Silhouette Score:", silhouette_avg) # 0.540
Hierarchical Clustering (n=2) Silhouette Score: 0.540
Since 0.54 > 0.5, we can say that when n=2, there is a reasonable structure.
Let’s try a different number of clusters, n=4:
#Hierarchical Clustering(n=4)
cluster = AgglomerativeClustering(n_clusters=4, linkage="ward")
df = pd.read_csv("datasets/USArrests.csv", index_col=0)
df = sc.fit_transform(df)
clusters_hcluster = cluster.fit_predict(df)
#clusters_hcluster is not a DataFrame, it is a NumPy array.
#it contains the cluster label of each row in the dataset.
print(clusters_hcluster[:5]) # [3 1 1 0 1]
#now let's calculate silhouette score.
df = pd.read_csv("datasets/USArrests.csv", index_col=0)
silhouette_avg = silhouette_score(df, clusters_hcluster)
print("Hierarchical Clustering(n=4) Silhouette Score:", silhouette_avg) # 0.195
Hierarchical Clustering (n=4) Silhouette Score: 0.195
Since 0.195 < 0.25 , no substantial structure has been found.
When we compare n=2 and n=4 for hierarchical clustering, we come to the conclusion that n=2 is better.
NOTE: Sometimes, due to business requirements, we may need to divide the data into a specific number of clusters. In such cases, even if the Silhouette Score is low, we divide the dataset into the required number of clusters.
K-Means
While hierarchical clustering builds a hierarchy of clusters, k-means clustering aims to partition data into a predefined number of clusters (k) based on the similarity of data points.
It starts by randomly selecting k cluster centers and assigns each data point to the nearest center. Then, it recalculates the cluster centers as the mean of all data points within each cluster and repeats the assignment and recalculation process until convergence.
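To make those two steps concrete, here is a bare-bones sketch of the algorithm in plain NumPy (random initialization, no k-means++ and no handling of empty clusters, so it is for intuition only; the function and variable names are mine, not from any library):
import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    #Illustrative sketch of Lloyd's algorithm; use sklearn's KMeans in practice.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  #1) pick k random points as initial centers
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  #point-to-center distances
        labels = dists.argmin(axis=1)                        #2) assign each point to its nearest center
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])  #3) recompute centers
        if np.allclose(new_centers, centers):                #4) stop when the centers no longer move
            break
        centers = new_centers
    return labels, centers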
The quality of the initial centroids (cluster centers) can significantly impact the algorithm’s convergence and the quality of the final clustering. This is where “k-means++” initialization comes into play.
K-Means++ : K-Means++ initialization tends to produce more stable and better-performing results compared to purely random initialization, especially when the dataset is complex or the number of clusters (k) is relatively large.
random_state: How can I control the randomness in KMeans?
You can use the random_state parameter to control the randomness in KMeans. Since K-Means involves random initialization of centroids (either with “k-means++” or purely at random), setting random_state allows you to reproduce the same results across different runs of the algorithm. By default, random_state is set to None, so the algorithm uses a different random seed each time it runs, resulting in potentially different centroid initializations and, consequently, different clustering outcomes with each run. Specifying a random_state is therefore useful when you want to control the randomness, ensure reproducibility of results, compare different clustering approaches, or debug issues in your code.
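In scikit-learn, both the initialization strategy and the seed are just constructor arguments. A quick sketch:
from sklearn.cluster import KMeans

#"k-means++" is the default initialization; "random" picks the initial centroids purely at random.
km_pp = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=17)
km_rand = KMeans(n_clusters=2, init="random", n_init=10, random_state=17)
#With the same random_state (and the same data), refitting either model reproduces the same labels.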
Within-Cluster Sum of Squares (WCSS)
WCSS is a metric used to evaluate the performance of a K-Means clustering model. It quantifies the compactness of clusters by measuring the sum of squared distances between data points and their assigned cluster centroids. WCSS tends to decrease as the number of clusters (k) increases because with more clusters, data points are assigned to clusters with centroids that are closer to them. Essentially, smaller clusters can better approximate the data points, resulting in smaller squared distances and lower WCSS values.
As can be seen in the graph above, WCSS approaches zero as the number of clusters increases. In other words, if you create as many clusters as there are data, your WCSS value will be equal to zero because each point is exactly at its cluster centroid. However, this defeats the purpose of clustering, as it does not provide any meaningful grouping or simplification of the data. Also, it often leads to overfitting. The goal of clustering is to find patterns and structures within data while reducing its complexity. Therefore, our aim here should be to decide the most appropriate number of clusters in the range [1, number of data].
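You can reproduce such a curve yourself, since a fitted KMeans model stores its WCSS in the inertia_ attribute. A minimal sketch, assuming df is the MinMax-scaled USArrests array used elsewhere in this post:
#Plotting WCSS (inertia_) against the number of clusters by hand.
wcss = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=17).fit(df)
    wcss.append(km.inertia_)  #inertia_ = sum of squared distances to the closest centroid

plt.plot(k_values, wcss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS")
plt.show()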
The Elbow Method
When using KMeans, we can decide the most appropriate number of clusters with the Elbow method. The Elbow method is a heuristic technique used to determine the optimal number of clusters for K-Means. It involves plotting the WCSS values for different values of k (typically ranging from 1 to a maximum number of clusters you want to explore) and looking for the “elbow” point in the graph.
################################
# K-Means Clustering
################################
#KMeans(n=2)
###############
df = pd.read_csv("datasets/USArrests.csv", index_col=0)
df = sc.fit_transform(df)
kmeans = KMeans(n_clusters=2, random_state=17).fit(df)
#Default parameters
kmeans.get_params()
'''
{'algorithm': 'lloyd',
'copy_x': True,
'init': 'k-means++',
'max_iter': 300,
'n_clusters': 2,
'n_init': 'warn',
'random_state': 17,
'tol': 0.0001,
'verbose': 0}
'''
#see number of clusters
print(kmeans.n_clusters) # 2
#see the center of each cluster. remember, our dataset has 4 features,
#so each cluster center has one coordinate per feature.
print(kmeans.cluster_centers_)
'''
[[0.24518072 0.23778539 0.53615819 0.22334195]
[0.68463855 0.72003425 0.61694915 0.56498708]]
'''
#these are the predicted cluster labels for each row.
print(kmeans.labels_)
'''
[1 1 1 0 1 1 0 0 1 1 0 0 1 0 0 0 0 1 0 1 0 1 0 1 1 0 0 1 0 0 1 1 1 0 0 0 0
0 0 1 0 1 1 0 0 0 0 0 0 0]
'''
#Silhouette Score of Kmeans (n=2)
clusters_kmeans = kmeans.labels_
df = pd.read_csv("datasets/USArrests.csv", index_col=0)
silhouette_avg = silhouette_score(df, clusters_kmeans)
print("Silhouette Score of KMeans (n=2):", silhouette_avg) # 0.540
Silhouette Score of KMeans (n=2): 0.540
Since 0.54 > 0.5, we can say that when n=2, there is a reasonable structure.
Remember, we obtained exactly the same Silhouette Score for hierarchical clustering (n=2).
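Equal scores do not by themselves guarantee that the two methods produced the same grouping. If you are curious, you can compare the two label assignments with the adjusted Rand index; a quick sketch, assuming clusters_hcluster still holds the n=2 hierarchical labels from the earlier cell (re-run it if you have overwritten them with the n=4 run):
from sklearn.metrics import adjusted_rand_score

#Compare the hierarchical (n=2) and K-Means (n=2) label assignments.
print(adjusted_rand_score(clusters_hcluster, clusters_kmeans))  #1.0 would mean identical partitions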
Now, let’s try a different number of clusters, n=4:
#KMeans(n=4)
###############
df = pd.read_csv("datasets/USArrests.csv", index_col=0)
df = sc.fit_transform(df)
kmeans = KMeans(n_clusters=4, random_state=17).fit(df)

#Silhouette Score of Kmeans (n=4)
clusters_kmeans = kmeans.labels_
df = pd.read_csv("datasets/USArrests.csv", index_col=0)
silhouette_avg = silhouette_score(df, clusters_kmeans)
print("Silhouette Score of KMeans (n=4):", silhouette_avg) # 0.233
Silhouette Score of KMeans (n=4): 0.233
Since 0.23 < 0.25, no substantial structure has been found.
Lastly, let’s try to find the number of clusters using the Elbow method, and validate it using the Silhouette Score.
#Elbow Method
####################
df = pd.read_csv("datasets/USArrests.csv", index_col=0)
df = sc.fit_transform(df)
kmeans = KMeans(random_state=17)
#k=(2, 30): the visualizer will try each number of clusters in this range.
elbow = KElbowVisualizer(kmeans, k=(2, 30))
elbow.fit(df)
elbow.show() #IMAGE IS BELOW (elbow.png)
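If you prefer to read the suggested k programmatically instead of off the plot, the fitted visualizer stores it (as far as I know, in the elbow_value_ attribute):
#Reading the suggested number of clusters from the fitted visualizer.
print(elbow.elbow_value_)  #10 in this run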
According to the Elbow method, the number of clusters should be 10. But remember, the Elbow method is a heuristic, meaning it does not give us the globally optimal result. That’s why we should be skeptical and check the Silhouette Score:
#KMeans (n=10)
###########################
df = pd.read_csv("datasets/USArrests.csv", index_col=0)
df = sc.fit_transform(df)
kmeans = KMeans(n_clusters=10, random_state=17).fit(df)
#Silhouette Score of Kmeans (n=10)
clusters_kmeans = kmeans.labels_
df = pd.read_csv("datasets/USArrests.csv", index_col=0)
silhouette_avg = silhouette_score(df, clusters_kmeans)
print("Silhouette Score of KMeans (n=10):", silhouette_avg) # -0.074
Silhouette Score of KMeans (n=10): -0.074
Since it is a negative value, it seems that no meaningful structure was captured with 10 clusters.
Therefore, we should not rely too heavily on the number of clusters obtained from the Elbow method and should definitely look at the Silhouette Score as well. Typically, it is safer to judge the number of clusters from the dendrogram rather than from the Elbow method.
Thanks for reading…