When creating a dendrogram in hierarchical clustering, the choice of **distance method** (linkage criterion) significantly influences the structure and shape of the resulting dendrogram. These distance methods determine how clusters are merged at each step of the hierarchy. Two common methods are “ward” and “average,” and here’s a brief explanation of the difference between them:

**Ward Method**

The Ward method minimizes the variance of the distances between data points within clusters. In other words, it seeks to merge clusters that lead to the smallest increase in overall within-cluster variance.

The Ward method often results in compact, balanced, and spherical clusters. It tends to merge clusters that are similar not only in their nearest neighbors but also in the internal structure of the clusters, promoting the formation of cohesive and tight clusters.

2.** Average Method**

The Average method calculates the average pairwise distance between all data points in two clusters being considered for merging. The Average method can lead to dendrograms with clusters of varying shapes and sizes. It tends to merge clusters based on their overall similarity, but it may not necessarily create clusters with uniform internal structures. This can result in more elongated or irregularly shaped clusters compared to the Ward method.

Deciding on the right number of clusters is akin to finding the perfect balance between granularity and coherence in data grouping. Too few clusters may oversimplify the data, while too many clusters can result in confusion and redundancy.

In the intricate world of clustering, one of the most formidable challenges is determining the optimal number of clusters. Fortunately, the silhouette score emerges as a guiding beacon, helping us make informed decisions about clustering results.

The silhouette score offers a quantitative measure to gauge the quality of clustering results. It evaluates how well each data point fits within its assigned cluster, taking into account both cohesion (similarity to other points in the same cluster) and separation (dissimilarity from points in other clusters).

The value of the silhouette score is between [-1, 1]

**0.71–1.0 :**`A strong structure has been found`

**0.51–0.70 :**`A reasonable structure has been found`

**0.26–0.50 :**`The structure is weak and could be artificial.`

**< 0.25 :**`No substantial structure has been found`

So, if the silhouette score is negative valued, then your clusters suck.

Now let’s do some coding!

USArrests dataset contains statistics, in arrest per 100,000 residents for assault, murder and rape in each of the 50 US States in 1973. The percentage of the population living in urban areas is also given.

`################################`

# IMPORT

################################import pandas as pd

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

from sklearn.preprocessing import MinMaxScaler

from yellowbrick.cluster import KElbowVisualizer

from scipy.cluster.hierarchy import linkage

from scipy.cluster.hierarchy import dendrogram

from sklearn.metrics import silhouette_score

from sklearn.cluster import AgglomerativeClustering

df = pd.read_csv("datasets/USArrests.csv", index_col=0)

################################

# FUNCTIONS

################################

def outlier_thresholds(dataframe, col, q1=.05, q3=.95, decimal=3):

quartile1=dataframe[col].quantile(q1)

quartile3=dataframe[col].quantile(q3)

iqr=quartile3-quartile1

low_limit= round(quartile1 - (iqr*1.5) , decimal)

up_limit= round(quartile3 + (iqr*1.5), decimal)

return low_limit , up_limit

def check_outlier(dataframe, col_name, q1=0.05, q3=0.95, lower_limit = None, upper_limit = None, print_num=False):

"""

Check for outliers in a specified column of a DataFrame.

Parameters:

dataframe (DataFrame): The DataFrame containing the data to check for outliers.

col_name (str): The name of the column in the DataFrame to analyze for outliers.

q1 (float, optional): The lower quartile percentile (default is 0.05).

q3 (float, optional): The upper quartile percentile (default is 0.95).

lower_limit (float, optional): The lower limit threshold for identifying outliers. If not provided,

it will be calculated based on the specified percentiles.

upper_limit (float, optional): The upper limit threshold for identifying outliers. If not provided,

it will be calculated based on the specified percentiles.

print_num (bool, optional): If True, the function returns the number of outliers and the lower and upper

limit thresholds as a tuple (outliers_count, (lower_limit, upper_limit)).

Default is False.

Returns:

bool or tuple: If print_num is False (default), returns True if outliers are detected in the specified

column; otherwise, returns False. If print_num is True, returns a tuple containing

the number of outliers and the lower and upper limit thresholds.

Example:

# Check for outliers in the 'Age' column of 'my_dataframe'

is_outlier = check_outlier(my_dataframe, 'Age', q1=0.1, q3=0.9)

This function checks for outliers in the specified column of a DataFrame based on quartile percentiles.

You can customize the percentiles and specify lower and upper limit thresholds. If outliers are found,

you can choose to print the number of outliers along with the threshold values.

"""

if (lower_limit != None) & (upper_limit != None):

low_limit = lower_limit

up_limit = upper_limit

else:

low_limit, up_limit = outlier_thresholds(dataframe, col_name, q1, q3)

if dataframe[(dataframe[col_name] > up_limit) | (dataframe[col_name] < low_limit)].any(axis=None):

if print_num:

return True, dataframe[(dataframe[col_name] > up_limit) | (dataframe[col_name] < low_limit)].shape[0], (low_limit,up_limit)

return True

else:

return False

def plot_dgram(dataframe, method="ward", y:float=0, truncate_mode=None, p:int=30, figsize:tuple=(15,5), interactive:bool=False):

"""

Generate and display a dendrogram plot for hierarchical clustering.

Parameters:

dataframe (DataFrame): The input data to be used for hierarchical clustering.

method (str, optional): The linkage method for hierarchical clustering. Default is "ward".

y (float, optional): The threshold distance at which to draw a horizontal line on the dendrogram.

Default is 0, indicating no threshold line.

truncate_mode (str, optional): If not None, specifies the method to truncate the dendrogram.

Options are "lastp", "level", or None (default).

p (int, optional): The number of clusters to be shown when truncate_mode="lastp".

Default is 30.

figsize (tuple, optional): The size of the plot figure. Default is (15, 5).

interactive (bool, optional): If True, enables interactive mode for plotting (Jupyter Notebook).

Default is False.

Returns:

None

Example:

plot_dgram(my_dataframe, method="ward", y=0.5, truncate_mode="lastp", p=20, figsize=(12, 6), interactive=True)

This function generates a dendrogram plot based on hierarchical clustering of the input data. You can customize

the linkage method, threshold distance, truncation mode, and other plot settings. If using Jupyter Notebook

and setting interactive=True, the plot will be interactive; otherwise, it will be displayed inline.

"""

#you can delete the 'if block' below if you dont use jupiter notebook.

if interactive:

%matplotlib widget

hc = linkage(dataframe, method = method)

plt.figure(figsize=figsize)

plt.title(f"Dendrogram (method : {method})", fontsize = 20)

plt.ylabel("Distance", fontsize=15)

plt.xlabel("Data Point", fontsize=15)

dendrogram(hc, truncate_mode=truncate_mode, p=p, show_contracted=True)

plt.axhline(y=y, color='b', linestyle='--')

plt.show(block=True)

#you can delete the line below if you dont use jupiter notebook.

%matplotlib inline

################################

# EDA

################################

print(df.head())

'''

Murder Assault UrbanPop Rape

Alabama 13.2 236 58 21.2

Alaska 10.0 263 48 44.5

Arizona 8.1 294 80 31.0

Arkansas 8.8 190 50 19.5

California 9.0 276 91 40.6

'''

#No missing value

print(df.isnull().sum())

'''

Murder 0

Assault 0

UrbanPop 0

Rape 0

dtype: int64

'''

#No categorical variable

print(df.info())

'''

Index: 50 entries, Alabama to Wyoming

Data columns (total 4 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Murder 50 non-null float64

1 Assault 50 non-null int64

2 UrbanPop 50 non-null int64

3 Rape 50 non-null float64

dtypes: float64(2), int64(2)

memory usage: 2.0+ KB

'''

#When we apply IQR analysis with 20% and 80%

#it seems there is no outliers.

for col in df.columns:

print(check_outlier(df,col,q1=0.20, q3=0.80, print_num=True))

'''

False

False

False

False

'''

################################

# Hierarchical Clustering

################################

#NOTE: You should ALWAYS apply SCALING before clustering

#since clustering algorithms works by calculating distances.

#So these distances should be in the same range!

sc = MinMaxScaler()

df = pd.read_csv("datasets/USArrests.csv", index_col=0)

df = sc.fit_transform(df)

plot_dgram(df, method="average") # IMAGE IS BELOW (dendrogram_average.png)

Now, let’s use ‘ward’ distance method and plot dendrogram again:

`sc = MinMaxScaler()`

df = pd.read_csv("datasets/USArrests.csv", index_col=0)

df = sc.fit_transform(df)

plot_dgram(df, method="ward") # IMAGE IS BELOW (dendrogram_ward.png)

Can we decide the number of clusters by looking at the dendrogram? For example, let’s look at the dendrogram plotted with the ‘ward’ method above. First of all, the length of the dendrogram is approximately equal to 3.6 because the 1st horizontal line is at that distance. If the distance between two consecutive horizontal lines is more than approximately 10% of the length of the dendrogram, a dashed line is drawn between those two consecutive horizontal lines, and the number of clusters can be decided according to how many vertical lines are intersected the dashed line.

For example, for the above plot, 10% makes 3.6/10 = 0.36. Now, there is a distance of 3.6–2.0 = 1.6 between the 1st horizontal line and the 2nd horizontal line, so a dashed line can be drawn between the 1st horizontal line and the 2nd horizontal line since 1.6 > 0.36:

`sc = MinMaxScaler()`

df = pd.read_csv("datasets/USArrests.csv", index_col=0)

df = sc.fit_transform(df)

plot_dgram(df, method="ward", y = 3.0) # image is below

Since the dashed line above intersects 2 vertical lines, the number of clusters can be 2.

Alternatively, if we calculate the distance between the 3rd horizontal line and the 4th horizontal line, we find approximately 1.6–0.9 = 0.7. Since 0.7 > 0.36, a dashed line can be drawn between the 3rd horizontal line and the 4th horizontal line:

`sc = MinMaxScaler()`

df = pd.read_csv("datasets/USArrests.csv", index_col=0)

df = sc.fit_transform(df)

plot_dgram(df, method="ward", y = 1.2) # image is below

In this case, the dashed line intersected 4 vertical lines. Therefore, the number of clusters can be selected as 4.

Now let’s try these number of clusters and see Silhouette scores!

`#Hierarchical Clustering(n=2)`

cluster = AgglomerativeClustering(n_clusters=2, linkage="ward")

df = pd.read_csv("datasets/USArrests.csv", index_col=0)

df = sc.fit_transform(df)clusters_hcluster = cluster.fit_predict(df)

#cluster_hcluster is not a dataframe, it is a numpy array.

#it contains the cluster of each row in the dataset.

print(clusters_hcluster[:5]) # [1 1 1 0 1]

#now let's calculate silhouette score.

df = pd.read_csv("datasets/USArrests.csv", index_col=0)

silhouette_avg = silhouette_score(df, clusters_hcluster)

print("Hierarchical Clustering(n=2) Silhouette Score:", silhouette_avg) # 0.540

**Hierarchical Clustering (n=2) Silhouette Score:** 0.540

Since 0.54 > 0.5, we can say that when n=2, there is a reasonable structure.

Let’s try a different number of cluster, n=4

`#Hierarchical Clustering(n=4)`

cluster = AgglomerativeClustering(n_clusters=4, linkage="ward")

df = pd.read_csv("datasets/USArrests.csv", index_col=0)

df = sc.fit_transform(df)clusters_hcluster = cluster.fit_predict(df)

#cluster_hcluster is not a dataframe, it is a numpy array.

#it contains the cluster of each row in the dataset.

print(clusters_hcluster[:5]) # [3 1 1 0 1]

#now let's calculate silhouette score.

df = pd.read_csv("datasets/USArrests.csv", index_col=0)

silhouette_avg = silhouette_score(df, clusters_hcluster)

print("Hierarchical Clustering(n=4) Silhouette Score:", silhouette_avg) # 0.195

**Hierarchical Clustering (n=4) Silhouette Score:** 0.195

Since 0.195 < 0.25 , no substantial structure has been found.

When we compare n=2 and n=4 for hierarchical clustering, we come to the conclusion that n=2 is better.

**NOTE :** *Sometimes, due to the job description, we may need to divide the data into a certain number of clusters. In such cases, even if the Silhouette Score is low, we should divide the dataset into a defined number of clusters.*

## K-Means

*Before continuing, I recommend you watch the video below:*