![](https://crypto4nerd.com/wp-content/uploads/2023/07/1pKqF-umYLG_Kp2B2Md1zDw-1024x869.png)
Hierarchical clustering groups the elements of a dataset into a hierarchical structure based on their mutual similarity, measured with a distance or similarity metric. It is used when the number of clusters is not known in advance and when the goal is to understand hierarchical structure in the data. It is also advantageous for visualization, because groupings and sub-groupings in the dataset can be clearly observed in the dendrogram.
Hierarchical clustering is based on two main approaches:
a) Agglomerative Clustering: In this approach, each data item initially forms its own cluster. Then, using a similarity measure, the two closest clusters are merged into a larger cluster. This merging process continues until all data items belong to a single large cluster.
The working logic of the algorithm can be summarized with the following steps:
a) Initially, each item is considered a cluster by itself.
b) A similarity or distance matrix is created between all items. This matrix shows the distance or similarity between each pair of items.
c) The two closest clusters (or elements) are found and merged to form a new cluster. Depending on the variant of the clustering method, different linkage criteria can be used in this step. The most common linkage methods are:
– Ward’s method: The two clusters whose merger increases the total within-cluster sum of squared errors the least are combined.
– Complete linkage: Based on the distance of the furthest items between two sets.
– Single linkage: The distance of the closest items between two clusters is taken as a basis.
– Average linkage: Based on the average distance of all items between two clusters.
d) Step c) is repeated until only one cluster remains in the dataset. This process produces a clustering structure in which items are grouped into a hierarchical tree.
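The steps above can be sketched in plain Python. The following is a minimal, unoptimized sketch of agglomerative clustering with single linkage; the function name and the returned merge-history format are illustrative choices, not part of any library:

```python
import numpy as np

def agglomerative_single_linkage(points):
    """Naive agglomerative clustering with single linkage.

    Returns the merge history: a list of (cluster_a, cluster_b, distance)
    tuples, where each cluster is a frozenset of point indices.
    """
    # Step a) every point starts as its own cluster
    clusters = [frozenset([i]) for i in range(len(points))]
    # Step b) pairwise Euclidean distance matrix between all points
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    merges = []
    # Step d) repeat until a single cluster remains
    while len(clusters) > 1:
        # Step c) find the two closest clusters; under single linkage the
        # cluster distance is the minimum pairwise point distance
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dists[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
history = agglomerative_single_linkage(pts)
print(history[0])  # the two nearest points merge first
```

Swapping the `min` in step c) for `max` or a mean gives complete and average linkage, respectively. In practice SciPy's optimized `linkage` function, used later in this article, should be preferred.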
The resulting hierarchical clustering structure is often visualized as a dendrogram. A dendrogram is a tree chart showing step-by-step cluster aggregation operations.
b) Divisive Clustering: In this approach, the entire dataset initially forms a single large cluster. The items are then split into sub-clusters based on their similarity, and the splitting continues so that each sub-cluster groups the most similar items together.
Both approaches create a tree structure, which is why hierarchical clustering is also associated with tree representations. This tree shows how the items in the dataset are grouped and how similarity changes at each level of the hierarchy.
A distance matrix is generally used for hierarchical clustering. The distance matrix contains the distances (measures of dissimilarity) between all pairs of data points, representing the similarities or differences between objects. The distance between two points is usually calculated with metrics such as the Euclidean distance, Manhattan distance, Mahalanobis distance, or a correlation-based distance.
In this application we will use the Euclidean distance, which is defined as:

d = √((x₂ − x₁)² + (y₂ − y₁)²)

where d represents the Euclidean distance between two points, and (x₁, y₁) and (x₂, y₂) are the coordinates of those points.
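As a quick sanity check, the formula can be evaluated directly with NumPy; the two example points below are arbitrary:

```python
import numpy as np

# Euclidean distance between (x1, y1) = (1, 2) and (x2, y2) = (4, 6),
# matching the formula d = sqrt((x2 - x1)**2 + (y2 - y1)**2)
p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])
d = np.sqrt(np.sum((p2 - p1) ** 2))
print(d)  # 5.0 — the classic 3-4-5 right triangle

# np.linalg.norm computes the same quantity
assert np.isclose(d, np.linalg.norm(p2 - p1))
```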
This study explains how to build the application project step by step using the Python programming language. First, the requirements of the project were carefully reviewed and definitions were made in line with its purpose. Definitions, functions, variables, and classes were then coded in Python, using the NumPy, Matplotlib, SciPy, and Scikit-Learn libraries. Whenever bugs or malfunctioning functionality were detected, the code was corrected and improved.
Libraries that will help analyze the data are included in the project first.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram
Six random data points are generated.
np.random.seed(42)
data_points = np.random.rand(6, 2)
The dataset is visualized with the following lines of code.
colors = ['r', 'g', 'b', 'c', 'm', 'y']
plt.figure(figsize=(8, 6))
for i in range(len(data_points)):
    plt.scatter(data_points[i, 0], data_points[i, 1], color=colors[i], label=f'Data Point {i+1}')
plt.legend()
plt.title('Data Points')
plt.xlabel('X')
plt.ylabel('Y')
plt.grid(True)
plt.show()
The result obtained when the dataset is visualized is as shown in Figure 1 (the graph you created may be a different result).
Relevant functions are included and graphed to create a distance matrix.
# Calculate distance matrix
distance_matrix = squareform(pdist(data_points))

# Visualize the distance matrix
fig, ax = plt.subplots()
cax = ax.matshow(distance_matrix, cmap='viridis')
# Distance values for matrix cells
for i in range(len(data_points)):
    for j in range(len(data_points)):
        ax.text(j, i, f'{distance_matrix[i, j]:.2f}', ha='center', va='center', color='white')
# Add color bar
cbar = fig.colorbar(cax)
cbar.ax.set_ylabel('Distance', rotation=270, labelpad=15)
# Show chart
plt.title('Distance Matrix')
plt.xlabel('Data Point Index')
plt.ylabel('Data Point Index')
plt.show()
The resulting graph is as shown in Figure 2.
Finally, the dendrogram is drawn. Dendrograms are used as a visual representation of the results of hierarchical clustering analyses. This type of graph is an important tool for understanding similarities and differences between items in a dataset: it makes the similarities between the items and the relationships between the groups easy to observe.
# linkage expects the condensed (1-D) distance vector produced by pdist
condensed_distance_matrix = pdist(data_points)
Z = linkage(condensed_distance_matrix, method='single')

plt.figure(figsize=(10, 5))
dendrogram(Z)
plt.title('Hierarchical Cluster Dendrogram')
plt.xlabel('Data Point Indexes')
plt.ylabel('Distance')
plt.show()
The resulting graph is as shown in Figure 3.
CONCLUSION
In this application, we reviewed the basic concepts of hierarchical clustering and examined its application using Python. Hierarchical clustering is a powerful unsupervised learning technique that allows us to identify natural groupings in our data without the need for labeled examples.
One of the key advantages of hierarchical clustering is the ability to create a hierarchy of clusters that allows us to visualize the structure of the data at different levels of detail. This dendrogram representation helps interpret relationships between different clusters and assists in making informed decisions about the number of clusters to choose.
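Once the linkage matrix has been computed, SciPy's `fcluster` can cut the hierarchy into a chosen number of flat clusters. The following minimal sketch reuses the six random points generated earlier; the choice of two clusters is arbitrary, for illustration only:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

np.random.seed(42)
data_points = np.random.rand(6, 2)

# Same linkage matrix as in the dendrogram step
Z = linkage(pdist(data_points), method='single')

# Cut the hierarchy into (at most) two flat clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # one cluster label per data point
```

Alternatively, `criterion='distance'` cuts the tree at a fixed height, which corresponds to drawing a horizontal line across the dendrogram.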
Throughout the study, we used popular data science libraries such as Python’s NumPy, SciPy, and Matplotlib to effortlessly implement hierarchical clustering. These libraries provide efficient functionality for cluster analysis and visualization, streamlining the implementation process.
However, it is important to note that hierarchical clustering may not always be the best option for every dataset. Depending on the size and dimensionality of the data and the particular problem at hand, other clustering algorithms such as K-means or DBSCAN may yield better results.