This article explains how we can use decision trees for classification problems. We'll start with the key ideas and then dive into different ways of splitting data, such as Gini impurity. With simple examples and clear explanations, we'll walk through building decision trees step by step.
Introduction:
A decision tree is a hierarchical model that organizes data into a tree-like structure to make decisions based on input features. It is nothing but a giant structure of nested if-else conditions.
It is one of the most popular machine learning algorithms. It belongs to the class of supervised learning and can be used for both classification and regression.
In this article we will focus on classification problems. Let's consider the well-known iris dataset as our example data.
Using the decision tree, we can classify iris flowers among three species (setosa, versicolor or virginica) from measurements of petal and sepal length.
Using this example, we can predict the variety of a flower. Before discussing how to construct a decision tree, let's walk through the decisions the resulting tree makes for our example data.
Geometric intuition:
By following a path from the root to a leaf, we arrive at a decision. In our example, if the petal length is less than 2, the flower is Setosa; if not, we look at other features: if the sepal length is less than 1.5, the flower is Versicolor, and otherwise it's Virginica.
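To make the "nested if-else" picture concrete, the path just described can be written as a tiny function. This is only a sketch using the thresholds quoted above; a tree fitted on real iris data would learn its own thresholds:

#A sketch of the decision path above as nested if-else conditions
def classify_iris(petal_length, sepal_length):
    if petal_length < 2:
        return 'Setosa'
    elif sepal_length < 1.5:
        return 'Versicolor'
    else:
        return 'Virginica'

print(classify_iris(petal_length=1.4, sepal_length=5.0))  #Setosa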
Let’s define some important terminologies.
The root node is the top node, where the first decision is taken. Branches are the arrows connecting the nodes. A leaf node is a node after which no further splitting is possible. A parent node is any node that splits into one or more child nodes.
Formulation of a Decision Tree:
The important step in the creation of a decision tree is the splitting of the data. There are different criteria that can be used to find the next split, including:
· Entropy
· Information Gain
· Gini Impurity
We will concentrate on one of them: Gini impurity, which is a criterion for categorical target variables.
Gini Impurity:
Gini impurity measures the likelihood of misclassifying a randomly chosen element in a dataset, with lower values indicating more purity or homogeneity in the dataset. It’s used as a criterion to find the best split when growing a decision tree.
It is calculated as follows:

Gini = 1 − Σ (pᵢ)²

where pᵢ is the fraction of samples at the node belonging to class i, and the sum runs over all classes. For example, a node with 3 'yes' samples and 1 'no' sample has Gini = 1 − (0.75² + 0.25²) = 0.375.
The steps involved in finding the best split using Gini impurity are (a short code sketch follows the list):
· For each candidate feature, split the data into subsets and calculate the Gini impurity of each subset.
· Compute the Gini impurity of the split as the weighted average of the subsets' impurities, weighted by subset size.
· Choose the feature that results in the lowest weighted Gini impurity as the splitting feature.
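To make these steps concrete, here is a minimal sketch in Python; gini and weighted_gini are our own helper names, not part of any library:

def gini(labels):
    #Gini impurity: 1 minus the sum of squared class proportions
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(left_labels, right_labels):
    #Gini impurity of a split: child impurities weighted by child size
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + (len(right_labels) / n) * gini(right_labels)

print(gini(['yes', 'yes', 'no', 'no']))  #0.5, maximally mixed
print(gini(['yes', 'yes', 'yes']))       #0.0, perfectly pure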
Let’s understand it with the help of an example:
Example of Decision Tree with binary data:
The initial step involves determining the feature to be chosen as the root node. This selection is based on the calculation of the Gini impurity for each feature. The feature with the lowest Gini impurity is deemed suitable as the root node. To proceed, we will compute the Gini impurity for both features and assess their suitability as potential root nodes.
For each feature ('Likes Stitching' and 'Interest in Fashion'), we split the data into two subsets, one per feature value, compute the Gini impurity of each subset, and then take the weighted average of the two as the Gini impurity of that split.
From these calculations we can see that the weighted Gini impurity for 'Interest in Fashion' is the lower of the two, at 0.17, so it is chosen as the root node. In this example, only one feature then remains, and we can build the final decision tree.
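The original data table isn't reproduced above, so as a stand-in here is the same comparison on a small made-up dataset, reusing the gini and weighted_gini helpers sketched earlier. Note that these numbers are illustrative only and do not reproduce the 0.17 from the calculation above:

#Illustrative only: a made-up binary dataset, not the article's original numbers
#Columns: likes_stitching, interest_in_fashion, label
rows = [
    (1, 1, 'yes'), (1, 1, 'yes'), (0, 1, 'yes'),
    (1, 0, 'no'), (0, 0, 'no'), (0, 0, 'no'),
]
for i, feature in enumerate(['Likes Stitching', 'Interest in Fashion']):
    yes_side = [label for *values, label in rows if values[i] == 1]
    no_side = [label for *values, label in rows if values[i] == 0]
    print(feature, round(weighted_gini(yes_side, no_side), 2))
#Prints 0.44 for Likes Stitching and 0.0 for Interest in Fashion,
#so Interest in Fashion would be chosen as the root node here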
Understanding Decision Tree with Practical Implementation
The input data consists of three independent variables: ‘Temperature’, ‘Humidity’, and ‘Wind Speed’, along with one target variable: ‘Rainfall’. The target variable ‘Rainfall’ is a binary outcome denoting whether it rained on a particular day or not.
After importing the necessary libraries, we separate the input features into a variable X and the output into y. The article does not show the data-loading step, so the snippet below assumes the data sits in a hypothetical rainfall.csv loaded into a DataFrame df.
A Decision Tree Classifier is instantiated and trained on the data.
The trained decision tree is visualized using plot_tree().
#Importing the libraries
import pandas as pd
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

#Loading the data; 'rainfall.csv' is a hypothetical file name,
#as the original article does not show the data-loading step
df = pd.read_csv('rainfall.csv')

X = df[['Temperature', 'Humidity', 'Wind speed']]
y = df['Rainfall']
features = ['Temperature', 'Humidity', 'Wind speed']

#Fitting the model
clf = DecisionTreeClassifier()
clf = clf.fit(X, y)

#Plotting the tree
tree.plot_tree(clf, feature_names=features)
plt.show()
The tree generated from this code is below:
Gini impurity ranges from 0 to 0.5 for binary classification, where 0 indicates a perfectly pure node (all instances belong to the same class) and 0.5 signifies maximum impurity (the classes are equally mixed).
From the above figure we can see that the Gini impurity score is displayed in each node. The decision tree algorithm selects the root node's feature and threshold by iteratively evaluating impurity measures for all features, as we did in the binary-data example above, and it selects the split that minimizes impurity, creating the most effective split for segregating the data into distinct classes.
For example, the root node divides the data based on the 'Wind speed' feature: all data points where Wind speed ≤ 11 go to the left side of the tree, and those greater than 11 go to the right. The Gini impurity of 0.5 shown at the root measures how mixed the classes are at that node before the split.
The Gini impurity is 0 for the leaf nodes, which shows that these are 'pure' nodes and no further splitting is needed. As we have already seen, the feature with the smallest weighted Gini impurity is selected for splitting a node.
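Classifying a new observation simply follows these splits from the root down to a leaf. As a quick illustration, we can ask the trained classifier about a hypothetical day (the feature values below are made up):

#Hypothetical new observation; the values are made up for illustration
sample = pd.DataFrame([[25, 80, 9]], columns=['Temperature', 'Humidity', 'Wind speed'])
print(clf.predict(sample))  #follows the splits from the root down to a leaf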
Pruning
Pruning is an important technique to prevent overfitting in decision trees. By selectively removing nodes or sub-nodes that are not significant, pruning enhances the model’s performance and generalization ability. It involves two primary approaches:
1) Pre-pruning
2) Post-pruning.
In pre-pruning, the growth of the tree halts prematurely, preventing the addition of nodes with low importance during tree construction. Conversely, post-pruning occurs after the tree has been fully developed to its maximum depth. At this stage, nodes are pruned based on their importance, effectively trimming branches with minimal relevance.
Basically pruning serves as a strategic mechanism to streamline decision trees, ensuring that only the most informative features are retained. By striking a balance between complexity and predictive power, pruning fosters models that are more robust and less susceptible to overfitting, thereby improving their performance in real-world scenarios.
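In scikit-learn, pre-pruning corresponds to constructor parameters that cap the tree's growth, such as max_depth and min_samples_leaf, while post-pruning is available through minimal cost-complexity pruning via the ccp_alpha parameter. A minimal sketch, reusing X and y from the earlier code:

#Pre-pruning: stop growth early by capping depth and leaf size
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
pre_pruned.fit(X, y)

#Post-pruning: grow the tree fully, then prune weak branches;
#0.01 is an arbitrary example value; in practice ccp_alpha is tuned,
#e.g. using clf.cost_complexity_pruning_path(X, y)
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)
post_pruned.fit(X, y)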
Conclusion:
In this article, we explored the construction of decision trees for classification problems, focusing on finding the optimal split of the data. We introduced Gini impurity as a method for determining the best split and illustrated it both with a hand calculation and with a practical implementation in scikit-learn in Python. However, it's essential to acknowledge the limitations of decision trees, particularly their susceptibility to overfitting, where they may capture noise in the data rather than true patterns. Thus, while decision trees are powerful tools for classification, careful consideration must be given to prevent overfitting and ensure accurate predictions.