![](https://crypto4nerd.com/wp-content/uploads/2023/02/1yRmEIN8iEkIwu79nGuWGLg-1024x250.png)
The Decision Tree is a Machine Learning algorithm that has the ability to deal with both categorical and continuous data and can be used to predict both regression and classification values. It is a binary tree with a root node and leaf nodes: prediction starts at the root node and moves down the tree based on if-else conditions. There are various algorithms used to develop a decision tree, namely:
a. CART
b. ID3
c. CHAID
d. C4.5
The main quantities that play an important role in developing the tree are Entropy and the Gini Index. Entropy describes the amount of information needed to accurately describe the data. So, if the data is homogeneous then the entropy is 0 (that is, pure), while if the classes are equally divided then the entropy moves towards 1 (that is, impure, for a binary split).
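The entropy calculation above can be sketched in a few lines of plain Python (the label lists here are just made-up examples):

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels.

    0 for a pure node; 1 for a perfectly even binary split.
    """
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

print(entropy([1, 1, 1, 1]))  # homogeneous node -> 0.0
print(entropy([1, 1, 0, 0]))  # 50/50 binary split -> 1.0
```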
The Gini index is 1 minus the sum of the squared probabilities of each class. A value of 0 means the samples are perfectly homogeneous, while larger values (up to 0.5 for a binary split) mean maximal inequality among the classes.
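The Gini impurity follows the same pattern, again with made-up label lists:

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities.

    0 for a pure node; 0.5 for a perfectly even binary split.
    """
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(gini([0, 0, 0, 0]))  # pure node -> 0.0
print(gini([0, 0, 1, 1]))  # 50/50 binary split -> 0.5
```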
To illustrate the application of the Decision Tree, the SUV dataset is considered. The dataset consists of 5 columns and 400 rows.
Dataset Link: https://www.kaggle.com/datasets/iamaniket/suv-data
The objective is to determine which category of people will purchase the SUV. We performed an initial analysis of the data by scanning for null values but did not find any. To check whether multicollinearity exists, a correlation matrix is developed; ‘Age’ and ‘Estimated Salary’ have a good correlation with our target variable ‘Purchased’, hence they are considered as features for prediction.
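A minimal sketch of that initial analysis with pandas, using a small hypothetical frame in place of the Kaggle file (the column names here are assumed from the description above):

```python
import pandas as pd

# Hypothetical stand-in for the SUV data; in practice this would be
# pd.read_csv(...) on the downloaded Kaggle file.
df = pd.DataFrame({
    "Age":             [25, 40, 35, 50, 30, 45],
    "EstimatedSalary": [30000, 70000, 50000, 90000, 35000, 80000],
    "Purchased":       [0, 1, 0, 1, 0, 1],
})

print(df.isnull().sum())  # scan for null values per column
print(df.corr())          # correlation matrix between the numeric columns
```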
Next, the dataset is split into the target variable Y (Purchased) and the features X (Age and Estimated Salary), and then subdivided into training (75%) and test (25%) sets.
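The 75/25 split can be done with scikit-learn's `train_test_split`; the arrays below are placeholders with the same 400-row shape as the dataset:

```python
from sklearn.model_selection import train_test_split

# Placeholder features (Age, EstimatedSalary) and target (Purchased),
# 400 rows like the SUV dataset.
X = [[25, 30000], [40, 70000], [35, 50000], [50, 90000]] * 100
y = [0, 1, 0, 1] * 100

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print(len(X_train), len(X_test))  # 300 100
```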
The model is then trained on the training set and predictions are made on the test set. Since accuracy alone is not always the right metric to judge the model, we also use the classification report and the confusion matrix.
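A sketch of the training and evaluation step, using synthetic two-feature data in place of Age and Estimated Salary:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Synthetic two-feature, two-class data standing in for the SUV features.
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```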
From the confusion matrix, we can see that the model predicted 61 True Positive values and 29 True Negative values. The accuracy can be calculated as below:
TP = 61, FP = 7, FN = 3 and TN = 29
Accuracy = (TP + TN) / (TP + FP + FN + TN) = (61 + 29) / (61 + 7 + 3 + 29) = 90/100 = 90%
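The same arithmetic, checked in code with the counts above:

```python
# Counts taken from the confusion matrix above.
TP, FP, FN, TN = 61, 7, 3, 29

accuracy = (TP + TN) / (TP + FP + FN + TN)
print(accuracy)  # 0.9, i.e. 90%
```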
A classification report can be generated based on the confusion matrix. From it we notice the model has done a great job: its precision (how many of the predicted positives were correct) is 88%, and its recall (how many of the actual positives it correctly classified) is 90%.
The scatter plot shows us how well the model has performed: it was able to properly classify the yellow and green points. The decision boundaries are developed based on the levels of the tree; the more levels, the more decision boundaries are created. Limiting the depth of the tree is called pre-pruning, and it helps avoid overfitting the dataset.
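In scikit-learn, pre-pruning corresponds to the `max_depth` parameter (among others such as `min_samples_split`). A quick sketch on synthetic data comparing a depth-limited tree with a fully grown one:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature data standing in for the SUV features.
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Pre-pruned: growth is stopped at depth 4.
shallow = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
# Unrestricted: the tree grows until every leaf is pure.
deep = DecisionTreeClassifier(random_state=0).fit(X, y)

print(shallow.get_depth(), deep.get_depth())
```

The unrestricted tree grows much deeper, carving out many tiny decision regions that tend to fit the noise in the training data.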
The plots show the performance of the model on the dataset at depth 4, at which it was able to separate class “0” and class “1” almost perfectly, with only a few wrong predictions.
Comparing the results using Entropy and the Gini Index as the criterion to build the tree, both yield the same results on this dataset.
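This comparison maps directly onto the `criterion` parameter of `DecisionTreeClassifier`; a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature data standing in for the SUV features.
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

scores = {}
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, max_depth=4,
                                 random_state=0).fit(X, y)
    scores[criterion] = clf.score(X, y)
    print(criterion, scores[criterion])
```

On many datasets the two criteria choose very similar splits, which is why the results here match.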
From both tree plots we can deduce that the splitting is based on the Gini index or entropy value: samples with smaller feature values go to the left side and larger values to the right. The plot also shows how many samples remain in each split. As the tree is pre-pruned, the depth is capped at four.
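A plot like this is usually produced with `sklearn.tree.plot_tree`; the same structure can also be inspected as text with `export_text`, which prints the threshold, direction, and remaining samples at each split (again on synthetic data, with the feature names assumed from the dataset description):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic two-feature data standing in for Age / EstimatedSalary.
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Text rendering of the fitted tree: "<=" branches go left, ">" go right,
# mirroring the smaller-left / larger-right layout of the plotted tree.
rules = export_text(clf, feature_names=["Age", "EstimatedSalary"])
print(rules)
```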