![](https://crypto4nerd.com/wp-content/uploads/2023/11/0uFkThsQOHm0CKVw2.jpeg)
Decision trees
Decision trees are also popularly known as CART (which stands for classification and regression trees).
Each decision tree consists of a root node, branches, and leaf nodes. The internal nodes of the tree represent tests on the input features. Decision trees can be used to solve both classification and regression problems. The algorithm can be viewed as a tree-like structure whose tuned split parameters are used to predict the result. Decision trees apply a top-down approach to the data set they are trained on.
The decision tree algorithm works like a set of nested if-else statements in which successive conditions are checked until the model reaches a conclusion.
The decision nodes (or simply nodes) of the tree are the questions the tree asks as the data passes through it, starting at the root node. A branch or subtree is a subsection of the entire tree. Each edge corresponds to an answer to a question, and the final answer is represented by a leaf (terminal) node, which holds the class distribution.
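As a minimal sketch (the feature names and thresholds below are made up for illustration, not taken from any real data set), the prediction path of a small tree can be written as nested if-else checks:
# Hypothetical depth-2 tree expressed as nested if-else statements
def predict(age, balance):
    if age <= 40:                      # root node question
        if balance <= 1000:            # internal decision node question
            return "no"                # leaf node: predicted class
        return "yes"                   # leaf node
    return "yes"                       # the other branch ends directly in a leaf

print(predict(age=35, balance=500))    # root -> left branch -> left leaf -> "no"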
How are decision trees used in classification?
The decision tree algorithm uses a data structure called a tree to predict the outcome of a particular problem. Since the decision tree follows a supervised approach, the algorithm is fed a collection of preprocessed data. This data is used to train the algorithm.
Decision trees follow a top-down approach, meaning that the root node is always at the top of the structure while the results are represented by the leaves. Decision trees are built using a heuristic called recursive partitioning (commonly known as divide and conquer): starting from the root, each node is recursively split into child nodes.
The key idea is to use the decision tree to partition the data space into dense and sparse regions. Splits can be binary or multiway. The algorithm keeps splitting until the data in each node is sufficiently homogeneous. At the end of training, a decision tree is returned that can be used to make predictions.
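A rough, self-contained sketch of this recursive partitioning idea follows (a teaching illustration, not the scikit-learn implementation; the split criterion here simply maximizes the purity of the child nodes, and the tiny rows list is invented):
# Illustrative recursive partitioning (divide and conquer); not the scikit-learn implementation.
# Each row is (features, label), where features is a list of numbers.
from collections import Counter

def purity(labels):
    # Fraction of rows belonging to the node's majority class (1.0 = perfectly homogeneous)
    return max(Counter(labels).values()) / len(labels)

def best_split(rows):
    # Pick the (feature, threshold) whose children have the highest weighted purity
    best, best_score = None, -1.0
    for f in range(len(rows[0][0])):
        for t in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= t]
            right = [y for x, y in rows if x[f] > t]
            if not left or not right:
                continue
            score = (len(left) * purity(left) + len(right) * purity(right)) / len(rows)
            if score > best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, depth=0, max_depth=3):
    labels = [y for _, y in rows]
    if len(set(labels)) == 1 or depth == max_depth:   # stop: node homogeneous or tree deep enough
        return Counter(labels)                        # leaf node holds the class distribution
    split = best_split(rows)
    if split is None:                                 # no useful split found
        return Counter(labels)
    f, t = split
    left = [r for r in rows if r[0][f] <= t]
    right = [r for r in rows if r[0][f] > t]
    return {"feature": f, "threshold": t,
            "left": build_tree(left, depth + 1, max_depth),
            "right": build_tree(right, depth + 1, max_depth)}

# Invented example: rows are ([age, balance], label)
rows = [([25, 500], "no"), ([30, 900], "no"), ([45, 1500], "yes"), ([50, 2000], "yes")]
print(build_tree(rows))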
An important concept in this algorithm is entropy. It can be thought of as a measure of the uncertainty of a given data set, and its value describes the degree of randomness at a particular node. Entropy is high when the classes at a node are mixed in roughly equal proportions, so the model has little confidence in its predictions there.
The higher the entropy, the greater the randomness in the data set. When building a decision tree, lower entropy is preferred.
Another metric used for a similar purpose is the Gini index, which can likewise be used to choose split points. Information gain is the metric generally used to measure the reduction in uncertainty produced by a split.
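As a small, self-contained illustration (plain Python with invented label lists rather than the bank data), entropy, the Gini index, and the information gain of a candidate split can be computed like this:
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, left, right):
    # Reduction in entropy obtained by splitting `parent` into `left` and `right`
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

parent = ["yes"] * 5 + ["no"] * 5                 # maximally mixed node
left, right = ["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4
print(entropy(parent))                            # 1.0 -> highest uncertainty for two classes
print(gini(parent))                               # 0.5
print(information_gain(parent, left, right))      # > 0: the split reduces uncertainty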
Splits in a decision tree
As the number of splits in a decision tree increases, the time required to build the tree also increases. Moreover, trees with a large number of splits are prone to overfitting, which hurts accuracy on unseen data. This can be managed by choosing an appropriate value for the max_depth parameter: as its value increases, the number of splits also increases.
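A minimal sketch of this effect (using scikit-learn on a synthetic data set, since the bank data is introduced later): a larger max_depth allows more splits and typically higher training accuracy, but test accuracy can start to drop as the tree overfits.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (2, 5, None):        # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth, tree.get_n_leaves(),
          tree.score(X_train, y_train), tree.score(X_test, y_test))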
Advantages of decision tree algorithm
- Extremely fast classification of unknown records.
- It ignores features that are of little or no importance in the prediction.
- Extremely efficient, provided the parameters are set optimally.
- Inexpensive to build, with logic that is easy to interpret.
Limitations of the decision tree algorithm
- Decision tree classifiers often tend to overfit the training data.
- Small changes to the data can cause large changes to the resulting tree.
- Large trees can sometimes be very difficult to interpret.
- They are biased towards splits on features with many levels.
Now let's work through an example with bank marketing campaign data. The objective is to predict whether the campaign is effective for a given client.
Let’s import libraries:
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
plt.rc("font", size=14)
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score, roc_curve, f1_score, recall_score, precision_score
import statsmodels.api as sm
import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)# Balanceo de variable objetivo
from imblearn.combine import SMOTEENN
from imblearn.combine import SMOTETomek
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('banking.txt', sep=',')
df.head()
df.info()
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             41188 non-null  int64
 1   job             41188 non-null  object
 2   marital         41188 non-null  object
 3   education       41188 non-null  object
 4   default         41188 non-null  object
 5   housing         41188 non-null  object
 6   loan            41188 non-null  object
 7   contact         41188 non-null  object
 8   month           41188 non-null  object
 9   day_of_week     41188 non-null  object
 10  duration        41188 non-null  int64
 11  campaign        41188 non-null  int64
 12  pdays           41188 non-null  int64
 13  previous        41188 non-null  int64
 14  poutcome        41188 non-null  object
 15  emp_var_rate    41188 non-null  float64
 16  cons_price_idx  41188 non-null  float64
 17  cons_conf_idx   41188 non-null  float64
 18  euribor3m       41188 non-null  float64
 19  nr_employed     41188 non-null  float64
 20  y               41188 non-null  int64
dtypes: float64(5), int64(6), object(10)
memory usage: 6.6+ MB
Data exploration
df['education'].unique()
### Result
array(['basic.4y', 'unknown', 'university.degree', 'high.school',
'basic.9y', 'professional.course', 'basic.6y', 'illiterate'],
dtype=object)
The education column has many values; we can group the basic levels together.
df['education']=np.where(df['education'] =='basic.9y', 'Basic', df['education'])
df['education']=np.where(df['education'] =='basic.6y', 'Basic', df['education'])
df['education']=np.where(df['education'] =='basic.4y', 'Basic', df['education'])
df['education'].unique()
### Result
array(['Basic', 'unknown', 'university.degree', 'high.school',
'professional.course', 'illiterate'], dtype=object)
Let’s analyze the target variable
df['y'].value_counts()
### Result
y
0 36548
1 4640
sns.countplot(x='y', data=df, palette='hls')
plt.show()
count_no_sub = len(df[df['y']==0])
count_sub = len(df[df['y']==1])
pct_of_no_sub = count_no_sub/(count_no_sub+count_sub)
print("Percent no subscriptión", pct_of_no_sub*100)
pct_of_sub = count_sub/(count_no_sub+count_sub)
print("Percent subscriptión", pct_of_sub*100)### Result
Percent no subscriptión 88.73458288821988
Percent subscriptión 11.265417111780131
The classes are imbalanced. We must address this before passing the data to a machine learning algorithm.
Visualizations
Let’s do univariate analysis
%matplotlib inline
pd.crosstab(df.job,df.y).plot(kind='bar')
plt.title('Purchase frequency by job title')
plt.xlabel('Job')
plt.ylabel('Purchase frequency')
The frequency of purchasing the deposit depends largely on the job. Therefore, the job position can be a good predictor of the outcome variable.
table=pd.crosstab(df.marital,df.y)
table.div(table.sum(1).astype(float),axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked bar chart of marital status vs. purchase')
plt.xlabel('Marital status')
plt.ylabel('Proportion of customers')
Marital status does not appear to be a strong predictor of the outcome variable.
table=pd.crosstab(df.education,df.y)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked bar chart of education vs. purchase')
plt.xlabel('Education')
plt.ylabel('Proportion of customers')
Education seems to be a good predictor of the outcome variable.
pd.crosstab(df.day_of_week,df.y).plot(kind='bar')
plt.title('Purchase frequency by day of the week')
plt.xlabel('Day of the week')
plt.ylabel('Purchase frequency')
The day of the week may not be a good predictor of outcome.
pd.crosstab(df.month,df.y).plot(kind='bar')
plt.title('Purchase frequency by month')
plt.xlabel('Month')
plt.ylabel('Purchase frequency')
The month could be a good predictor of the outcome variable.
df.age.hist()
plt.title('Age histogram')
plt.xlabel('Age')
plt.ylabel('Frequency')
pd.crosstab(df.poutcome,df.y).plot(kind='bar')
plt.title('Purchase frequency by poutcome')
plt.xlabel('Poutcome')
plt.ylabel('Purchase frequency')
The previous campaign outcome (poutcome) appears to be a good predictor of the outcome variable.
We create dummy variables, that is, 0/1 indicator variables, for the categorical columns.
cat_vars=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome']
data = df
for var in cat_vars:
    cat_list = pd.get_dummies(data[var], prefix=var)
    data = data.join(cat_list)
cat_vars=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome']
data_vars=data.columns.values.tolist()
to_keep=[i for i in data_vars if i not in cat_vars]
data_final=data[to_keep]
data_final.columns.values
We can do an analysis of the correlation coefficients, looking for a linear relationship between the predictor variables and the target variable.
mask = np.tril(data_final.corr())
fig, ax = plt.subplots(figsize=(60,20))
sns.heatmap(data_final.corr(), fmt='.1g', annot=True, cmap= 'cool', mask=mask)
correlation = data_final.corr()['y'].to_frame()
correlation[(correlation['y']>0.1)|(correlation['y']<-0.1)]
Oversampling
To balance the classes, we will oversample the minority class (clients who subscribed) using the SMOTEENN and SMOTETomek algorithms, both built on SMOTE (Synthetic Minority Oversampling Technique). At a high level, SMOTE:
- Works by creating synthetic samples of the minority class instead of creating copies.
- Randomly chooses one of the k nearest neighbors and uses it to create new observations that are similar, but randomly perturbed.
- SMOTEENN combines SMOTE oversampling with Edited Nearest Neighbours undersampling, while SMOTETomek combines SMOTE with Tomek links to remove noise from the synthetic examples created.
X = data_final.loc[:, data_final.columns != 'y']
y = data_final.loc[:, data_final.columns == 'y']
# We split the data first so the resampling does not introduce noise into the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
columns = X_train.columns
smoteenn = SMOTEENN()
X_smoteenn, y_smoteenn = smoteenn.fit_resample(X_train, y_train)
sns.countplot(x='y', data=pd.DataFrame(y_smoteenn), palette='hls')
plt.show()
We can see that, compared to what we had initially, the target variable is more balanced, but some imbalance remains. Let's try SMOTETomek.
# SMOTETomek - removes the noise generated by the synthetic examples created
smt = SMOTETomek()
X_smotetomek, y_smotetomek = smt.fit_resample(X_train, y_train)
sns.countplot(x='y', data=pd.DataFrame(y_smotetomek), palette='hls')
plt.show()
Now our class is fully balanced and we are ready to begin modeling.
Training
Recursive feature elimination (RFE) is based on the idea of repeatedly fitting a model, identifying the best- or worst-performing feature, setting that feature aside, and then repeating the process with the remaining features until all features in the data set have been considered. The goal of RFE is to select features by recursively considering smaller and smaller feature sets.
The main parameters are:
- estimator: a machine learning estimator with a fit method that provides information about feature importance.
- n_features_to_select: the number of features to select.
- step: the number of features removed at each iteration, if it is greater than or equal to 1; if it is between 0 and 1, it corresponds to the fraction of features removed at each iteration.
- importance_getter: if 'auto', uses the feature importance exposed through the estimator's coef_ or feature_importances_ attribute.
model = DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=42)
rfe = RFE(estimator = model, n_features_to_select = 20, step=1)
data_final_vars=data_final.columns.values.tolist()
y=['y']
X=[i for i in data_final_vars if i not in y]
rfe = rfe.fit(X_smotetomek, y_smotetomek.values.ravel())
print(rfe.support_)
print(rfe.ranking_)
cols = rfe.get_feature_names_out()
X=X_smotetomek[cols]
y=y_smotetomek['y']
mask = np.tril(X.corr())
fig, ax = plt.subplots(figsize=(60,20))
sns.heatmap(X.corr(), fmt='.1g', annot=True, cmap= 'cool', mask=mask)
# Keep the original (non-resampled) test split aside
x_test_data = X_test
y_test_data = y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('Accuracy of decision tree classifier on test set: {:.2f}'.format(model.score(X_test, y_test)))
### Result
Accuracy of decision tree classifier on test set: 0.90
cm = confusion_matrix(y_test, y_pred)
from mlxtend.plotting import plot_confusion_matrix
fig, ax = plot_confusion_matrix(conf_mat=cm, figsize=(6, 6), cmap=plt.cm.Greens)
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.93      0.86      0.89      7622
           1       0.87      0.94      0.90      7599

    accuracy                           0.90     15221
   macro avg       0.90      0.90      0.90     15221
weighted avg       0.90      0.90      0.90     15221
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1-Score:', f1_score(y_test, y_pred))
### Result
Precision: 0.8678946084410832
Recall: 0.9363074088695881
F1-Score: 0.9008039501171109
print(recall_score(y_test, y_pred, average='macro'))
print(recall_score(y_test, y_pred, average='micro'))
print(recall_score(y_test, y_pred, average='weighted'))
### Result
0.8971093591186041
0.8970501281124762
0.8970501281124762
dt_roc_auc = roc_auc_score(y_test, model.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Decision Tree (area = %0.2f)' % dt_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers. The dotted line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (towards the top left corner).
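One possible refinement (a sketch using the same fitted model, X_test, and y_test as above): computing the ROC AUC from the predicted probabilities rather than the hard class labels, so the score reflects the whole curve rather than a single threshold.
# AUC from predicted probabilities (scores), not from 0/1 predictions
proba_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print('ROC AUC from predicted probabilities: {:.2f}'.format(proba_auc))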