![](https://crypto4nerd.com/wp-content/uploads/2023/11/1-ElJVx-fBpBQ2SQzQkLOoA-1024x1283.jpeg)
Now we will talk about some commonly used tree-based models: what is under the hood for each of the following algorithms, how they work in classification and regression settings, and finally how some of them compare:
- Random Forest
- Gradient Boost
- XGBoost
- AdaBoost
2.1 What is a random forest?
Random forest is a popular ensemble machine learning model. Ensemble methods combine the predictions of multiple base models (often referred to as weak learners) to create a stronger and more robust model.
The fundamental building block is a decision tree, where each internal (parent) node represents a decision or test on a feature column and each leaf node represents the output or prediction of the model.
What makes this model effective are 2 things:
- Bootstrapped Data set: Each tree is trained using a randomly sampled dataset from the training data.
- Random features: A random subset of features is considered at each split (at each node).
Both these points mean that each tree is grown completely independently of the others.
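To make these two ingredients concrete, here is a minimal Python sketch of how one tree of the ensemble could be grown; scikit-learn's decision tree already handles the per-split feature subsampling through its `max_features` argument, so the sketch only adds the bootstrap step on top. The toy dataset and parameter values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data just for illustration
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

rng = np.random.default_rng(0)
n_samples = X.shape[0]

# 1) Bootstrapped dataset: sample rows with replacement
idx = rng.integers(0, n_samples, size=n_samples)
X_boot, y_boot = X[idx], y[idx]

# 2) Random features: only a random subset of features is considered at each split
#    (max_features="sqrt" tells the tree to draw sqrt(p) candidate features per split)
tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
tree.fit(X_boot, y_boot)

# A random forest repeats the two steps above N times and aggregates the trees.
```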
Advantages of random forest include: high accuracy, robustness, the ability to handle high-dimensional data, built-in feature importance, and suitability for both regression and classification.
2.2 How is a random forest created? How do we measure its accuracy?
Steps for constructing a random forest:
STEP 1:
Create N bootstrapped datasets.
STEP 2:
Train a decision tree on each of these bootstrapped datasets, but consider only a random subset of the variables at each split. The algorithm is therefore not even allowed to consider a majority of the available predictor variables.
This is clever because, say there is one strong predictor and a few moderately strong predictors; most plain bagged trees will then use the strong predictor at the top, so all the bagged trees will look similar and their predictions will be highly correlated. Unfortunately, averaging many highly correlated quantities does not lead to as large a reduction in variance as averaging many uncorrelated quantities. In particular, this means that bagging will not lead to a substantial reduction in variance over a single tree in this setting.
For example: if the original dataset has 4 features, 2 are randomly selected as candidates for the root node and the one that gives the better split is used. At each child node, another random subset of 2 features is drawn to continue growing the tree.
Now, steps 1 and 2 are repeated N times to create a forest of N trees. To make a prediction, the outputs of the individual trees are averaged for regression problems and a majority vote is taken for classification problems; these aggregated predictions are then evaluated by comparing them with the true values.
For hyper-parameter tuning, we can play with N (the number of trees in the forest), p (the number of feature columns randomly considered at each split in STEP 2), and the depth of each tree.
To measure the accuracy of the model, depending on the use case:
- For classification problem: We can use Precision, Recall, Accuracy, F1, AUC and ROC curve
- For regression problem: We can use MAE, MSE, RMSE, R-squared
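As a quick illustration of the hyper-parameters and metrics mentioned above, here is a minimal scikit-learn sketch for a classification use case; the dataset and parameter values are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# N = n_estimators, p = max_features, tree size controlled by max_depth
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                max_depth=None, random_state=0)
forest.fit(X_train, y_train)

pred = forest.predict(X_test)
proba = forest.predict_proba(X_test)[:, 1]
print("Accuracy:", accuracy_score(y_test, pred))
print("F1      :", f1_score(y_test, pred))
print("ROC AUC :", roc_auc_score(y_test, proba))
```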
2.3 What is gradient boosting? What are the steps to create a gradient boosting model in context of regression?
Gradient boosting is very similar to random forest, with the exception that each tree is grown sequentially, based on the residuals left by the previous trees. It is a boosting algorithm, as described in Section 1.
The idea is to use the gradient descent technique to sequentially reduce a loss function (something like the Sum of Squared Residuals, SSR). Instead of just averaging predictions as in the random forest model, each new boosted tree is fit to the residuals of the current model and its contribution is scaled by a learning rate before being added to the running prediction.
STEP 1: Initialize the model with a constant value for prediction (Average)
Gradient boosting starts by creating a single leaf, instead of a stump or a full-sized tree. The value of the leaf represents the initial prediction/guess for all instances. For regression problems, it is the average of all y-values.
For eg: If we want to predict the weights of athletes, let’s assume that 72 is the average weight for all athletes that are in the training set. So, 72 becomes our initial prediction/guess.
STEP 2: Calculate the residuals
In step 2, residuals are computed for each instance based on the initial guess from step 1 (i.e., 72). A new tree of a given size is then constructed to predict these residuals. The new predictions are computed using the following formula.
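In standard gradient-boosting notation, with $F_{m-1}(x)$ the current prediction, $h_m(x)$ the new tree fit to the residuals, and $\nu$ the learning rate, this update takes the form:

$$ F_0(x) = \bar{y}, \qquad F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \qquad 0 < \nu \le 1 $$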
The learning rate is important because it ensures that we take small steps in the right direction. This results in better predictions for the testing dataset, i.e., lower variance. The learning rate takes a value between 0 and 1.
STEP 3: Repeat
In step 3, we repeat the same procedure as in step 2. We use the current predictions (ŷ's) from step 2 to calculate residuals, construct a tree that tries to predict these residuals, and then use the update equation from step 2 to make new predictions ŷ. The learning rate can be interpreted as a step size that helps us reach the minimum of the sum of residuals in small steps. So the idea here is that, with every new tree, the sum of residuals is reduced compared to its predecessor.
We keep building trees until we reach the maximum number specified, or until adding additional trees no longer significantly reduces the size of the residuals.
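The loop described in steps 1–3 can be sketched in a few lines of Python. This is a simplified illustration rather than a production implementation, and the hyper-parameter values are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_regression(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    # STEP 1: the initial prediction is a single leaf: the mean of y
    f0 = np.mean(y)
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        # STEP 2: fit a small tree to the current residuals
        residuals = y - pred
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # scale the tree's contribution by the learning rate
        pred += learning_rate * tree.predict(X)
        trees.append(tree)  # STEP 3: repeat with the updated predictions
    return f0, trees

def predict(X, f0, trees, learning_rate=0.1):
    # use the same learning rate that was used during training
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```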
The advantages of this method include: high accuracy, flexibility, feature importance, robustness, and good handling of missing values and outliers.
2.4 How does the gradient boosting algorithm work in context of classification?
Gradient boosting for classification is quite similar to gradient boosting for regression, with the following adjustments:
STEP 1: Initialize the model with a constant value for prediction (Log Odds)
The initial prediction again consists of a single leaf node, but instead of taking the average of the response as the initial guess, we work with probabilities. We do this by computing the log(odds) and then using the logistic function to transform it into a probability, as follows:
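For a binary target, these two quantities are usually written as:

$$ \log(\text{odds}) = \log\!\left(\frac{\#\,\text{positive instances}}{\#\,\text{negative instances}}\right), \qquad p = \frac{e^{\log(\text{odds})}}{1 + e^{\log(\text{odds})}} $$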
NOTE: Because we are dealing with just one leaf node as our initial prediction, all instances will have the same value for their prediction, i.e., the probability obtained from the logistic transform above.
STEP 2: Calculate the residuals
Just like step 2 in gradient boosting for regression, we compute residuals for each instance based on the probability calculated above and build a tree to predict these residuals. Each tree is scaled by a learning rate, and the idea is again to take smaller steps in the right direction, similar to the regression scenario. However, since each leaf node of this tree contains residuals on the probability scale, we cannot directly plug the leaf values into the update formula we used in regression.
There is a need to convert the residuals in each leaf into log(odds), which can later be transformed into probabilities using the logistic function. In other words, the output value of each leaf must be expressed on the log(odds) scale, for which we use the following formula:
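For the log-loss used in binary classification, this leaf output is usually computed as follows, where the sums run over the instances in that leaf, $r_i$ are their residuals, and $p_i$ are their previously predicted probabilities:

$$ \text{output value} = \frac{\sum_i r_i}{\sum_i p_i\,(1 - p_i)} $$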
STEP 3:
Just like gradient boosting for regression, we repeat STEP 2, until we hit the number of trees to be created or there is no further scope for the algorithm to improve.
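For reference, a minimal scikit-learn sketch of this classification workflow; the dataset and hyper-parameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = number of trees, learning_rate = step size, max_depth = size of each tree
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
clf.fit(X_train, y_train)

# predict_proba applies the logistic transform to the accumulated log(odds)
proba = clf.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, proba))
```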
2.5 What are the main ideas behind AdaBoost?
- Unlike random forest, which builds a forest of fully grown trees, AdaBoost combines a lot of weak learners called 'stumps' (a stump is a tree with a single decision node and two leaf nodes) to make predictions.
- Some stumps get more say than others based on their performance, whereas in random forest all trees get the same weight.
- Each stump is made by taking into account the errors of the previous stump.
2.6 What are the steps for creating an AdaBoost?
STEP 1: Assign weights to each sample (1 / #Samples)
The algorithm starts by assigning a weight to each sample. In the first step, each sample gets an equal weight. The weights are modified with each tree: misclassified samples get more weight for the next tree, and the weights of correctly classified samples are reduced by a certain amount. The first stump is created by selecting the feature that gives the best split, i.e., the lowest Gini impurity.
STEP 2: Calculate the amount of say each tree gets
Each tree created in the AdaBoost forest has a certain amount of say in the final prediction. The amount of say that each tree gets in the final outcome is given by the following formula:
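In the standard AdaBoost formulation, with Total Error defined as the sum of the weights of the misclassified samples, this is:

$$ \text{Amount of Say} = \frac{1}{2}\,\log\!\left(\frac{1 - \text{Total Error}}{\text{Total Error}}\right) $$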
STEP 3: Modify sample weights
Once the amount of say is calculated for the first tree, we need to modify the sample weights in an attempt to correctly predict the misclassified samples. The following formulas are used to modify the weights of misclassified samples and correctly classified samples, respectively.
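The standard AdaBoost weight updates take the form below; after updating, the weights are normalized so that they sum to 1:

$$ w_{\text{new}} = w_{\text{old}} \times e^{\,\text{Amount of Say}} \quad \text{(misclassified samples)} $$

$$ w_{\text{new}} = w_{\text{old}} \times e^{\,-\text{Amount of Say}} \quad \text{(correctly classified samples)} $$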
The idea behind these formulas is that if a certain tree makes fewer errors, the amount of say it gets is a larger number. So, when we update the weight of each sample for the next tree, we want to multiply the weights of the samples misclassified by the previous tree by a larger number, so that those samples get a higher weight in future trees, which is exactly what the first formula above gives us. Similar logic applies to correctly classified samples, whose weights are scaled down.
To make sure that the misclassified samples get classified correctly by the subsequent trees, we can create a new dataset of the same length by sampling with replacement from the original dataset.
We can achieve this by taking the cumulative sum of our new normalized sample weights. Because they are normalized, their sum equals 1; when we then draw a random number between 0 and 1, the probability that it falls in the cumulative range of a misclassified sample is higher because of that sample's larger weight.
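A small NumPy sketch of this resampling-by-cumulative-weights idea (equivalent to sampling with replacement using the normalized weights as probabilities; names are illustrative):

```python
import numpy as np

def resample_by_weight(X, y, weights, seed=0):
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()   # normalize so the weights sum to 1
    cum = np.cumsum(weights)            # cumulative ranges covering [0, 1]
    draws = rng.random(len(y))          # random numbers between 0 and 1
    idx = np.searchsorted(cum, draws)   # which cumulative range each draw falls into
    # Samples with larger weights cover wider ranges and therefore appear more often
    return X[idx], y[idx]
```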
STEP 4: Make final classification
In the end, for any given sample, one set of trees will predict Yes and another set will predict No. To get the final classification for this sample, we sum the amounts of say within each set. The sample is classified as Yes if the total amount of say for Yes is greater than that for No, and vice versa.
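Putting the four steps together, here is a minimal scikit-learn sketch; its default base learner is already a stump (a depth-1 tree), and `estimator_weights_` corresponds to each stump's amount of say. Dataset and parameter values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Discrete AdaBoost ("SAMME") with the default depth-1 base learner, i.e. stumps
ada = AdaBoostClassifier(n_estimators=100, algorithm="SAMME", random_state=0)
ada.fit(X, y)

# The "amount of say" of the first few stumps in the final weighted vote
print(ada.estimator_weights_[:5])
```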
2.7 Compare gradient boosting with AdaBoost?
- Like AdaBoost, the gradient boosting algorithm builds each tree based on the previous trees' errors. But unlike AdaBoost, which builds a forest of stumps (weak learners), gradient boosting creates a forest of trees that can be larger than stumps.
- Like AdaBoost, gradient boosting scales its trees, but it scales all of them by the same amount (the learning rate), whereas in AdaBoost the scaling of each stump is a function of the errors that stump makes.
Basically, AdaBoost employs a weighted approach, where the say that each stump gets in the final outcome is a measure of the error it makes in prediction.
2.8 Explain XGBoost algorithm for regression? How does it compare to Gradient Boosting?
XGBoost is very similar to gradient boosting algorithm in the following ways:
- It starts with a leaf node which is a single value that makes the initial predictions.
- Later, a tree is constructed based on the residuals we get from the initial predictions. To make the updated predictions, we use a formula similar to the gradient boosting update, where a learning rate lets us take smaller steps in the right direction (the formula is collected with the others at the end of this answer).
- Later we continue with step 2, i.e., build more trees until we hit the required number of trees or there is no further scope for improvement.
However, the major difference between XGBoost and gradient boosting is the way the trees are constructed in step 2.
- Unlike gradient boosting, which constructs a tree to predict the residuals directly, XGBoost uses a similarity score to make sure that similar residuals (similar observations) end up in the same leaf node. The similarity score is computed using the Similarity formula collected after this list.
- A smaller similarity score implies that the residuals in that node are different and that there is a need to split further.
- After every split, Gain is calculated to measure the quality of the split; its formula is also given below.
- A higher value for Gain suggests that splitting a particular leaf node further adds value to the predictions.
- Once we have the values for Gain, we need to compute the output value for each leaf, using the Output value formula below.
- Final predictions are then computed with the last formula below.
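Collecting the formulas referenced above, in the standard XGBoost notation for squared-error (regression) loss, where the sums run over the residuals $r_i$ in a node, $n$ is the number of residuals in that node, $\lambda$ is the regularization parameter, and $\eta$ is the learning rate:

$$ \text{Similarity} = \frac{\left(\sum_i r_i\right)^2}{n + \lambda} $$

$$ \text{Gain} = \text{Similarity}_{\text{left}} + \text{Similarity}_{\text{right}} - \text{Similarity}_{\text{root}} $$

$$ \text{Output value} = \frac{\sum_i r_i}{n + \lambda} $$

$$ \text{New prediction} = \text{previous prediction} + \eta \times \text{Output value} $$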
2.9 How is pruning done in XGBoost?
Pruning in XGBoost is done using a pruning parameter named gamma. From the previous question, we learnt that for every parent node in the tree we calculate Gain to measure the quality of the split. Once we have a fully grown tree and a Gain value for each node, we start from the bottom-most node and check whether subtracting the value of gamma from the Gain gives a positive or a negative value. If the value is positive, no pruning is done; otherwise the node is pruned, and we continue pruning upward until we find a node that gives a positive value for (Gain − gamma).
2.10 How does XGBoost work for Classification?
XGBoost for classification follows similar steps to those for regression, with the following exceptions:
- The initial prediction is 0.5 by default, which is later converted to log(odds).
- The similarity score used to group residuals in each tree is computed as shown in the first formula below.
- The output value is computed as shown in the second formula below.
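In the standard XGBoost notation for binary classification (log-loss), where $r_i$ are the residuals in a leaf, $p_i$ are the previously predicted probabilities of those instances, and $\lambda$ is the regularization parameter:

$$ \text{Similarity} = \frac{\left(\sum_i r_i\right)^2}{\sum_i p_i\,(1 - p_i) + \lambda} $$

$$ \text{Output value} = \frac{\sum_i r_i}{\sum_i p_i\,(1 - p_i) + \lambda} $$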