In this blog post, we will be discussing the importance of using performance metrics like Mean Absolute Error (MAE) to evaluate machine learning models.
Definition: Mean Absolute Error (MAE) is a measure of errors between paired observations expressing the same phenomenon: it is the average of the absolute errors. Because MAE is in the same units as the predicted target, it is useful for judging whether the size of the error is of concern or not. Taking the mean aggregates the errors over the whole dataset, which helps us understand the model's performance overall. The mean absolute error is one of a number of ways of comparing forecasts with their eventual outcomes.
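To make the definition concrete, here is a minimal sketch of the calculation using made-up sale prices (these numbers are illustrative, not from the dataset used later in this post):

```python
import numpy as np

# Hypothetical actual and predicted sale prices (same units as the target)
y_true = np.array([200_000, 150_000, 310_000])
y_pred = np.array([210_000, 145_000, 300_000])

# MAE = average of the absolute differences
mae = np.mean(np.abs(y_true - y_pred))
print(mae)  # (10000 + 5000 + 10000) / 3 ≈ 8333.33
```

Since MAE is in the target's units, an error of about $8,333 on house prices of $150k-$310k may or may not be acceptable depending on the application.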
How to interpret the Mean Absolute Error (MAE) of your model: the smaller the MAE, the better the model's performance; the closer MAE is to 0, the more accurate the model is. But MAE is returned on the same scale as the target you are predicting, so there is no general rule for interpreting ranges of values: a given MAE can only be judged in the context of your dataset.
MAE can, however, be developed further by calculating the MAPE (Mean Absolute Percentage Error), which is the MAE returned as a percentage. This can make it easier to interpret model performance and compare values across datasets.
Definition: Mean Absolute Percentage Error (MAPE) is the mean of all absolute percentage errors between the predicted and actual values. It is a metric that describes the accuracy of a forecasting method: it averages the absolute percentage error of each entry in a dataset to measure how close the forecasted quantities were to the actual quantities. MAPE is often effective for analysing large datasets, but because each error is divided by the actual value, it requires that the actual values are non-zero.
How to interpret the Mean Absolute Percentage Error (MAPE) of your model: MAPE is the average percentage difference between predictions and their intended targets in the dataset. For example, if your MAPE is 10%, then your predictions are on average 10% away from the actual values they were aiming for.
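As a concrete sketch (again with made-up sale prices), MAPE can be computed directly; note the division by the actual values, which is why zeros in the target are a problem:

```python
import numpy as np

# Hypothetical actual and predicted sale prices
y_true = np.array([200_000, 150_000, 310_000])
y_pred = np.array([210_000, 145_000, 300_000])

# MAPE = average of |error / actual|, expressed as a percentage
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(round(mape, 2))  # ≈ 3.85, i.e. predictions are off by about 3.85% on average
```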
The dataset used in this post: The dataset is a housing dataset presented by De Cock (2011). The data came to him directly from the Ames City Assessor’s Office in the form of a data dump from their records system. The original Excel file contained 113 variables describing 3970 property sales that had occurred in Ames, Iowa between 2006 and 2010. However, so that the dataset could be used as a “layman’s” data set that could be easily understood by users at all levels he removed any variables that required special knowledge or previous calculations for their use. Most of these deleted variables were related to weighting and adjustment factors used in the city’s current modelling system.
The dataset contains 2930 records (rows) and 82 features (columns). Here we find the description of the columns used to predict our target column, Sale Price, i.e. the amount an apartment or house sells for under different conditions.
Models for Prediction
We will use regression models, as we are looking to predict a continuous numerical value.
I chose 6 regression machine learning models, which I will briefly discuss below.
Ridge regression is an extension of linear regression where the loss function is modified to minimize the complexity of the model. This modification is done by adding a penalty parameter that is equivalent to the square of the magnitude of the coefficients. A low alpha value can lead to over-fitting, whereas a high alpha value can lead to under-fitting. In scikit-learn, a ridge regression model is constructed by using the Ridge class.
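A minimal sketch of fitting a ridge model with scikit-learn's Ridge class, on synthetic data (alpha=1.0 is simply the default here, not a tuned value):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data: 100 samples, 3 features, a mostly linear target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=100)

# alpha controls the strength of the squared-coefficient penalty
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
preds = ridge.predict(X)
```

In practice, alpha would be chosen by cross-validation (e.g. with RidgeCV) to balance over- and under-fitting.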
The K-Nearest Neighbour (KNN) algorithm is simple and effective at certain tasks. A simple implementation of KNN regression calculates the average of the numerical targets of the K nearest neighbours. Another approach uses an inverse-distance-weighted average of the K nearest neighbours. KNN regression uses the same distance functions as KNN classification.
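The averaging behaviour is easy to see on a toy example: with k=2 and uniform weights, the prediction for a query point is the plain average of its two nearest neighbours' targets (toy data, not from the housing set):

```python
from sklearn.neighbors import KNeighborsRegressor

X = [[1], [2], [3], [4], [5]]
y = [10, 20, 30, 40, 50]

knn = KNeighborsRegressor(n_neighbors=2)  # uniform weights: plain average
knn.fit(X, y)

# The nearest neighbours of 2.5 are x=2 and x=3, so prediction = (20 + 30) / 2
pred = knn.predict([[2.5]])
print(pred[0])  # 25.0
```

Passing weights="distance" to the constructor would switch to the inverse-distance-weighted average mentioned above.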
Support Vector Regression (SVR) is a supervised learning algorithm used to predict continuous values. The basic idea behind SVR is to find the best-fit line: the hyperplane that keeps as many points as possible within a margin of width epsilon around it, penalising only the points that fall outside that tube.
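A minimal sketch with scikit-learn's SVR on synthetic linear data (the kernel, C, and epsilon values are illustrative choices, not tuned settings from this post):

```python
import numpy as np
from sklearn.svm import SVR

# Noiseless linear data: y = 2x + 1
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1

# epsilon sets the width of the tube in which errors are ignored;
# C penalises points that fall outside it
svr = SVR(kernel="linear", C=100.0, epsilon=0.1)
svr.fit(X, y)
preds = svr.predict(X)
```

For real data, the inputs would normally be scaled first, since SVR is sensitive to feature scales.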
Decision Trees are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A regression tree can be seen as a piecewise-constant approximation. I love Decision Trees because they are simple to understand and interpret, can be visualised, and require little data preparation.
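The piecewise-constant behaviour shows up clearly on a toy example: a depth-1 tree splits the data once and predicts the mean target on each side of the split:

```python
from sklearn.tree import DecisionTreeRegressor

X = [[1], [2], [3], [4]]
y = [10, 10, 40, 40]

# max_depth=1 forces a single split, giving two constant regions
tree = DecisionTreeRegressor(max_depth=1, random_state=0)
tree.fit(X, y)

preds = tree.predict([[1.5], [3.5]])
print(preds)  # [10. 40.]
```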
Random Forest Regression is a supervised learning algorithm that uses an ensemble learning method for regression. Ensemble learning is a technique that combines predictions from multiple machine learning algorithms to make a more accurate prediction than a single model.
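A sketch of the ensemble idea on synthetic data: each of the trees is trained on a bootstrap sample, and the forest's prediction is the average of the trees' predictions (n_estimators here is illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic nonlinear data
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=200)

# 100 trees, each fit on a bootstrap sample; predictions are averaged
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)
preds = forest.predict(X[:5])
```

Because every prediction is an average of training targets, a random forest cannot extrapolate beyond the range of target values it has seen.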
Neural Network MLP: A multilayer perceptron (MLP) is a deep, artificial neural network composed of more than one perceptron. MLPs consist of an input layer to receive the signal, an output layer that makes a decision or prediction about the input, and, in between those two, an arbitrary number of hidden layers that are the true computational engine of the MLP. MLPs with a single hidden layer, given enough hidden units, are capable of approximating any continuous function.
Multilayer perceptrons are often applied to supervised learning problems. They train on a set of input-output pairs and learn to model the correlation (or dependencies) between those inputs and outputs. Training involves adjusting the parameters, or the weights and biases, of the model in order to minimize error. Backpropagation is used to make those weight and bias adjustments relative to the error, and the error itself can be measured in a variety of ways, including by root mean squared error (RMSE).
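A minimal scikit-learn sketch of a one-hidden-layer MLP on synthetic data (the architecture and max_iter are illustrative choices, not the exact settings used in this post):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# A simple nonlinear target over 2 features
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = X[:, 0] ** 2 + X[:, 1]

# One hidden layer of 50 units; weights are fit by backpropagation
mlp = MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000, random_state=0)
mlp.fit(X, y)
preds = mlp.predict(X)
```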
I wrote about some Deep learning models built on MLP here.
Like I said previously, I evaluated my models with MAE and MAPE and plotted the errors for ease of interpretation. See below.
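The comparison itself can be sketched as a loop over fitted models on a holdout split, computing MAE and MAPE for each; the three models and synthetic data here are placeholders, not the exact pipeline from this post:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the housing data
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Ridge": Ridge(),
    "KNN": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    scores[name] = {
        "MAE": mean_absolute_error(y_test, preds),
        # MAPE as a percentage; assumes no zero targets
        "MAPE": float(np.mean(np.abs((y_test - preds) / y_test))) * 100,
    }
```

The resulting scores dictionary is what would then be plotted, e.g. as a bar chart of MAE per model.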
From the plot above, SVR had the lowest error and KNN had the highest.
You can find the code in my Github here.