In the previous part of the Machine Learning series, I covered the basic concepts. In this part, we move on to the models themselves, starting with linear regression, one of the easiest models to understand. A warning from the start: although the model may seem simple, our work does not end with a basic understanding of it. As we go through the details, we will see that things can get complicated even in seemingly simple situations.
Linear Regression
Regression is the model used when the dependent variable, that is, the value we are trying to predict, is a numerical variable. It is called linear because it assumes a linear relationship between the dependent variable and the independent variables. We can explain what this means through the model itself:

Y = B0 + B1X
In the formula above, Y is the dependent variable and X is the independent variable. B0 is the intercept (the constant term) and B1 is the coefficient of X. Let’s go through an example. We are trying to estimate house prices, and we want to make the estimate based on the size of the house in square metres. In this case:
Y -> House price (thousand TL)
X -> Size of the house (m²)
In order to see the relationship between Y and X, let’s assume we have a small data set as follows:
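For illustration, such a dataset might look like the following in Python. The numbers are entirely made up, chosen only so that price grows roughly linearly with size:

```python
import numpy as np

# Size of the house in m² (independent variable X)
X = np.array([45, 54, 60, 68, 75, 80, 90, 100, 110, 120])

# House price in thousand TL (dependent variable Y) - invented values
Y = np.array([1340, 1500, 1550, 1700, 1740, 1830, 1980, 2060, 2230, 2330])
```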
When we examine the data set, we see that the price of the house increases as its size increases. If this increase is linear, that is, if we think the price rises by a constant amount B1 for every 1 m² increase in size, linear regression will be sufficient to estimate the price of the house.
To find the B0 and B1 values, we first need to understand what success means for a machine learning model, because the coefficients are adjusted so that the model performs as well as possible according to the relevant success criterion. In the first article, I mentioned which success criteria we use when the dependent variable is numeric. Of course, nowadays these values are no longer calculated manually; the optimum B0 and B1 are found according to a success criterion given to the system. The success criterion in linear regression is the residual sum of squares (RSS). Our goal is to find the B0 and B1 values that minimise the following:

RSS = Σ (Yi - Ŷi)² = Σ (Yi - (B0 + B1Xi))²

Here Yi is the actual value and Ŷi the predicted value for observation i.
For n different observations (for example, n = 10 for the dataset above), we try to make this sum of squared differences between the actual and predicted values as small as possible. Using the least squares method, we arrive at the values of B0 and B1. In fact, after taking the partial derivatives of the RSS formula with respect to both B0 and B1 and setting them to zero, a system of two equations in two unknowns is solved, giving:

B1 = Σ (Xi - X̄)(Yi - Ȳ) / Σ (Xi - X̄)²
B0 = Ȳ - B1X̄

where X̄ and Ȳ are the means of X and Y.
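As a minimal sketch, these formulas can be applied directly to the made-up dataset above. Because the numbers are invented, the result lands near, but not exactly on, the equation quoted further below:

```python
import numpy as np

# Made-up dataset from the sketch above
X = np.array([45, 54, 60, 68, 75, 80, 90, 100, 110, 120])
Y = np.array([1340, 1500, 1550, 1700, 1740, 1830, 1980, 2060, 2230, 2330])

x_bar, y_bar = X.mean(), Y.mean()

# Closed-form least squares estimates for the slope and intercept
b1 = np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)
b0 = y_bar - b1 * x_bar
print(f"Y = {b0:.0f} + {b1:.1f}X")
```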
For the example above, when we build our model, we end up with an equation like this:
Y = 784 + 13X
Let us briefly interpret this equation. Even when the size of the house is close to 0 m², the house has a base price of 784. Each additional square metre adds 13 units to the price; in other words, we expect a 55 m² house to be 13,000 TL more expensive than a 54 m² house. For a house with features similar to those in the dataset above, whose size in m² we know but whose price we do not, we can make a prediction and compare. Although it may seem very simple, such a model will probably yield better results than expected for houses that are close to each other in size and similar in their other characteristics.
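For instance, for a hypothetical 100 m² house the model would predict Y = 784 + 13 × 100 = 2,084 (thousand TL).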
Speaking of results, let’s also see, in terms of our success criterion, how much difference there is between the actual values and the predictions:
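As a sketch with the made-up numbers from above, we can print the actual versus predicted values and compute the error metrics:

```python
import numpy as np

# Made-up dataset again
X = np.array([45, 54, 60, 68, 75, 80, 90, 100, 110, 120])
Y = np.array([1340, 1500, 1550, 1700, 1740, 1830, 1980, 2060, 2230, 2330])

# Predictions from the fitted line in the text
Y_hat = 784 + 13 * X

residuals = Y - Y_hat
rss = np.sum(residuals ** 2)       # residual sum of squares
mse = np.mean(residuals ** 2)      # mean squared error
rmse = np.sqrt(mse)                # root mean squared error, in price units

for x, y, yh in zip(X, Y, Y_hat):
    print(f"size={x:>3} m²  actual={y}  predicted={yh}  error={y - yh:+}")
print(f"RSS={rss:.0f}  MSE={mse:.0f}  RMSE={rmse:.1f}")
```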
We did not get a very bright result for this example; this happened because I completely made up the data. When we look at the mean squared error, we see a value well above the house prices, though note that the MSE is in squared units, so for a like-for-like comparison it is better to take its square root. Normally it would be more appropriate to compare the success of the model against other models, but to get some idea of how consistent our model is, we can compare the root mean squared error with the mean of the values in our dataset; at a minimum, we expect the error to be the lower of the two. For the same problem, the performance of models built with methods other than linear regression (e.g. decision trees) can be compared in the same way.
If you ask how we can improve the performance of the existing model, the answer is to increase the complexity of the model; in other words, we can increase predictive power by adding new variables. But this has a drawback: a more complex model is harder to explain to a non-technical audience, and sometimes you may find it difficult to explain the model even to yourself. If interpretability is not a priority, you can still opt for complexity without losing much. Suppose you don’t need to explain that each 1 m² increase adds B1 units to the price, and you don’t care exactly how the increase happens; if your only goal is to estimate the house price more accurately, add whatever variables you want. Even if you only have the size in m², you can observe how much the model improves by adding its square, cube and higher powers to the model alongside X itself, as in the sketch below.
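A minimal sketch of this idea, assuming scikit-learn is available; PolynomialFeatures generates X² and X³ from the single size column:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.array([45, 54, 60, 68, 75, 80, 90, 100, 110, 120]).reshape(-1, 1)
Y = np.array([1340, 1500, 1550, 1700, 1740, 1830, 1980, 2060, 2230, 2330])

# Generate X² and X³ alongside X itself
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)

model = LinearRegression().fit(X_poly, Y)
print("R² with X, X², X³:", model.score(X_poly, Y))
```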
The example I gave above is “simple” linear regression, with a single variable. If you have more than one piece of information for estimating the house price, such as the number of rooms or which floor the house is on, you can build a “multiple” linear regression model instead. The success evaluation method is no different, and the coefficients are found the same way (by least squares).
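A minimal sketch of such a model, with invented values for the extra features (the room counts and floors below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features: size (m²), number of rooms, floor - all made up
X = np.array([
    [45, 1, 2],
    [54, 2, 1],
    [60, 2, 3],
    [68, 3, 2],
    [75, 3, 4],
    [80, 3, 1],
    [90, 4, 5],
    [100, 4, 2],
    [110, 5, 3],
    [120, 5, 6],
])
Y = np.array([1340, 1500, 1550, 1700, 1740, 1830, 1980, 2060, 2230, 2330])

# Same least squares idea, now with one coefficient per feature
model = LinearRegression().fit(X, Y)
print("B0 (intercept):", model.intercept_)
print("B1, B2, B3 (size, rooms, floor):", model.coef_)
```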
In multiple linear regression, if the independent variables are correlated with each other, the relationship between the dependent variable and an independent variable may not come through cleanly. For example, the coefficient (B1) of an independent variable (say X1) may be negative in our model, suggesting a negative relationship with the dependent variable, while the same coefficient comes out positive when we build a model using X1 alone. In that case we cannot directly interpret the coefficients of the first model. In other words, as I mentioned above, the explainability of the model decreases as the model becomes more complex. The small sketch below shows how this sign flip can happen with two strongly correlated variables.
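A minimal, fully synthetic demonstration of the sign flip; x2 is constructed to be almost identical to x1, and y is generated so that its true dependence on x1 is negative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Two strongly correlated independent variables (synthetic data)
x1 = rng.uniform(50, 120, size=100)
x2 = x1 + rng.normal(0, 2, size=100)      # x2 is almost identical to x1

# y truly depends on x1 negatively and on x2 positively
y = -1.0 * x1 + 3.0 * x2 + rng.normal(0, 5, size=100)

joint = LinearRegression().fit(np.column_stack([x1, x2]), y)
alone = LinearRegression().fit(x1.reshape(-1, 1), y)

print("coefficient of x1 in the joint model:", joint.coef_[0])  # roughly -1 (negative)
print("coefficient of x1 on its own:", alone.coef_[0])          # roughly +2 (positive)
```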