![](https://crypto4nerd.com/wp-content/uploads/2023/02/1zCY2y_hLvXYP684VYvmcNA.png)
Engineers frequently develop mathematical models by empirically fitting a set of x values to an observed y with various curves and picking the curve with the lowest error or the highest R² value. The following dataset is an example:
In the real context, x is the rate at which water flows through a filter medium, and y is the observed head loss through the filter column. Intuitively, the plot shows that the higher the filtration rate (i.e. the x), the higher the head loss (i.e. the y). Engineers can therefore develop a mathematical model that correlates the filtration rate (the x) to the head loss (the y), so that in the future, given any x, they can predict the corresponding y.
In this article, I’m not going to propose the simplest model, y = mx + b, which any spreadsheet can fit easily. In this case, we can also see that a linear relationship cannot really describe the relationship between the two parameters, because the correlation is slightly parabolic. What we are trying to do is develop a model of the form y = ax² + bx + c:
The parabolic function makes more sense than the linear one because, according to fluid mechanics, as the flow rate increases the flow tends to become turbulent, which is quite different from the laminar flow seen at low flow rates. Common spreadsheets cannot do this, at least not with such accurate parameters. Let’s do it in Python using the gradient descent and normal equation algorithms.
First, let’s import relevant libraries and download the dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

link = 'https://raw.githubusercontent.com/waterprofessor/wre/main/data/crf.csv'
df = pd.read_csv(link)
For this demo analysis, we take the first 19 rows only, as they correspond to a complete run of an experiment. For x, we take the filtration rate column, while for y, we take the actual loss column. Plotting x vs. y returns the first figure above.
x = df.loc[0:18,'Filt rate(m3/m2h)'].values
y = df.loc[0:18,'Actual loss (m)'].values
plt.scatter(x,y)
plt.xlabel('x')
plt.ylabel('y')
plt.show()
Now, we need to construct a matrix X with dimensions 19×3. The 19 rows are easy to understand: that’s the number of tests. The 3 columns mean we need 3 features, which map to the coefficients c, b, and a, respectively, in the formula:

y = ax² + bx + c
To construct the 3 feature columns, we name them x0, x1, and x2. The formula becomes:

y = a·x2 + b·x1 + c·x0
We can see x0 is just a column of ones, x1 is a column of the original x (i.e. the filtration rate), and x2 is a newly constructed column obtained by squaring x. Let’s construct x0, x1, and x2 and then stack them into the matrix X we desire. X will have a shape of 19×3.
x0 = np.ones((19,1))
x1 = x.reshape((19,1))
x2 = (x**2).reshape((19,1))
x1_norm = x1/(np.max(x1)-np.min(x1))
x2_norm = x2/(np.max(x2)-np.min(x2))
X = np.hstack((x0,x1_norm,x2_norm))
y = y.reshape((19,1))
As you can see, I did one more step while constructing X: I scaled each feature column by its range (max − min). This step makes sure there is no huge difference between the magnitudes of the different columns, as each column now spans a range of exactly 1. Otherwise, the gradient descent algorithm may converge very slowly or not at all.
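To see the effect of this scaling, here is a tiny sketch with made-up numbers (the array values are hypothetical, not from the dataset). Dividing each column by its range makes every column span exactly 1, even though the individual values are not necessarily confined to [0, 1]:

```python
import numpy as np

# Hypothetical feature columns with very different magnitudes:
# x1 on the order of 10, x2 = x1**2 on the order of 100.
x1 = np.array([2.0, 5.0, 8.0, 11.0]).reshape((-1, 1))
x2 = x1 ** 2

# Scale each column by its range (max - min), as in the article.
x1_norm = x1 / (np.max(x1) - np.min(x1))
x2_norm = x2 / (np.max(x2) - np.min(x2))

# Both scaled columns now span a range of exactly 1.
print(np.max(x1_norm) - np.min(x1_norm))  # 1.0
print(np.max(x2_norm) - np.min(x2_norm))  # 1.0
```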
After X and y are defined, we can initialize the parameters c, b, and a. We use a 3×1 matrix named theta to contain all of them: theta[0,0] is c, theta[1,0] is b, and theta[2,0] is a. We initialize theta with random values, which will be tweaked by the gradient descent algorithm later.
theta = np.random.randn(3).reshape((3,1))
Now we can start writing the gradient descent algorithm. The cost function we use is the mean squared error (MSE). Given h as the hypothesis, i.e. the function of x with parameters theta, the MSE over the m = 19 samples is:

MSE = (1/m) · Σ (h(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
The MSE can also be denoted as J, the symbol for the cost function. To find the direction in which to tweak the parameters, we need to compute the derivative of J with respect to each theta. Then we figure out the length of the step (alpha) in that direction. Our goal is to reach the bottom of the valley as fast as possible. The alpha cannot be too small; otherwise it would take a very long time to reach the bottom, and the descent may get stuck in a local minimum. The alpha cannot be too big either; otherwise the cost function J cannot converge and may even diverge. The alpha is also called the learning rate in the realm of machine learning. Typical alpha values to try are 0.0001, 0.001, 0.01, and 0.1. You could also try values between them, such as 0.0003, 0.003, and 0.03, for fine tuning.
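As a quick illustration of how the learning rate affects convergence, here is a sketch on a made-up one-dimensional dataset (the data, noise level, and iteration count are all hypothetical, chosen only for the demo). A very small alpha barely moves in 500 iterations, while a larger, still-stable alpha drives the MSE down much further:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data with a known slope of 3, plus a little noise.
x = np.linspace(0.0, 1.0, 20).reshape((-1, 1))
y = 3.0 * x + rng.normal(0.0, 0.01, x.shape)
X = np.hstack((np.ones_like(x), x))

def final_mse(alpha, n_iter=500):
    """Run plain gradient descent with the given learning rate; return final MSE."""
    theta = np.zeros((2, 1))
    for _ in range(n_iter):
        h = X @ theta
        theta -= alpha * (X.T @ (h - y)) / len(h)
    return float(np.mean((X @ theta - y) ** 2))

for alpha in (0.0001, 0.001, 0.01, 0.1):
    print(f"alpha={alpha}: final MSE = {final_mse(alpha):.6f}")
```

The larger rates reach a far lower MSE within the same iteration budget; pushing alpha well beyond the stable range would instead make the MSE blow up.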
The logic is simple, except that the partial derivatives with respect to each theta seem difficult to compute. Fortunately, we don’t have to work through the calculus to derive the formulas ourselves, because they are well known and we can just use them:

theta_0 := theta_0 − α · (1/m) · Σ (h − y)
theta_j := theta_j − α · (1/m) · Σ (h − y) · x_j,  for j = 1, 2

As you can see, the update for theta_0 differs from the updates for the other theta values (its feature x0 is a constant 1, so no feature multiplier appears). Make sure you don’t mix them up. Here is the code implementing the algorithm.
# Gradient Descent Method
i_li, mse_li = [], []
alpha = 0.01
best_theta, best_mse = None, 1e6
for i in range(2000):
    h = X @ theta
    mse = np.sum((h - y)**2) / len(h)
    if mse < best_mse:
        # copy() matters: theta is updated in place below, so storing a
        # reference would make best_theta silently track later updates
        best_mse, best_theta = mse, theta.copy()
    i_li.append(i)
    mse_li.append(mse)
    theta[0,0] = theta[0,0] - alpha * (np.sum(h - y) / len(h))
    theta[1,0] = theta[1,0] - alpha * (np.sum((h - y) * X[:,[1]]) / len(h))
    theta[2,0] = theta[2,0] - alpha * (np.sum((h - y) * X[:,[2]]) / len(h))
plt.plot(i_li,mse_li,'k-')
plt.xlabel('No. of Iterations')
plt.ylabel('MSE')
plt.show()
We did 2000 iterations with an alpha value of 0.01. We also track the cost function MSE after each iteration and keep the best theta values. The change of the cost function with the number of iterations is plotted as follows, so that we can make sure the gradient descent works as it should.
The cost function MSE decreased nicely. This is what we were expecting, and the code works well. Now let’s visualize our results. Since we scaled the two input columns x1 and x2 by their ranges before implementing the gradient descent algorithm, the derived theta values are based on the scaled columns. We need to convert the theta values back, by dividing them by their corresponding ranges, to make them match the original data.
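To see why dividing by the ranges recovers coefficients on the original scale, write r₁ = max(x1) − min(x1) and r₂ = max(x2) − min(x2). The model was fitted to the scaled features, so:

```latex
h(x) = \theta_0 + \theta_1 \frac{x}{r_1} + \theta_2 \frac{x^2}{r_2}
     = \underbrace{\theta_0}_{c} + \underbrace{\frac{\theta_1}{r_1}}_{b}\,x + \underbrace{\frac{\theta_2}{r_2}}_{a}\,x^2
```

Matching coefficients term by term gives c = theta[0,0], b = theta[1,0]/r₁, and a = theta[2,0]/r₂, which is exactly what the code below computes.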
c = best_theta[0,0]
b = best_theta[1,0]/(np.max(x1)-np.min(x1))
a = best_theta[2,0]/(np.max(x2)-np.min(x2))

plt.scatter(x,y)
ax = plt.gca()
plt.plot(x,a*x**2+b*x+c,'k-')
plt.text(0.01,0.95,'$y = ax^2+bx+c$',transform=ax.transAxes)
plt.text(0.01,0.9,'Gradient Descent Solution:',transform=ax.transAxes)
plt.text(0.01,0.85,f'$y = ({a:.5e})x^2+({b:.5f})x+({c:.5f})$',transform=ax.transAxes)
plt.xlabel('x')
plt.ylabel('y')
plt.show()
After implementing the gradient descent algorithm, let’s try the normal equation method and see whether the two methods return the same result. Note that the normal equation method is an analytical method: the theta values are computed in closed form with the following formula:

theta = (XᵀX)⁻¹ · Xᵀ · y
Converting it to code, we have the following:
# Normal Equation method
theta_ne = np.linalg.pinv(X.T@X)@(X.T)@y
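As a sanity check on this one-liner, we can fit made-up data generated from a known quadratic and compare against np.polyfit, which solves the same least squares problem (the data here are hypothetical, not the filtration dataset):

```python
import numpy as np

# Hypothetical data following a known quadratic: y = 0.5x^2 + 2x + 1.
x = np.linspace(1.0, 10.0, 19)
y = 0.5 * x**2 + 2.0 * x + 1.0

# Normal equation on the raw (unscaled) design matrix [1, x, x^2].
X = np.column_stack((np.ones_like(x), x, x**2))
theta_ne = np.linalg.pinv(X.T @ X) @ X.T @ y

# np.polyfit returns coefficients highest power first: [a, b, c].
coeffs = np.polyfit(x, y, 2)

print(theta_ne)      # ~ [1.0, 2.0, 0.5]  (c, b, a)
print(coeffs[::-1])  # ~ [1.0, 2.0, 0.5]
```

Both recover the known coefficients, which gives some confidence that the normal equation line does what we expect.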
Isn’t it much simpler than gradient descent? Now let’s visualize the results:
c1 = theta_ne[0,0]
b1 = theta_ne[1,0]/(np.max(x1)-np.min(x1))
a1 = theta_ne[2,0]/(np.max(x2)-np.min(x2))

plt.scatter(x,y)
ax = plt.gca()
plt.plot(x,a1*x**2+b1*x+c1,'r-')
plt.text(0.01,0.95,'$y = ax^2+bx+c$',transform=ax.transAxes)
plt.text(0.01,0.9,'Normal Equation Solution:',transform=ax.transAxes)
plt.text(0.01,0.85,f'$y = ({a1:.5e})x^2+({b1:.5f})x+({c1:.5f})$',transform=ax.transAxes)
plt.xlabel('x')
plt.ylabel('y')
plt.show()
The theta values are obviously different. Stacking this result on top of the gradient descent result, we can see the difference easily:
# Comparison of the two methods
plt.scatter(x,y)
ax = plt.gca()
plt.plot(x,a*x**2+b*x+c,'k-',label='Gradient Descent')
plt.plot(x,a1*x**2+b1*x+c1,'r-',label='Normal Equation')
plt.text(0.01,0.95,'$y = ax^2+bx+c$',transform=ax.transAxes)
plt.text(0.01,0.9,'Gradient Descent Solution:',transform=ax.transAxes)
plt.text(0.01,0.85,f'$y = ({a:.5e})x^2+({b:.5f})x+({c:.5f})$',transform=ax.transAxes)
plt.text(0.01,0.8,'Normal Equation Solution:',transform=ax.transAxes)
plt.text(0.01,0.75,f'$y = ({a1:.5e})x^2+({b1:.5f})x+({c1:.5f})$',transform=ax.transAxes)
plt.xlabel('x')
plt.ylabel('y')
plt.legend(loc='lower right')
plt.show()
Since both methods work just fine, which one would you prefer to use? Typically, it depends on how large your dataset is. If your dataset is very large, say 1 million rows by 1 million columns, gradient descent is significantly better, because computing the inverse of such a large matrix is computationally expensive. Also, if you have fewer rows than columns, the normal equation is not the best choice, because XᵀX would be non-invertible.
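There is also a middle ground worth knowing about: np.linalg.lstsq solves the least squares problem directly, without explicitly inverting XᵀX, and copes gracefully with rank-deficient design matrices. A minimal sketch on made-up data (the matrix sizes, coefficients, and noise level are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical tall design matrix: many more rows than columns.
X = rng.normal(size=(1000, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + rng.normal(0.0, 0.01, 1000)

# lstsq minimizes ||X @ theta - y|| via an SVD-based solver,
# avoiding the explicit (X^T X)^-1 of the normal equation.
theta, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(theta)  # close to [1.0, -2.0, 0.5]
```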
I hope you enjoyed the article. I periodically publish original articles on the applications of machine learning and deep learning in the realms of quantitative trading, finance, and engineering. If you’ve read this far and would like to see more of my writing, please follow me on Medium (https://medium.com/@decent_trade) or Twitter (https://twitter.com/decent_trade).