![](https://crypto4nerd.com/wp-content/uploads/2023/10/1lfhA6ztOA_JGJzWr2cWCOQ.png)
“What would the stock price be tomorrow? Will it rise or fall? Should I buy, sell or hold?” These are questions that stock investors often ponder. However, stock price prediction is an exceptionally challenging task, and the reason lies in the nature of stock price movement.
Stock prices fluctuate in response to external factors such as macroeconomic variables, which is why stock price data is considered nonlinear. Simple linear regression won’t cut it for such data, so the solution that I, like many others, propose is a deep-learning model called Long Short-Term Memory (LSTM).
LSTM is a powerful model commonly used for time series prediction. Thanks to its gating mechanism, the model can learn long-term patterns in the stock price without suffering from vanishing or exploding gradients.
The aim of this research is to build an LSTM model that robustly and accurately learns the historical pattern of the INFOBANK15 index. The best model is then used to predict the future price of the index.
The INFOBANK15 index is a stock index consisting of 15 of Indonesia’s leading banks, selected for good fundamentals, large market capitalization and high liquidity. From the index we use the historical closing and opening prices. The other variables, namely macroeconomic variables, are selected based on their significance to the overall economy. The data cover January 2023 until August 2023.
After some research, the selected variables are:
- Close Price = the daily closing index of the INFOBANK15
- Opening Price = the daily opening price of the INFOBANK15
- Interest Rate = Indonesia’s monthly interest rate
- Gold Price = the daily gold price (in USD)
- USD/IDR = the USD-to-IDR exchange rate
- Consumer Confidence = consumer sentiment regarding current income, job availability and general economic conditions
The sources used to collect the variables are listed in the table below:
Since the goal is to predict the index, Close Price is the target variable. The other variables also have some level of significance to the target variable, which can be illustrated with a heatmap.
The strength of each relationship can be read from the intensity of the colors: dark green means the two variables are positively correlated, while dark red means they are negatively correlated. This lets us assess which variables have high or low correlation with the target variable.
As an example, Consumer Confidence and Close Price have a correlation of 0.64, which is fairly positive; hence Consumer Confidence may be a useful variable for predicting the index. However, very high correlation can also imply a multicollinearity problem, which undermines the statistical significance of the independent variables. A correlation of 0.9 is used as the threshold for deciding whether an independent variable suffers from multicollinearity. Based on the figure above, Opening Price is removed because it has a correlation of 0.96 with Close Price; none of the other variables exceed the threshold.
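The threshold rule above can be sketched in a few lines of pandas. This is an illustration only: the column names follow the article, but the numbers below are made up, not the actual INFOBANK15 data.

```python
import pandas as pd

# Toy frame standing in for the collected dataset; values are invented
# so that "open" tracks "close" closely while the others do not.
df = pd.DataFrame({
    "close":   [1200, 1210, 1205, 1220, 1215, 1230],
    "open":    [1198, 1209, 1207, 1218, 1216, 1228],
    "gold":    [1900, 1890, 1920, 1880, 1930, 1910],
    "usd_idr": [15000, 15100, 14950, 15200, 15050, 14900],
})

corr = df.corr()

# Drop any independent variable whose absolute correlation with the
# target "close" exceeds the 0.9 multicollinearity threshold.
threshold = 0.9
to_drop = [c for c in corr.columns
           if c != "close" and abs(corr.loc[c, "close"]) > threshold]
df_reduced = df.drop(columns=to_drop)
print(to_drop)  # only "open" crosses the threshold in this toy data
```

With the real dataset the same filter removes Opening Price (correlation 0.96 with Close Price) and keeps the rest.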
There are a total of 147 rows of data collected from January 2023 until mid-August 2023. The dataset is split into training and testing data: the training data takes 80% of the total dataset and the testing data takes the remaining 20%.
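Because the data is a time series, the split has to be chronological rather than random. The article does not show its preprocessing code, so the following is my own sketch of the split plus the sliding-window step an LSTM needs, assuming 147 rows, 5 features, and a 5-day window (matching the `timestep` and shapes used in the evaluation code later on):

```python
import numpy as np

# Stand-in for the scaled feature matrix: 147 daily rows x 5 features,
# with the close price assumed to be in column 0.
data = np.random.rand(147, 5)

split = int(len(data) * 0.8)           # first 80% of the timeline
train, test = data[:split], data[split:]
print(train.shape, test.shape)          # (117, 5) (30, 5)

def make_windows(arr, timestep=5):
    """Turn a (days, features) matrix into LSTM inputs:
    5 consecutive days in, the next day's close out."""
    X, y = [], []
    for i in range(len(arr) - timestep):
        X.append(arr[i:i + timestep])   # window of 5 days, all features
        y.append(arr[i + timestep, 0])  # next day's close (column 0)
    return np.array(X), np.array(y)

X_train, y_train = make_windows(train)
print(X_train.shape)                    # (112, 5, 5)
```

The `(samples, 5, 5)` shape is what `input_shape=(timestep, X_train.shape[2])` in the model code expects.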
The model is a single-layer LSTM, with variations in the number of nodes, learning rate, batch size and optimizer:
Nodes = [10, 50, 100, 150]
Learning rate = [0.1, 0.01, 0.001]
Batch Size = [4, 8, 16]
Optimizer = [Adam, Adagrad]
There are a total of 24 combinations to explore. Every variation is evaluated on the training data; we call this step hyperparameter tuning. The variation that gives the best accuracy is then used to model the test data, and the model that does well on the test data is used to forecast the future value of the INFOBANK15 index.
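The tuning loop itself is not shown in the article, so here is a sketch of how the combinations can be enumerated. Note the article reports 24 combinations rather than the full 4 × 3 × 3 × 2 = 72 product of the lists above; this sketch assumes nodes × batch size × optimizer (4 × 3 × 2 = 24) with the learning rate fixed, and `train_and_score` is a dummy stand-in for fitting an LSTM and returning its training RMSE:

```python
from itertools import product

nodes_grid = [10, 50, 100, 150]
batch_grid = [4, 8, 16]
optimizer_grid = ["adam", "adagrad"]

def train_and_score(nodes, batch, optimizer):
    # Placeholder: in the real experiment this would build, fit and
    # evaluate the single-layer LSTM on the training data.
    return abs(nodes - 50) + batch  # dummy score, for illustration only

# Score every combination and rank them by (dummy) RMSE, best first.
results = sorted(
    ((train_and_score(n, b, o), n, b, o)
     for n, b, o in product(nodes_grid, batch_grid, optimizer_grid)),
    key=lambda r: r[0],
)
best_score, best_nodes, best_batch, best_opt = results[0]
print(len(results), best_nodes, best_batch)
```

In the real run, the top few combinations from this ranking become the candidate models carried forward to the test data.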
After exploring all 24 combinations, the models that give the best accuracy on the training dataset are:
The three models above are further evaluated using the test data with the code below:
```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import mean_squared_error

# (batch size, number of LSTM units) for the three best models
best_hyperparameter = [[4, 30], [8, 50], [16, 150]]
rmse = np.zeros(3)

for increment, (i, j) in enumerate(best_hyperparameter):
    # Build and train a single-layer LSTM with the given hyperparameters
    model2 = Sequential()
    opt = Adam(learning_rate=0.01)
    model2.add(LSTM(units=j, input_shape=(timestep, X_train.shape[2])))
    model2.add(Dense(1, activation='linear'))
    model2.compile(optimizer=opt, loss='mean_squared_error')
    history = model2.fit(X_train, y_train, epochs=50, batch_size=i)

    # Evaluate on the test set, back in the original index scale
    test_predict = model2.predict(X_test)
    test_predict = scaler1.inverse_transform(test_predict.reshape(-1, 1))
    test_actual = scaler1.inverse_transform(y_test)
    rmse[increment] = mean_squared_error(test_actual, test_predict, squared=False)

    model2.save(f'lstm{increment}_model.h5')
```
After the code is run, we have three RMSE (root mean squared error) values corresponding to the three models. Comparing the RMSE values tells us which model is best.
It turns out that model lstm1 gives the best accuracy compared with lstm0 and lstm2. Lstm1 uses 50 neurons, a batch size of 8 and a learning rate of 0.01 to produce an RMSE of less than 12.5. Keep in mind that the unit of RMSE follows the unit of its target variable, here the closing price of the INFOBANK15 index. Because an index doesn’t have a unit, for convenience let’s call the unit a point. If a model has an RMSE of 12.5, its predictions are off by 12.5 points from the actual data on average. The predictions from lstm1 are therefore off by less than 12.5 points, or, to be precise, by 10.94.
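To make the “points” interpretation concrete, here is a tiny check with made-up numbers (not the article’s data) that the hand-computed square root of the mean squared error matches what `mean_squared_error(..., squared=False)` returns in the evaluation code:

```python
import numpy as np

# Four invented index values and predictions, in "points"
actual = np.array([1200.0, 1210.0, 1195.0, 1220.0])
predicted = np.array([1190.0, 1215.0, 1200.0, 1210.0])

# RMSE = sqrt(mean((actual - predicted)^2)), same unit as the index
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(round(rmse, 2))  # → 7.91
```

So an RMSE of 7.91 here means the toy predictions miss the toy actuals by about 7.9 points on average, exactly how the 10.94 for lstm1 should be read.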
As you can see in the first graph, model lstm1 performs very well on the training dataset: the predicted values (orange line) are almost identical to the actual values (blue line). However, notice how the orange line sits slightly above the blue line, which means the model may overestimate the actual values. Lstm1 also does fairly well on the test data, although the gap is noticeably larger than on the training data. That is to be expected, because the test data is not part of the data the model was trained on; still, an RMSE of 10.94 on the test data tells us the error is acceptable and demonstrates the robustness of the model when faced with unfamiliar data.
Finally, after finding the best model, we can use it to predict the future value of the index. Our goal is to predict the INFOBANK15 index from 19 August 2023 until the end of the year. The forecast can then be used to gauge the performance of the banking sector as a whole and to plan our investments.
Forecasting with an LSTM is, in my opinion, a little complex, but it is still possible. However much historical data we feed it, the LSTM produces only one outcome: the next day’s price. For example, if we use the closing prices of the last 5 days, the model predicts the price on the 6th day; to predict the 7th day, the model needs the data from the 2nd day until the 6th day. Because the available data end on August 18th, the model can only predict August 19th directly. To push beyond that, we append the model’s prediction to the original dataset, use the result as the new input, and repeat this step until the model has predicted every day up to the end of the year.
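The rolling scheme just described can be illustrated with a toy one-variable example before looking at the real code. Here `toy_model` is a dummy stand-in for the LSTM (it just averages its window), and the numbers are invented:

```python
# Recursive one-step forecasting: each prediction is appended to the
# window so it becomes part of the input for the next step.
def toy_model(window):
    return sum(window) / len(window)   # dummy "next day" prediction

history = [1200, 1205, 1210, 1208, 1215]   # last 5 known closes
window = list(history)
forecast = []
for _ in range(3):                     # roll 3 days past the data
    nxt = toy_model(window[-5:])       # model only ever sees 5 days
    forecast.append(nxt)
    window.append(nxt)                 # feed the prediction back in
print(forecast)
```

The real forecasting loop below does exactly this, except the window holds 5 variables per day, which raises the problem discussed next.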
There is another problem to deal with. Lstm1 was trained on 5 variables (Price, Consumer Confidence, USD/IDR, Gold Price, Interest Rate), so the model expects 5 variables as input. However, lstm1 produces only one output, the price/index, so when we append a prediction to the input we are still missing the other four variables (Consumer Confidence, USD/IDR, Gold Price, Interest Rate). My approach to filling in the missing variables is to generate a number from the normal distribution of each of the four variables, so that the input always has 5 variables. Here is the code I used to do that:
```python
list_predict = []
future_seq = X_test.copy()

for i in range(135):  # roll forward until the end of the year
    next_seq = best_model.predict(future_seq)
    list_predict.append(next_seq[-1, -1])

    # Sample the four exogenous variables from their (scaled) normal
    # distributions, clipped to stay inside a plausible range
    random_usd = np.clip(np.random.normal(mean_usd, std_usd), 0.7, 1.0)
    random_emas = np.clip(np.random.normal(mean_emas, std_emas), 0.7, 1.0)
    random_CC = np.clip(np.random.normal(mean_CC, std_CC), 0.8, 1.0)
    random_interest = np.clip(np.random.normal(mean_interest, std_interest), 0.8, 1.0)

    # Append the new day (predicted close + sampled variables) and drop
    # the oldest day so the window stays 5 days long
    additional_values = np.array([[next_seq[-1, -1], random_usd, random_emas,
                                   random_CC, random_interest]])
    future_seq = np.concatenate((future_seq[-1, 1:, :], additional_values))
    future_seq = future_seq.reshape(1, 5, 5)

list_predict = np.array(list_predict)
true_extrapolation = scaler1.inverse_transform(list_predict.reshape(-1, 1))
```
The variable true_extrapolation contains the forecast values. We can then visualize it alongside the test data for clarity.
Based on the graph above, it seems that from 19 August the index will drop from 1220 to near 1190 (a -2.45% change) and then move sideways until the end of the year, ranging between 1180 and 1200. This implies that the banking sector is relatively stable but also in a period of uncertainty. It also presents a potential breakout for the banking sector, so I suggest that investors and traders stay up to date with the relevant news and events that could impact the sector.
- Out of the three models, lstm1 is the most accurate, with an RMSE score of 10.94. The model uses a combination of 50 neurons, a batch size of 8 and a learning rate of 0.01. By contrast, lstm0 and lstm2 have RMSE scores above 15.
- Model lstm1 predicts a small downtrend from mid-August until the end of August, followed by a sideways pattern from early September until the end of the year.
The aspiration of equity traders, individual investors, and portfolio managers is to accurately forecast stock prices and, as a result, anticipate potential returns. This study shows the encouraging potential of the LSTM neural network for bounding the range of uncertainty when predicting stock prices. Nevertheless, it is not advisable to make investment decisions relying solely on this research. Investors are urged to conduct their own thorough research and to take into account their risk tolerance under different market scenarios. Sound forecasting depends not just on the results of a particular model but also on the unpredictable nature of the stock market, especially in times of geopolitical tension, disruptions to the global supply chain, conflicts, pandemics, and other such situations.