![](https://crypto4nerd.com/wp-content/uploads/2023/12/1FOmBfXstbNbbM7KTUooQIg.jpeg)
Now, let’s talk about incorporating machine learning into trading. Here’s a summary of what I did for the project:
- First, I collected the stock market data using Python’s `yfinance` library:
– stock data was collected for the top 20 companies of the S&P 500
– daily prices from 2020 through 2022 were taken for each company
- After data collection, I calculated the market indicator values for that time frame. In total, I computed 25 different market indicators, mainly designed for the following four tasks:
(1) identifying the recent trend
(2) measuring stock strength
(3) measuring stock volatility
(4) measuring trading volume
- Then I did some data preprocessing, which included data cleaning, normalization and integration, transformation into time-series data, validation, and feature selection.
- After that, I designed a Gated Recurrent Unit (GRU) model and a Long Short-Term Memory (LSTM) model for predicting future stock values, based on the previous indicator values.
- In the next step, I evaluated the performance of the GRU and LSTM models by comparing their predictions with the actual prices, which showed promising results.
- Then I used the LSTM & GRU models to predict future stock data (for the next 7 business days), based on the previous results and the corresponding indicator values.
Data Collection & Preprocessing
To get things going, the first crucial step is gathering historical data for the chosen stocks. Python, with the help of the `yfinance` library, offers a seamless way to collect and process this essential information. Below is a step-by-step guide to the data collection process I used:
- Import the necessary libraries:

```python
import numpy as np   # needed later for the indicator calculations
import pandas as pd
import yfinance as yf
```
- Functions for data checking and downloading:

```python
def set_df(data):
    # Standardize the column names returned by yfinance
    df = data.rename(columns={'Date': 'date', 'Open': 'open', 'High': 'high',
                              'Low': 'low', 'Close': 'close',
                              'Adj Close': 'adj_close', 'Volume': 'volume'})
    return df

def data_checker(df):
    # Fail fast if the download contains missing values
    assert df.isnull().values.sum() == 0, 'Value MISSING'
    assert df.isna().values.sum() == 0, 'Value N/A'

def date_col(df):
    df['date'] = pd.to_datetime(df.date)
    return df

def data_download(company, start_date, end_date):
    df = yf.download(company, start=start_date, end=end_date)
    df = df.reset_index()
    return df
```
- Data download for Tesla (ticker symbol ‘TSLA’):

```python
company = "TSLA"
start_date = "2020-01-01"
end_date = "2023-01-01"

df = data_download(company, start_date, end_date)
data = df.copy()
data = set_df(data)
data_checker(data)
data = date_col(data)
```
- Calculation of the market indicators:

```python
def calculate_ma(data):
    # 7- and 21-day simple moving averages
    data['ma7'] = data['close'].rolling(window=7).mean()
    data['ma21'] = data['close'].rolling(window=21).mean()

def calculate_macd(data):
    data['26ema'] = data['close'].ewm(span=26).mean()
    data['12ema'] = data['close'].ewm(span=12).mean()
    data['MACD'] = data['12ema'] - data['26ema']

def calculate_bollinger_bands(data):
    data['20sd'] = data['close'].rolling(window=20).std()
    data['upper_band'] = data['close'].rolling(window=20).mean() + (data['20sd'] * 2)
    data['lower_band'] = data['close'].rolling(window=20).mean() - (data['20sd'] * 2)

def calculate_ema(data):
    data['ema'] = data['close'].ewm(com=0.5).mean()

def calculate_momentum(data, period=10):
    # Rate-of-change momentum over `period` days
    data['momentum'] = (data['close'] / data['close'].shift(period)) - 1

def calculate_rsi_30(data):
    # 30-day RSI, matching the column name
    delta = data['close'].diff()
    gain = delta.where(delta > 0, 0)
    loss = -delta.where(delta < 0, 0)
    avg_gain = gain.rolling(window=30, min_periods=1).mean()
    avg_loss = loss.rolling(window=30, min_periods=1).mean()
    rs = avg_gain / avg_loss
    data['rsi_30'] = 100 - (100 / (1 + rs))

def calculate_cci_30(data):
    tp = (data['high'] + data['low'] + data['close']) / 3
    sma_tp = tp.rolling(window=30).mean()
    mean_deviation = (abs(tp - sma_tp)).rolling(window=30).mean()
    data['cci_30'] = (tp - sma_tp) / (0.015 * mean_deviation)

def calculate_dx_30(data):
    # Simplified 30-day directional index from up-moves vs. down-moves
    high_diff = data['high'].diff()
    low_diff = -data['low'].diff()
    high_gain = high_diff.where(high_diff > low_diff, 0)
    low_loss = low_diff.where(low_diff > high_diff, 0)
    avg_high_gain = high_gain.rolling(window=30).mean()
    avg_low_loss = low_loss.rolling(window=30).mean()
    data['dx_30'] = (100 * (avg_high_gain - avg_low_loss) / (avg_high_gain + avg_low_loss)).abs()

def calculate_close_30(data):
    data['close_30'] = data['close'].rolling(window=30).mean()

def calculate_close_60(data):
    data['close_60'] = data['close'].rolling(window=60).mean()

def calculate_volatility(data, window_size):
    # Annualized rolling volatility of daily log returns
    data['Log_Returns'] = data['close'].pct_change().apply(lambda x: np.log(1 + x))
    vol_col = 'Vola' + str(window_size) + 'd'
    data[vol_col] = data['Log_Returns'].rolling(window=window_size).std() * np.sqrt(252)

def calculate_volume_indicator(data, window_size):
    # Annualized rolling volatility of daily volume changes
    data['Vol_Pct_Change'] = data['volume'].pct_change()
    volu_col = 'Volu' + str(window_size) + 'd_ff'
    data[volu_col] = data['Vol_Pct_Change'].rolling(window=window_size).std() * np.sqrt(252)

def calculate_average_true_range(data, window=14):
    high_low = data['high'] - data['low']
    high_close = np.abs(data['high'] - data['close'].shift())
    low_close = np.abs(data['low'] - data['close'].shift())
    true_range = pd.concat([high_low, high_close, low_close], axis=1).max(axis=1)
    data['atr'] = true_range.rolling(window=window).mean()
```
```python
calculate_ma(data)
calculate_macd(data)
calculate_bollinger_bands(data)
calculate_ema(data)
calculate_momentum(data)
calculate_rsi_30(data)
calculate_cci_30(data)
calculate_dx_30(data)
calculate_close_30(data)
calculate_close_60(data)
calculate_volatility(data, 10)
calculate_volatility(data, 30)
calculate_volatility(data, 60)
calculate_volume_indicator(data, 10)
calculate_volume_indicator(data, 30)
calculate_volume_indicator(data, 60)
calculate_average_true_range(data)
```
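As a quick sanity check of the Bollinger construction above (on synthetic prices, not project data), the band width should always equal four rolling standard deviations wherever the 20-day window is full:

```python
import numpy as np
import pandas as pd

# Synthetic closing prices for illustration only
data = pd.DataFrame({'close': np.random.default_rng(1).normal(100, 3, 60)})
mid = data['close'].rolling(window=20).mean()
sd = data['close'].rolling(window=20).std()
upper = mid + 2 * sd
lower = mid - 2 * sd
# upper - lower collapses to 4 * sd once the rolling window is full
print(np.allclose((upper - lower).dropna(), 4 * sd.dropna()))  # True
```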
Then I plotted some of the key indicators to visualize how they operate, especially in identifying recent market trends and spotting overbought and oversold conditions.
After that, I added Fourier analysis to understand the stock data better. While indicators like MACD, RSI, and Bollinger Bands capture specific aspects of market behavior, Fourier analysis decomposes the price series into frequency components, revealing longer-term cyclical patterns that are not immediately evident in the raw prices. It complements the traditional indicators and offers a broader view of the forces driving stock price movements. Here’s what I found while working with different Fourier components:
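A common way to compute such Fourier components (my own sketch, not the exact code from the project) is to keep only the lowest-frequency FFT terms of the closing price and invert back: fewer components give a smoother long-term trend line, more components follow the price more closely.

```python
import numpy as np
import pandas as pd

def fourier_components(close, n_components=(3, 6, 9)):
    # Keep only the n lowest-frequency FFT terms (plus their conjugate
    # mirrors at the end of the spectrum) and invert back to the time domain.
    fft = np.fft.fft(close.to_numpy())
    out = {}
    for n in n_components:
        fft_n = fft.copy()
        fft_n[n:-n] = 0
        out['fourier_' + str(n)] = np.fft.ifft(fft_n).real
    return pd.DataFrame(out, index=close.index)

# Toy usage: a flat series is reproduced exactly by its DC component alone.
flat = pd.Series(np.full(100, 50.0))
approx = fourier_components(flat)
print(np.allclose(approx['fourier_3'], 50.0))  # True
```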
Then I did some data cleaning and validation (checking for NaN values and handling them), followed by normalizing the values and converting them into time-series data. Let’s discuss these one by one:
Handling NaN values: NaN values often sneak into the financial indicators, especially during the early periods where calculations depend on historical data. Picture this: a 7-day moving average (ma7) has its first six values set to NaN. It’s only fair; how can we calculate an average without a history to lean on? Tackling these gaps requires a filling strategy, and for this time-series data I applied forward fill and backward fill together, since the combination `ffill().bfill()` works well for data that exhibits patterns and trends, where each data point has a connection to what came before and what follows.
```python
data = data.ffill().bfill()
```
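As a small illustration (toy values, not project data) of why the two fills are combined:

```python
import numpy as np
import pandas as pd

# Leading NaNs mimic the warm-up period of a rolling indicator like ma7
s = pd.Series([np.nan, np.nan, 10.0, 11.0, np.nan, 12.0])
# ffill propagates the last valid value forward; bfill then covers the
# leading NaNs, which have no earlier value to propagate from.
filled = s.ffill().bfill()
print(filled.tolist())  # [10.0, 10.0, 10.0, 11.0, 11.0, 12.0]
```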
Next, I dropped the remaining price columns: since the open, high, low, and adjusted-close prices track the closing price we are trying to predict very closely, keeping them would bias the data heavily and could lead to overfitting.
```python
df_indicators = data.drop(['open', 'high', 'low', 'adj_close'], axis=1)
```
After that, I normalized the data to prepare it for the upcoming prediction task, using scikit-learn’s `MinMaxScaler` to ensure consistency across indicators. `MinMaxScaler` rescales each input feature to a standardized 0-to-1 range, creating a more balanced dataset.
```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_normalized = df_indicators.copy()
df_normalized[df_normalized.columns[1:]] = scaler.fit_transform(df_normalized[df_normalized.columns[1:]])
```
Train-Test Split & Conversion into Time-series Data:
After the data was prepared, it was time to split the time series. I divided my data frame following the standard train-test split ratio of 0.80 : 0.20, but notably, unlike the usual case, I did not shuffle the data, so that the temporal sequence of stock prices is preserved. This is crucial for any time-series data: it ensures that the models learn from historical trends and patterns, contributing to more accurate predictions when applied to unseen data.
```python
df_time_series = df_normalized.set_index('date')
train_size = int(len(df_time_series) * 0.8)
train, test = df_time_series.iloc[:train_size], df_time_series.iloc[train_size:]
```
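One caveat worth flagging: fitting the scaler on the full dataset before splitting lets the test period’s price range leak into training. A leakage-free variant (a sketch using hypothetical indicator columns, not the project’s data frame) fits `MinMaxScaler` on the training window only and reuses that fit on the test window:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the indicator frame: 100 rows, two columns
rng = np.random.default_rng(0)
df = pd.DataFrame({'close': rng.normal(100, 5, 100), 'ma7': rng.normal(100, 5, 100)})

train_size = int(len(df) * 0.8)
train, test = df.iloc[:train_size], df.iloc[train_size:]

scaler = MinMaxScaler()
# Fit on the training window only, then apply the same transform to the test window
train_scaled = pd.DataFrame(scaler.fit_transform(train), columns=df.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test), columns=df.columns, index=test.index)
```

With this ordering, test-set values can fall slightly outside [0, 1], which is expected: the model never sees the test period’s range during training.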
Along with splitting the data, I converted the stock data into time-series samples using a sliding 14-day window.
```python
def create_time_series_data(data, time_steps=1):
    # Each sample is `time_steps` consecutive rows of all features;
    # the target is the next day's 'close'
    X, y = [], []
    for i in range(len(data) - time_steps):
        X.append(data.iloc[i:(i + time_steps)].values)
        y.append(data.iloc[i + time_steps]['close'])
    return np.array(X), np.array(y).reshape(-1, 1)

time_steps = 14
X_train, y_train = create_time_series_data(train, time_steps)
X_test, y_test = create_time_series_data(test, time_steps)
```
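To make the resulting shapes concrete, here is the same windowing logic run on a toy frame (hypothetical columns): 50 rows with a 14-step window yield 36 samples, each of shape (14, n_features), which is exactly the 3-D input that LSTM and GRU layers expect.

```python
import numpy as np
import pandas as pd

def create_time_series_data(data, time_steps=1):
    # Same windowing logic as above: sliding windows of `time_steps` rows,
    # each paired with the following day's 'close' as the target
    X, y = [], []
    for i in range(len(data) - time_steps):
        X.append(data.iloc[i:(i + time_steps)].values)
        y.append(data.iloc[i + time_steps]['close'])
    return np.array(X), np.array(y).reshape(-1, 1)

# Toy frame: 50 days, 3 features, values 0..149 so targets are easy to check
toy = pd.DataFrame(np.arange(150.0).reshape(50, 3), columns=['close', 'ma7', 'rsi_30'])
X, y = create_time_series_data(toy, time_steps=14)
print(X.shape, y.shape)  # (36, 14, 3) (36, 1)
```

The first target `y[0]` is the 'close' of row 14 (here 42.0), i.e. the first day just after the first 14-day window.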