![](https://crypto4nerd.com/wp-content/uploads/2023/07/1dWZX-sNB3aj7MOPGH9WH7A.png)
We will use QQQ as a case study. We start with a pandas dataframe containing price (open, high, low, close), volume and several technical indicators that have been previously generated.
We analyze the returns of these indicators (when used individually) from 3 January 2023 to 26 June 2023. For example, the MACD generated a 26% return on its own. However, this is less than a simple buy-and-hold performance of 35% over a similar period.
1. Feature Engineering
To help the model understand the context of the price environment, we engineer additional features from the price and volume data. The new features added for this case study are the Hurst Exponent, Price_Strength, Relative_Volume and Close_Open. The Hurst Exponent is a statistical measure that attempts to quantify whether a time series is mean-reverting or trend-following. Price_Strength is an index set to 100 when the price closes at the high of the day and 0 when it closes at the low. Relative_Volume is the volume of the current period divided by the rolling average volume of the past 20 trading days. Lastly, Close_Open is the closing price divided by the opening price.
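The features above can be sketched roughly as follows. The column names and the 100-bar Hurst window are assumptions for illustration, not taken from the article:

```python
import numpy as np
import pandas as pd

def hurst_exponent(series, max_lag=20):
    """Estimate the Hurst exponent from the scaling of lagged differences.
    H < 0.5 suggests mean reversion, H > 0.5 suggests momentum."""
    lags = range(2, max_lag)
    tau = [np.std(series.diff(lag).dropna()) for lag in lags]
    # slope of log(std) versus log(lag) approximates H
    return np.polyfit(np.log(list(lags)), np.log(tau), 1)[0]

def add_features(df):
    # 100 when the close is at the day's high, 0 at the day's low
    df["Price_Strength"] = 100 * (df["Close"] - df["Low"]) / (df["High"] - df["Low"])
    # current volume relative to its 20-day rolling average
    df["Relative_Volume"] = df["Volume"] / df["Volume"].rolling(20).mean()
    df["Close_Open"] = df["Close"] / df["Open"]
    # rolling Hurst over a 100-bar window (window length is a choice, not from the article)
    df["Hurst"] = df["Close"].rolling(100).apply(lambda w: hurst_exponent(pd.Series(w)))
    return df
```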
2. Generating Labels
Next, to convert the prediction of price movements into a classification problem, we generate labels. For this purpose, we use the PeakDetect library, which allows us to identify the peaks and troughs of the share price. Although this labelling has a look-ahead bias (it essentially simulates a series of nearly perfect trades), our goal is to train a classifier that uses technical indicators and our engineered features to identify trades that come as close to this perfect scenario as possible.
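A minimal sketch of this labelling step, using `scipy.signal.find_peaks` as a stand-in for the PeakDetect library (the `distance` parameter and the long/flat encoding are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from scipy.signal import find_peaks

def make_labels(close, distance=5):
    """Label 1 from each trough until the next peak (long), 0 otherwise.
    find_peaks is used here as a stand-in for the PeakDetect library."""
    peaks, _ = find_peaks(close, distance=distance)
    troughs, _ = find_peaks(-close, distance=distance)
    label = pd.Series(np.nan, index=range(len(close)))
    label.iloc[troughs] = 1   # buy at troughs
    label.iloc[peaks] = 0     # sell at peaks
    # hold the last signal forward; stay flat before the first signal
    return label.ffill().fillna(0).astype(int)
```

Note that this uses the full series, so the labels embed the look-ahead bias described above; they are targets for training only, never tradable signals.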
The chart below illustrates that trading QQQ with hindsight on peaks and troughs would have yielded ~60% return since the start of 2023. However, this is unrealistic and unachievable in practice.
3. Incorporating Technical Indicators and Other Features into a Random Forest Classifier
The next step is to split our data into training and testing datasets. Given the sequential nature of our data, we did not use sklearn's train_test_split, which shuffles rows by default. Instead, we manually split the data, using historical data from 1 January 2020 to 31 December 2022 for training, and data from 1 January 2023 onwards for testing. The reason for not using a longer historical series to train our model is that most classifiers assume data stationarity; macro-conditions change significantly the further back we go, so forecasting performance may be compromised.
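The manual split can be sketched as below, assuming the dataframe has a DatetimeIndex (the function name is ours, not the article's):

```python
import pandas as pd

def time_split(df, train_end="2022-12-31", test_start="2023-01-01"):
    """Chronological split on a DatetimeIndex; avoids sklearn's
    train_test_split, which would shuffle away the time ordering."""
    train = df.loc[:train_end]
    test = df.loc[test_start:]
    return train, test
```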
Before training our Random Forest model, we use RandomizedSearchCV to identify optimal hyperparameters.
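A hedged sketch of the hyperparameter search; the parameter ranges and the time-series cross-validation folds are illustrative choices, not the article's exact search space:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Illustrative search space; the article does not specify its exact ranges
param_dist = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(3, 15),
    "min_samples_leaf": randint(1, 10),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=10,
    cv=TimeSeriesSplit(n_splits=3),  # folds that respect temporal ordering
    scoring="accuracy",
    random_state=42,
)
# search.fit(X_train, y_train); search.best_params_ then holds the winners
```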
Finally, we feed these optimal hyperparameters into our model. Here, we use a pipeline so that the StandardScaler is applied as part of model fitting. The model achieves an impressive 85% accuracy.
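The pipeline step might look as follows; the hyperparameter values are placeholders standing in for whatever the randomized search returned:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder hyperparameters; substitute search.best_params_ in practice
model = make_pipeline(
    StandardScaler(),  # fitted on the training data only, then reused at predict time
    RandomForestClassifier(n_estimators=300, max_depth=8, random_state=42),
)
# model.fit(X_train, y_train)
# accuracy = model.score(X_test, y_test)
```

Wrapping the scaler in the pipeline matters: it prevents test-set statistics from leaking into the scaling step during fitting.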
We also examine which of the features are the most important to arrive at this classification result.
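Feature importances can be read directly off the fitted forest. The toy data and the feature names below are purely illustrative; the article's own features and fitted model would take their place:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data with hypothetical feature names, for illustration only
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
names = ["MACD", "Hurst", "Relative_Volume", "Close_Open"]
rf = RandomForestClassifier(random_state=0).fit(X, y)
# Impurity-based importances sum to 1; larger values mean the feature drives more splits
importances = pd.Series(rf.feature_importances_, index=names).sort_values(ascending=False)
```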
We note that using a single technical indicator, MACD, would have yielded an accuracy of only 63.3%, considerably lower than what we achieved with the ensemble approach. This highlights the value of combining multiple indicators and providing the model with additional feature context generated from price and volume data.
How does this translate into trading performance? Please note that our simulation was conducted using a 0.25% one-way transaction cost, and we used the average price of the subsequent day as the entry/exit price whenever a signal switch was detected.
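These simulation rules can be sketched as below. The article does not define "average price" precisely, so the open/close midpoint is used here as an assumption, and the function name is ours:

```python
import pandas as pd

def simulate(df, signal, cost=0.0025):
    """Toy backtest of the stated rules: trade one day after a signal flip,
    at an assumed average price (here the open/close midpoint), charging a
    0.25% one-way cost per position change. `signal` is 1 (long) or 0 (flat)."""
    avg_price = (df["Open"] + df["Close"]) / 2
    position = signal.shift(1).fillna(0)          # act on the following day
    ret = avg_price.pct_change().fillna(0)
    strategy_ret = position * ret
    trades = position.diff().abs().fillna(0)      # 1 on each entry or exit
    strategy_ret -= trades * cost
    return (1 + strategy_ret).cumprod() - 1       # cumulative return series
```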
The ensemble price signal (performance_RF) tracked the optimal trading performance remarkably well.