![](https://crypto4nerd.com/wp-content/uploads/2023/07/13RroWacuc_7lUnXFHXlgbw.png)
This is a memo of an ML trading application based on the book [1].
We walk through evaluating features with mutual information and through the bias-variance trade-off when constructing a machine learning model. First, we use the House Sales in King County, USA dataset from Kaggle.
import warnings
warnings.filterwarnings('ignore')

from pathlib import Path
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import spearmanr
from sklearn.neighbors import (KNeighborsClassifier,
                               KNeighborsRegressor)
from sklearn.model_selection import (cross_val_score,
                                     cross_val_predict,
                                     GridSearchCV)
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import StandardScaler, scale
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer
from yellowbrick.model_selection import ValidationCurve, LearningCurve
import statsmodels.api as sm

sns.set_style('whitegrid')
house_sales = pd.read_csv("/home/tom/work/MLForTrading/Data/kc_house_data.csv")
house_sales = house_sales.drop(
    ['id', 'zipcode', 'lat', 'long', 'date'], axis=1)
house_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price 21613 non-null float64
1 bedrooms 21613 non-null int64
2 bathrooms 21613 non-null float64
3 sqft_living 21613 non-null int64
4 sqft_lot 21613 non-null int64
5 floors 21613 non-null float64
6 waterfront 21613 non-null int64
7 view 21613 non-null int64
8 condition 21613 non-null int64
9 grade 21613 non-null int64
10 sqft_above 21613 non-null int64
11 sqft_basement 21613 non-null int64
12 yr_built 21613 non-null int64
13 yr_renovated 21613 non-null int64
14 sqft_living15 21613 non-null int64
15 sqft_lot15 21613 non-null int64
dtypes: float64(3), int64(13)
memory usage: 2.6 MB
Let’s see what the target variable looks like.
sns.distplot(house_sales.price)
sns.despine();
plt.tight_layout();
sm.qqplot(house_sales.price, line ='45', fit=True);
As you can see in the plots, the price distribution is far from Gaussian, while most ML models assume Gaussian noise. A skewed variable is usually better transformed. Popular transformations are Box-Cox and Yeo-Johnson; please refer to the sklearn article [2]. We use a simple log transformation here.
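As an alternative to the plain log transform, the Box-Cox or Yeo-Johnson transforms mentioned in [2] can be applied with sklearn's PowerTransformer. This is only a minimal sketch of that option, not used in the rest of this memo:

from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson also handles zero/negative values; Box-Cox requires strictly positive input.
pt = PowerTransformer(method='yeo-johnson', standardize=True)
price_yj = pt.fit_transform(house_sales[['price']])
sns.distplot(price_yj.ravel());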
X_all = house_sales.drop("price", axis=1)
y = np.log(house_sales.price)

sns.distplot(y)
sns.despine();
plt.tight_layout();
sm.qqplot(y, line='45', fit=True);
The distribution looks much better now!
Mutual Information Regression
Selecting a subset of useful features is an important step to enhance a model. One of the methods is based on mutual information (MI). Unlike the correlation coefficient, MI can also capture non-linear relationships.
We use the sklearn implementation [3] to estimate MI. sklearn employs a k-nearest-neighbors estimator rather than a straightforward binning method [4].
mi_reg = pd.Series(mutual_info_regression(X_all, y),
                   index=X_all.columns).sort_values()
print(mi_reg)
X = X_all.loc[:, mi_reg.iloc[:10].index]
mi_reg.plot.barh();

yr_renovated 0.007650
condition 0.009350
waterfront 0.010187
view 0.058883
sqft_lot 0.062446
sqft_basement 0.067463
floors 0.072670
yr_built 0.073532
bedrooms 0.082541
sqft_lot15 0.084171
bathrooms 0.202087
sqft_above 0.259639
sqft_living15 0.272980
grade 0.348039
sqft_living 0.348507
dtype: float64
correl = X.apply(lambda x: spearmanr(x, y)[0]).sort_values()
print(correl)
correl.plot.barh();

condition 0.018490
sqft_lot15 0.062766
sqft_lot 0.074939
yr_renovated 0.101876
yr_built 0.102038
waterfront 0.115089
sqft_basement 0.251704
view 0.293931
floors 0.322347
bedrooms 0.344652
dtype: float64
g = sns.pairplot(X_all.assign(price=y), y_vars=['price'], x_vars=X.columns)
sns.despine();
Mutual information is more generic than the correlation coefficient in the sense that it detects whether there is any relationship at all between random variables. MI is, however, harder to estimate and does not tell you how the variables are related; for example, it does not differentiate between positive and negative association.
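To illustrate, here is a small synthetic sketch (my own example, not from the book): for a symmetric quadratic relationship, both Pearson and Spearman correlations come out close to zero, while the MI estimate is clearly positive.

rng = np.random.default_rng(42)
x_demo = rng.uniform(-1, 1, 1000)
y_demo = x_demo ** 2 + rng.normal(scale=.05, size=1000)  # non-monotonic relationship

print(np.corrcoef(x_demo, y_demo)[0, 1])                      # Pearson correlation, roughly 0
print(spearmanr(x_demo, y_demo)[0])                           # Spearman correlation, roughly 0
print(mutual_info_regression(x_demo.reshape(-1, 1), y_demo))  # MI estimate, clearly positive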
Model Complexity
When constructing a model, more complexity means more variance and less bias. Thus, we need to find a sweet spot with well-balanced complexity. In the case of KNeighborsRegressor, complexity is controlled by n_neighbors: the higher n_neighbors, the lower the complexity. We use yellowbrick to visualize complexity vs. performance.
from sklearn.metrics import (mean_squared_error,
                             mean_absolute_error,
                             mean_squared_log_error,
                             median_absolute_error,
                             explained_variance_score,
                             r2_score)

def rmse(y_true, pred):
    return np.sqrt(mean_squared_error(y_true=y_true, y_pred=pred))

rmse_score = make_scorer(rmse)
n_neighbors = tuple(range(5, 101, 5))
fig, ax = plt.subplots(figsize=(16, 9))
val_curve = ValidationCurve(KNeighborsRegressor(),
param_name='n_neighbors',
param_range=n_neighbors,
cv=5,
scoring=rmse_score,
# n_jobs=-1,
ax=ax)
val_curve.fit(X, y)
val_curve.poof()
sns.despine()
fig.tight_layout();
The gap between training and validation loss decreases as n_neighbors increases, i.e., as complexity decreases.
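To pick n_neighbors programmatically rather than reading it off the validation curve, a grid search over the same range can be used. A minimal sketch with the Pipeline, StandardScaler and GridSearchCV imported above (the scaling step is my own addition, and the RMSE scorer has to be negated so that lower is better):

neg_rmse_score = make_scorer(rmse, greater_is_better=False)

pipe = Pipeline([('scale', StandardScaler()),
                 ('knn', KNeighborsRegressor())])
grid = GridSearchCV(pipe,
                    param_grid={'knn__n_neighbors': n_neighbors},
                    scoring=neg_rmse_score,
                    cv=5,
                    n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_, -grid.best_score_)  # best n_neighbors and its cross-validated RMSE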
Observing how performance progresses with the number of training samples can be used to estimate how much data is required. Generally speaking, the more complex the model, the more data it needs to perform well.
fig, ax = plt.subplots(figsize=(16, 9))
l_curve = LearningCurve(KNeighborsRegressor(n_neighbors=50),
train_sizes=np.arange(.1, 1.01, .1),
scoring=rmse_score,
cv=5,
n_jobs=-1,
ax=ax)
l_curve.fit(X, y)
l_curve.poof()
sns.despine()
fig.tight_layout();
Evaluate Features based on Mutual Information.
Please refer to the data creation notebook to prepare data.
We evaluate the features generated in Chapter 4 using mutual information.
DATA_STORE = "/home/tom/work/MLForTrading/Data/assets.h5"

with pd.HDFStore(DATA_STORE) as store:
    data = store['engineered_features']

dummy_data = pd.get_dummies(data,
                            columns=['year', 'month', 'msize', 'sector'],
                            prefix=['year', 'month', 'msize', ''],
                            prefix_sep=['_', '_', '_', ''])
dummy_data = dummy_data.rename(columns={c: c.replace('.0', '') for c in dummy_data.columns})
dummy_data.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 358914 entries, ('A', Timestamp('2001-01-31 00:00:00')) to ('ZUMZ', Timestamp('2018-02-28 00:00:00'))
Data columns (total 82 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 return_1m 358914 non-null float64
1 return_2m 358914 non-null float64
2 return_3m 358914 non-null float64
3 return_6m 358914 non-null float64
4 return_9m 358914 non-null float64
5 return_12m 358914 non-null float64
6 Mkt-RF 358914 non-null float64
7 SMB 358914 non-null float64
8 HML 358914 non-null float64
9 RMW 358914 non-null float64
10 CMA 358914 non-null float64
11 momentum_2 358914 non-null float64
12 momentum_3 358914 non-null float64
13 momentum_6 358914 non-null float64
14 momentum_9 358914 non-null float64
15 momentum_12 358914 non-null float64
16 momentum_3_12 358914 non-null float64
17 return_1m_t-1 357076 non-null float64
18 return_1m_t-2 355238 non-null float64
19 return_1m_t-3 353400 non-null float64
20 return_1m_t-4 351562 non-null float64
21 return_1m_t-5 349724 non-null float64
22 return_1m_t-6 347886 non-null float64
23 target_1m 358914 non-null float64
24 target_2m 357076 non-null float64
25 target_3m 355238 non-null float64
26 target_6m 349724 non-null float64
27 target_12m 338696 non-null float64
28 year_2001 358914 non-null uint8
29 year_2002 358914 non-null uint8
30 year_2003 358914 non-null uint8
31 year_2004 358914 non-null uint8
32 year_2005 358914 non-null uint8
33 year_2006 358914 non-null uint8
34 year_2007 358914 non-null uint8
35 year_2008 358914 non-null uint8
36 year_2009 358914 non-null uint8
37 year_2010 358914 non-null uint8
38 year_2011 358914 non-null uint8
39 year_2012 358914 non-null uint8
40 year_2013 358914 non-null uint8
41 year_2014 358914 non-null uint8
42 year_2015 358914 non-null uint8
43 year_2016 358914 non-null uint8
44 year_2017 358914 non-null uint8
45 year_2018 358914 non-null uint8
46 month_1 358914 non-null uint8
47 month_2 358914 non-null uint8
48 month_3 358914 non-null uint8
49 month_4 358914 non-null uint8
50 month_5 358914 non-null uint8
51 month_6 358914 non-null uint8
52 month_7 358914 non-null uint8
53 month_8 358914 non-null uint8
54 month_9 358914 non-null uint8
55 month_10 358914 non-null uint8
56 month_11 358914 non-null uint8
57 month_12 358914 non-null uint8
58 msize_-1 358914 non-null uint8
59 msize_1 358914 non-null uint8
60 msize_2 358914 non-null uint8
61 msize_3 358914 non-null uint8
62 msize_4 358914 non-null uint8
63 msize_5 358914 non-null uint8
64 msize_6 358914 non-null uint8
65 msize_7 358914 non-null uint8
66 msize_8 358914 non-null uint8
67 msize_9 358914 non-null uint8
68 msize_10 358914 non-null uint8
69 Basic Industries 358914 non-null uint8
70 Capital Goods 358914 non-null uint8
71 Consumer Durables 358914 non-null uint8
72 Consumer Non-Durables 358914 non-null uint8
73 Consumer Services 358914 non-null uint8
74 Energy 358914 non-null uint8
75 Finance 358914 non-null uint8
76 Health Care 358914 non-null uint8
77 Miscellaneous 358914 non-null uint8
78 Public Utilities 358914 non-null uint8
79 Technology 358914 non-null uint8
80 Transportation 358914 non-null uint8
81 Unknown 358914 non-null uint8
dtypes: float64(28), uint8(54)
memory usage: 96.6+ MB

target_labels = [f'target_{i}m' for i in [1, 2, 3, 6, 12]]
targets = data.dropna().loc[:, target_labels]
features = data.dropna().drop(target_labels, axis=1)
features.sector = pd.factorize(features.sector)[0]

cat_cols = ['year', 'month', 'msize', 'sector']

from sklearn.feature_selection import mutual_info_classif

discrete_features = [features.columns.get_loc(c) for c in cat_cols]

mutual_info = pd.DataFrame()
for label in target_labels:
    mi = mutual_info_classif(X=features,
                             y=(targets[label] > 0).astype(int),
                             discrete_features=discrete_features,
                             random_state=42)
    mutual_info[label] = pd.Series(mi, index=features.columns)

mutual_info.sum()

target_1m     0.034638
target_2m     0.057831
target_3m     0.090984
target_6m     0.132624
target_12m    0.198652
dtype: float64

fig, ax = plt.subplots(figsize=(15, 4))
sns.heatmap(mutual_info.div(mutual_info.sum()).T, ax=ax, cmap='Blues');
Normalizing mutual information by its column sums gives a comparable view across the different target variables. In the graph above, the year variable looks like the most important feature. Let's look closer by generating one-hot encoded variables.
target_labels = [f'target_{i}m' for i in [1, 2, 3, 6, 12]]
dummy_targets = dummy_data.dropna().loc[:, target_labels]
dummy_features = dummy_data.dropna().drop(target_labels, axis=1)

mutual_info_dummies = pd.DataFrame()
cat_cols = [c for c in dummy_features.columns if c not in features.columns]
discrete_features = [dummy_features.columns.get_loc(c) for c in cat_cols]
for label in target_labels:
    mi = mutual_info_classif(X=dummy_features,
                             y=(dummy_targets[label] > 0).astype(int),
                             discrete_features=discrete_features,
                             random_state=42)
    mutual_info_dummies[label] = pd.Series(mi, index=dummy_features.columns)

mutual_info_dummies.sum()

target_1m     0.035752
target_2m     0.059713
target_3m     0.093602
target_6m     0.136360
target_12m    0.202925
dtype: float64

fig, ax = plt.subplots(figsize=(4, 20))
sns.heatmap(mutual_info_dummies.div(mutual_info_dummies.sum()), ax=ax, cmap='Blues');
Although year_2008 looks like the most important feature, this is biased by the volatility in that year. When building a model, it is better to use features that do not let you classify which timestamp an observation comes from. One of the methods to check this is Adversarial Validation [6].
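A minimal sketch of the adversarial validation idea, with hypothetical train_features / test_features DataFrames split by date (this is only an illustration of the general technique, not the exact procedure from [6]): train a classifier to distinguish the two periods; an AUC near 0.5 means they are indistinguishable, and features with high importance are the time-dependent ones to reconsider.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# train_features / test_features are hypothetical feature matrices for the two periods
adv_X = pd.concat([train_features, test_features])
adv_y = np.r_[np.zeros(len(train_features)), np.ones(len(test_features))]

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
print(cross_val_score(clf, adv_X, adv_y, cv=5, scoring='roc_auc').mean())

importances = pd.Series(clf.fit(adv_X, adv_y).feature_importances_, index=adv_X.columns)
print(importances.sort_values(ascending=False).head())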
Bias-Variance Tradeoff
We look into the bias-variance tradeoff through a Taylor expansion of the sine function.
from scipy.special import factorial
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from collections import defaultdict
import numpy as np
from numpy.random import randint, choice, normal, shuffle
import pandas as pd

def f(x, max_degree=5):
    # Taylor expansion of sin(x) using odd powers below max_degree
    taylor = [(-1)**i * x ** e / factorial(e) for i, e in enumerate(range(1, max_degree, 2))]
    return np.sum(taylor, axis=0)
fig, ax = plt.subplots(figsize=(14, 5))
x = np.linspace(-5, 5, 1000)
data = pd.DataFrame({'y': f(x), 'x': x})
data.plot(x='x', y='y', legend=False, ax=ax)
pd.Series(np.sin(x), index=x).plot(ax=ax, ls='--', lw=2, label='sine')
plt.legend();
fig, axes = plt.subplots(ncols=3, figsize=(15, 4))
x = np.linspace(-.5 * np.pi, 2.5 * np.pi, 1000)
true_function = pd.Series(np.sin(x), index=x)
n = 30
noise = .2
degrees = [1, 5, 15]
x_ = np.random.choice(x, size=n)
y_ = np.sin(x_)
y_ += normal(loc=0, scale=np.std(y_) * noise, size=n)
mse = defaultdict(list)
for i, degree in enumerate(degrees):
    fit = np.poly1d(np.polyfit(x=x_, y=y_, deg=degree))
    true_function.plot(ax=axes[i], c='darkred', lw=1, ls='--', label='True Function')
    pd.Series(y_, index=x_).plot(style='.', label='Sample', ax=axes[i], c='k')
    pd.Series(fit(x), index=x).plot(label='Model', ax=axes[i])
    axes[i].set_ylim(-1.5, 1.5)
    mse = mean_squared_error(fit(x), np.sin(x))
    axes[i].set_title(f'Degree {degree} | MSE: {mse:,.2f}')
    axes[i].legend()
    axes[i].grid(False)
    axes[i].axis(False)

sns.despine()
fig.tight_layout();
A higher-order polynomial fits the given points well but is unstable outside the samples. You need something in the middle.
datasets = ['Train', 'Test']
X = {'Train': np.linspace(-1, 1, 1000), 'Test': np.linspace(1, 2, 500)}
models = {'Underfit': 1, 'Right Fit': 5, 'Overfit': 9}
sample, noise = 25, .01
result = []
for i in range(100):
    x_ = {d: choice(X[d], size=sample, replace=False) for d in datasets}
    y_ = {d: f(x_[d], max_degree=5) for d in datasets}
    y_['Train'] += normal(loc=0,
                          scale=np.std(y_['Train']) * noise,
                          size=sample)
    trained_models = {
        fit: np.poly1d(np.polyfit(x=x_['Train'], y=y_['Train'], deg=deg))
        for fit, deg in models.items()
    }
    for fit, model in trained_models.items():
        for dataset in datasets:
            pred = model(x_[dataset])
            result.append(
                pd.DataFrame(
                    dict(x=x_[dataset],
                         Model=fit,
                         Data=dataset,
                         y=pred,
                         Error=pred - y_[dataset])))
result = pd.concat(result)
y = {d: f(X[d], max_degree=5) for d in datasets}
y['Train_noise'] = y['Train'] + normal(loc=0,
scale=np.std(y['Train']) * noise,
size=len(y['Train']))
colors = {'Underfit': 'darkblue', 'Right Fit': 'darkgreen', 'Overfit': 'darkred'}
test_data = result[result.Data == 'Test']
fig, axes = plt.subplots(ncols=2, figsize=(18, 7), sharey=True)
sns.boxplot(x='Model', y='Error', hue='Data', data=result, ax=axes[0], linewidth=2)
axes[0].set_title('In- vs Out-of-Sample Errors')
axes[0].axhline(0, ls='--', lw=1, color='k')
axes[0].set_ylabel('Symmetric Log Scale')
for model in colors.keys():
    (test_data[(test_data['Model'] == model)]
     .plot.scatter(x='x',
                   y='y',
                   ax=axes[1],
                   s=2,
                   color=colors[model],
                   alpha=.5,
                   label=model))
# pd.Series(y['Train'], index=X['Train']).sort_index().plot(ax=axes[1], title='Out-of-sample Predictions')
pd.DataFrame(dict(x=X['Train'], y=y['Train_noise'])).plot.scatter(x='x', y='y', ax=axes[1], c='k', s=1)
pd.Series(y['Test'], index=X['Test']).plot(color='black', lw=5, ls='--', ax=axes[1], label='Actuals')
axes[0].set_yscale('symlog')
axes[1].set_title('Out-of-Sample Predictions')
axes[1].legend()
axes[0].grid(False)
axes[1].grid(False)
sns.despine()
fig.tight_layout()
fig.suptitle('Bias - Variance Tradeoff: Under vs. Overfitting', fontsize=16)
fig.subplots_adjust(top=0.9)
def folds(train, test, nfolds):
    shuffle(train)
    shuffle(test)
    steps = (np.array([len(train), len(test)]) / nfolds).astype(int)
    for fold in range(nfolds):
        i, j = fold * steps
        yield train[i:i + steps[0]], test[j:j + steps[1]]

def rmse(y, x, model):
    return np.sqrt(mean_squared_error(y_true=y, y_pred=model.predict(x)))

def create_poly_data(data, degree):
    return np.hstack([data.reshape(-1, 1) ** i for i in range(degree + 1)])

train_set = X['Train'] + normal(scale=np.std(f(X['Train']))) * .2
test_set = X['Test'].copy()

sample_sizes = np.arange(.1, 1.0, .01)
indices = ([len(train_set), len(test_set)] *
           sample_sizes.reshape(-1, 1)).astype(int)
result = []
lr = LinearRegression()

for label, degree in models.items():
    model_train = create_poly_data(train_set, degree)
    model_test = create_poly_data(test_set, degree)
    for train_idx, test_idx in indices:
        train = model_train[:train_idx]
        test = model_test[:test_idx]
        train_rmse, test_rmse = [], []
        for x_train, x_test in folds(train, test, 5):
            y_train, y_test = f(x_train[:, 1]), f(x_test[:, 1])
            lr.fit(X=x_train, y=y_train)
            train_rmse.append(rmse(y=y_train, x=x_train, model=lr))
            test_rmse.append(rmse(y=y_test, x=x_test, model=lr))
        result.append([label, train_idx,
                       np.mean(train_rmse), np.std(train_rmse),
                       np.mean(test_rmse), np.std(test_rmse)])

result = (pd.DataFrame(result,
                       columns=['Model', 'Train Size',
                                'Train RMSE', 'Train RMSE STD',
                                'Test RMSE', 'Test RMSE STD'])
          .set_index(['Model', 'Train Size']))

fig, axes = plt.subplots(nrows=3, sharey=True, figsize=(16, 9))
for i, model in enumerate(models.keys()):
    result.loc[model, ['Train RMSE', 'Test RMSE']].plot(ax=axes[i], title=f'Model: {model}', logy=True, lw=2)
    axes[i].set_ylabel('Log RMSE')

fig.suptitle('Learning Curves', fontsize=16)
fig.tight_layout()
sns.despine()
fig.subplots_adjust(top=.92);
Cross Validation
To select a proper complexity for the model, cross validation is one way to estimate performance for a given set of parameters. While there are many variations, the ones I frequently use are:
KFold is the basic cross validation method, used here without shuffling: since data points close in time share the same information, shuffling would inflate the measured performance, so I usually do not shuffle when sampling.
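For example, a plain sketch with sklearn's KFold keeping shuffle=False, where X_feat and y_target are hypothetical stand-ins for whatever feature matrix and target you are evaluating:

from sklearn.model_selection import KFold, cross_val_score

# shuffle=False keeps temporally adjacent observations in the same fold,
# which limits leakage from nearby samples that share information.
cv = KFold(n_splits=5, shuffle=False)
scores = cross_val_score(KNeighborsRegressor(n_neighbors=50), X_feat, y_target,
                         cv=cv, scoring=rmse_score)
print(scores.mean(), scores.std())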
CPCV [7] (Combinatorial Purged CV): the differences from KFold are (a simplified sketch of the purging step follows this list)
- using combinations of folds to increase the number of train/test trials
- purging observations at the fold margins to avoid data leakage
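The sketch below illustrates only the purging step in a simplified form; the combinatorial part and the exact embargo rules of [7] are omitted:

def purged_kfold_indices(n_samples, n_splits=5, purge=5):
    # Drop `purge` observations on each side of the test fold from the training
    # indices so that overlapping information cannot leak across the split.
    fold_size = n_samples // n_splits
    indices = np.arange(n_samples)
    for k in range(n_splits):
        test_start, test_end = k * fold_size, (k + 1) * fold_size
        test_idx = indices[test_start:test_end]
        train_mask = (indices < test_start - purge) | (indices >= test_end + purge)
        yield indices[train_mask], test_idx

# Usage: for train_idx, test_idx in purged_kfold_indices(len(X_feat)): ...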
Walkforward is the most realistic option and closest to how the model is used in deployment. The disadvantage is that each split lets you use less data than the other cross validation methods.
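A walk-forward split can be sketched with sklearn's TimeSeriesSplit, which always trains on the past and tests on the following block (again with the hypothetical X_feat / y_target from above):

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X_feat):
    # refit on everything seen so far, evaluate on the next block, as in deployment
    model = KNeighborsRegressor(n_neighbors=50).fit(X_feat.iloc[train_idx], y_target.iloc[train_idx])
    pred = model.predict(X_feat.iloc[test_idx])
    print(np.sqrt(mean_squared_error(y_target.iloc[test_idx], pred)))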