![](https://crypto4nerd.com/wp-content/uploads/2023/07/13RroWacuc_7lUnXFHXlgbw.png)
This is a memo of an ML trading application based on the book [1].
We walk through evaluating features with mutual information and through the bias-variance trade-off when constructing a machine learning model. First, we use the House Sales in King County, USA dataset from Kaggle.
import warnings
warnings.filterwarnings('ignore')

from pathlib import Path
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import spearmanr
from sklearn.neighbors import (KNeighborsClassifier,
                               KNeighborsRegressor)
from sklearn.model_selection import (cross_val_score,
                                     cross_val_predict,
                                     GridSearchCV)
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import StandardScaler, scale
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer
from yellowbrick.model_selection import ValidationCurve, LearningCurve
import statsmodels.api as sm

sns.set_style('whitegrid')
house_sales = pd.read_csv("/home/tom/work/MLForTrading/Data/kc_house_data.csv")
house_sales = house_sales.drop(
    ['id', 'zipcode', 'lat', 'long', 'date'], axis=1)
house_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price 21613 non-null float64
1 bedrooms 21613 non-null int64
2 bathrooms 21613 non-null float64
3 sqft_living 21613 non-null int64
4 sqft_lot 21613 non-null int64
5 floors 21613 non-null float64
6 waterfront 21613 non-null int64
7 view 21613 non-null int64
8 condition 21613 non-null int64
9 grade 21613 non-null int64
10 sqft_above 21613 non-null int64
11 sqft_basement 21613 non-null int64
12 yr_built 21613 non-null int64
13 yr_renovated 21613 non-null int64
14 sqft_living15 21613 non-null int64
15 sqft_lot15 21613 non-null int64
dtypes: float64(3), int64(13)
memory usage: 2.6 MB
Let’s see what the target variable looks like.
sns.distplot(house_sales.price)
sns.despine();
plt.tight_layout();
sm.qqplot(house_sales.price, line ='45', fit=True);
As you can see in the plots, the price distribution is far from Gaussian, while most ML models assume Gaussian noise. A skewed variable is usually better transformed. Popular transformations are Box-Cox and Yeo-Johnson; please refer to the sklearn article [2]. We use a simple log transformation here.
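As an alternative to the plain log transform, the Box-Cox or Yeo-Johnson transforms mentioned in [2] can be applied with sklearn's PowerTransformer. This is only a minimal sketch of that option, not used in the rest of this memo:

from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson also handles zero/negative values; Box-Cox requires strictly positive input.
pt = PowerTransformer(method='yeo-johnson', standardize=True)
price_yj = pt.fit_transform(house_sales[['price']])
sns.distplot(price_yj.ravel());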
X_all = house_sales.drop("price", axis=1)
y = np.log(house_sales.price)

sns.distplot(y)
sns.despine();
plt.tight_layout();
sm.qqplot(y, line='45', fit=True);
The distribution looks much better now!
Mutual Information Regression
Selecting a subset of useful features is an important step to enhance a model. One of the methods is based on mutual information (MI). Unlike the correlation coefficient, MI can also capture non-linear relationships.
We use the sklearn implementation [3] to estimate MI. sklearn employs a k-nearest-neighbors estimator rather than a straightforward binning method [4].
mi_reg = pd.Series(mutual_info_regression(X_all, y),
                   index=X_all.columns).sort_values()
print(mi_reg)
X = X_all.loc[:, mi_reg.iloc[:10].index]
mi_reg.plot.barh();

yr_renovated 0.007650
condition 0.009350
waterfront 0.010187
view 0.058883
sqft_lot 0.062446
sqft_basement 0.067463
floors 0.072670
yr_built 0.073532
bedrooms 0.082541
sqft_lot15 0.084171
bathrooms 0.202087
sqft_above 0.259639
sqft_living15 0.272980
grade 0.348039
sqft_living 0.348507
dtype: float64
correl = X.apply(lambda x: spearmanr(x, y)[0]).sort_values()
print(correl)
correl.plot.barh();

condition 0.018490
sqft_lot15 0.062766
sqft_lot 0.074939
yr_renovated 0.101876
yr_built 0.102038
waterfront 0.115089
sqft_basement 0.251704
view 0.293931
floors 0.322347
bedrooms 0.344652
dtype: float64
g = sns.pairplot(X_all.assign(price=y), y_vars=['price'], x_vars=X.columns)
sns.despine();
Mutual information is more generic than the correlation coefficient in the sense that it detects whether there is any relationship at all between random variables. MI is, however, harder to estimate and does not tell you how the variables are related; for example, it does not differentiate between positive and negative association.
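To illustrate, here is a small synthetic sketch (my own example, not from the book): for a symmetric quadratic relationship, both Pearson and Spearman correlations come out close to zero, while the MI estimate is clearly positive.

rng = np.random.default_rng(42)
x_demo = rng.uniform(-1, 1, 1000)
y_demo = x_demo ** 2 + rng.normal(scale=.05, size=1000)  # non-monotonic relationship

print(np.corrcoef(x_demo, y_demo)[0, 1])                      # Pearson correlation, roughly 0
print(spearmanr(x_demo, y_demo)[0])                           # Spearman correlation, roughly 0
print(mutual_info_regression(x_demo.reshape(-1, 1), y_demo))  # MI estimate, clearly positive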
Model Complexity
When constructing a model, more complexity means more variance and less bias. Thus, we need to find a sweet spot with well-balanced complexity. In the case of KNeighborsRegressor, complexity is controlled by n_neighbors: the higher n_neighbors, the lower the complexity. We use yellowbrick to visualize complexity vs. performance.
from sklearn.metrics import (mean_squared_error,
                             mean_absolute_error,
                             mean_squared_log_error,
                             median_absolute_error,
                             explained_variance_score,
                             r2_score)

def rmse(y_true, pred):
    return np.sqrt(mean_squared_error(y_true=y_true, y_pred=pred))

rmse_score = make_scorer(rmse)
n_neighbors = tuple(range(5, 101, 5))
fig, ax = plt.subplots(figsize=(16, 9))
val_curve = ValidationCurve(KNeighborsRegressor(),
param_name='n_neighbors',
param_range=n_neighbors,
cv=5,
scoring=rmse_score,
# n_jobs=-1,
ax=ax)
val_curve.fit(X, y)
val_curve.poof()
sns.despine()
fig.tight_layout();
The gap between training and validation loss decreases as n_neighbors increases, i.e., as complexity decreases.
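To pick n_neighbors programmatically rather than reading it off the validation curve, a grid search over the same range can be used. A minimal sketch with the Pipeline, StandardScaler and GridSearchCV imported above (the scaling step is my own addition, and the RMSE scorer has to be negated so that lower is better):

neg_rmse_score = make_scorer(rmse, greater_is_better=False)

pipe = Pipeline([('scale', StandardScaler()),
                 ('knn', KNeighborsRegressor())])
grid = GridSearchCV(pipe,
                    param_grid={'knn__n_neighbors': n_neighbors},
                    scoring=neg_rmse_score,
                    cv=5,
                    n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_, -grid.best_score_)  # best n_neighbors and its cross-validated RMSE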
Observing how performance progresses with the number of training samples can be used to estimate how much data is required. Generally speaking, the more complex the model, the more data it needs to perform well.
fig, ax = plt.subplots(figsize=(16, 9))
l_curve = LearningCurve(KNeighborsRegressor(n_neighbors=50),
train_sizes=np.arange(.1, 1.01, .1),
scoring=rmse_score,
cv=5,
n_jobs=-1,
ax=ax)
l_curve.fit(X, y)
l_curve.poof()
sns.despine()
fig.tight_layout();
Evaluate Features based on Mutual Information.
Please refer to the data creation notebook to prepare data.
We evaluate the features generated in Chapter 4 using mutual information.
DATA_STORE = "/home/tom/work/MLForTrading/Data/assets.h5"

with pd.HDFStore(DATA_STORE) as store:
    data = store['engineered_features']

dummy_data = pd.get_dummies(data,
                            columns=['year', 'month', 'msize', 'sector'],
                            prefix=['year', 'month', 'msize', ''],
                            prefix_sep=['_', '_', '_', ''])
dummy_data = dummy_data.rename(columns={c: c.replace('.0', '') for c in dummy_data.columns})
dummy_data.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 358914 entries, ('A', Timestamp('2001-01-31 00:00:00')) to ('ZUMZ', Timestamp('2018-02-28 00:00:00'))
Data columns (total 82 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 return_1m 358914 non-null float64
1 return_2m 358914 non-null float64
2 return_3m 358914 non-null float64
3 return_6m 358914 non-null float64
4 return_9m 358914 non-null float64
5 return_12m 358914 non-null float64
6 Mkt-RF 358914 non-null float64
7 SMB 358914 non-null float64
8 HML 358914 non-null float64
9 RMW 358914 non-null float64
10 CMA 358914 non-null float64
11 momentum_2 358914 non-null float64
12 momentum_3 358914 non-null float64
13 momentum_6 358914 non-null float64
14 momentum_9 358914 non-null float64
15 momentum_12 358914 non-null float64
16 momentum_3_12 358914 non-null float64
17 return_1m_t-1 357076 non-null float64
18 return_1m_t-2 355238 non-null float64
19 return_1m_t-3 353400 non-null float64
20 return_1m_t-4 351562 non-null float64
21 return_1m_t-5 349724 non-null float64
22 return_1m_t-6 347886 non-null float64
23 target_1m 358914 non-null float64
24 target_2m 357076 non-null float64
25 target_3m 355238 non-null float64
26 target_6m 349724 non-null float64
27 target_12m 338696 non-null float64
28 year_2001 358914 non-null uint8
29 year_2002 358914 non-null uint8
30 year_2003 358914 non-null uint8
31 year_2004 358914 non-null uint8
32 year_2005 358914 non-null uint8
33 year_2006 358914 non-null uint8
34 year_2007 358914 non-null uint8
35 year_2008 358914 non-null uint8
36 year_2009 358914 non-null uint8
37 year_2010 358914 non-null uint8
38 year_2011 358914 non-null uint8
39 year_2012 358914 non-null uint8
40 year_2013 358914 non-null uint8
41 year_2014 358914 non-null uint8
42 year_2015 358914 non-null uint8
43 year_2016 358914 non-null uint8
44 year_2017 358914 non-null uint8
45 year_2018 358914 non-null uint8
46 month_1 358914 non-null uint8
47 month_2 358914 non-null uint8
48 month_3 358914 non-null uint8
49 month_4 358914 non-null uint8
50 month_5 358914 non-null uint8
51 month_6 358914 non-null uint8
52 month_7 358914 non-null uint8
53 month_8 358914 non-null uint8
54 month_9 358914 non-null uint8
55 month_10 358914 non-null uint8
56 month_11 358914 non-null uint8
57 month_12 358914 non-null uint8
58 msize_-1 358914 non-null uint8
59 msize_1 358914 non-null uint8
60 msize_2 358914 non-null uint8
61 msize_3 358914 non-null uint8
62 msize_4 358914 non-null uint8
63 msize_5 358914 non-null uint8
64 msize_6 358914 non-null uint8
65 msize_7 358914 non-null uint8
66 msize_8 358914 non-null uint8
67 msize_9 358914 non-null uint8
68 msize_10 358914 non-null uint8
69 Basic Industries 358914 non-null uint8
70 Capital Goods 358914 non-null uint8
71 Consumer Durables 358914 non-null uint8
72 Consumer Non-Durables 358914 non-null uint8
73 Consumer Services 358914 non-null uint8
74 Energy 358914 non-null uint8
75 Finance 358914 non-null uint8
76 Health Care 358914 non-null uint8
77 Miscellaneous 358914 non-null uint8
78 Public Utilities 358914 non-null uint8
79 Technology 358914 non-null uint8
80 Transportation 358914 non-null uint8
81 Unknown 358914 non-null uint8
dtypes: float64(28), uint8(54)
memory usage: 96.6+ MB

target_labels = [f'target_{i}m' for i in [1, 2, 3, 6, 12]]
targets = data.dropna().loc[:, target_labels]
features = data.dropna().drop(target_labels, axis=1)
features.sector = pd.factorize(features.sector)[0]

cat_cols = ['year', 'month', 'msize', 'sector']

from sklearn.feature_selection import mutual_info_classif

discrete_features = [features.columns.get_loc(c) for c in cat_cols]

mutual_info = pd.DataFrame()
for label in target_labels:
    mi = mutual_info_classif(X=features,
                             y=(targets[label] > 0).astype(int),
                             discrete_features=discrete_features,
                             random_state=42)
    mutual_info[label] = pd.Series(mi, index=features.columns)

mutual_info.sum()

target_1m     0.034638
target_2m     0.057831
target_3m     0.090984
target_6m     0.132624
target_12m    0.198652
dtype: float64

fig, ax = plt.subplots(figsize=(15, 4))
sns.heatmap(mutual_info.div(mutual_info.sum()).T, ax=ax, cmap='Blues');
Normalizing mutual information by its column sums gives a comparable view across the different target variables. In the graph above, the year variable looks like the most important feature. Let's look closer by generating one-hot encoded variables.
target_labels = [f'target_{i}m' for i in [1, 2, 3, 6, 12]]
dummy_targets = dummy_data.dropna().loc[:, target_labels]
dummy_features = dummy_data.dropna().drop(target_labels, axis=1)

mutual_info_dummies = pd.DataFrame()
cat_cols = [c for c in dummy_features.columns if c not in features.columns]
discrete_features = [dummy_features.columns.get_loc(c) for c in cat_cols]
for label in target_labels:
    mi = mutual_info_classif(X=dummy_features,
                             y=(dummy_targets[label] > 0).astype(int),
                             discrete_features=discrete_features,
                             random_state=42)
    mutual_info_dummies[label] = pd.Series(mi, index=dummy_features.columns)

mutual_info_dummies.sum()

target_1m     0.035752
target_2m     0.059713
target_3m     0.093602
target_6m     0.136360
target_12m    0.202925
dtype: float64

fig, ax = plt.subplots(figsize=(4, 20))
sns.heatmap(mutual_info_dummies.div(mutual_info_dummies.sum()), ax=ax, cmap='Blues');
Although year_2008 looks like the most important feature, this is biased by the volatility in that year. When building a model, it is better to use features that do not let you classify which timestamp an observation comes from. One of the methods to check this is Adversarial Validation [6].
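A minimal sketch of the adversarial validation idea, with hypothetical train_features / test_features DataFrames split by date (this is only an illustration of the general technique, not the exact procedure from [6]): train a classifier to distinguish the two periods; an AUC near 0.5 means they are indistinguishable, and features with high importance are the time-dependent ones to reconsider.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# train_features / test_features are hypothetical feature matrices for the two periods
adv_X = pd.concat([train_features, test_features])
adv_y = np.r_[np.zeros(len(train_features)), np.ones(len(test_features))]

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
print(cross_val_score(clf, adv_X, adv_y, cv=5, scoring='roc_auc').mean())

importances = pd.Series(clf.fit(adv_X, adv_y).feature_importances_, index=adv_X.columns)
print(importances.sort_values(ascending=False).head())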
Bias-Variance Tradeoff
We look into the bias-variance tradeoff through a Taylor expansion of the sine function.
from scipy.special import factorial
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from collections import defaultdict
import numpy as np
from numpy.random import randint, choice, normal, shuffle
import pandas as pd

def f(x, max_degree=5):
    # Taylor expansion of sin(x) using odd powers below max_degree
    taylor = [(-1)**i * x ** e / factorial(e) for i, e in enumerate(range(1, max_degree, 2))]
    return np.sum(taylor, axis=0)
fig, ax = plt.subplots(figsize=(14, 5))
x = np.linspace(-5, 5, 1000)
data = pd.DataFrame({'y': f(x), 'x': x})
data.plot(x='x', y='y', legend=False, ax=ax)
pd.Series(np.sin(x), index=x).plot(ax=ax, ls='--', lw=2, label='sine')
plt.legend();
fig, axes = plt.subplots(ncols=3, figsize=(15, 4))
x = np.linspace(-.5 * np.pi, 2.5 * np.pi, 1000)
true_function = pd.Series(np.sin(x), index=x)
n = 30
noise = .2
degrees = [1, 5, 15]
x_ = np.random.choice(x, size=n)
y_ = np.sin(x_)
y_ += normal(loc=0, scale=np.std(y_) * noise, size=n)
mse = defaultdict(list)
for i, degree in enumerate(degrees):
    fit = np.poly1d(np.polyfit(x=x_, y=y_, deg=degree))
    true_function.plot(ax=axes[i], c='darkred', lw=1, ls='--', label='True Function')
    pd.Series(y_, index=x_).plot(style='.', label='Sample', ax=axes[i], c='k')
    pd.Series(fit(x), index=x).plot(label='Model', ax=axes[i])
    axes[i].set_ylim(-1.5, 1.5)
    mse = mean_squared_error(fit(x), np.sin(x))
    axes[i].set_title(f'Degree {degree} | MSE: {mse:,.2f}')
    axes[i].legend()
    axes[i].grid(False)
    axes[i].axis(False)

sns.despine()
fig.tight_layout();
A higher-order polynomial fits the given points well but is unstable outside the samples. You need something in the middle.
datasets = ['Train', 'Test']
X = {'Train': np.linspace(-1, 1, 1000), 'Test': np.linspace(1, 2, 500)}
models = {'Underfit': 1, 'Right Fit': 5, 'Overfit': 9}
sample, noise = 25, .01
result = []
for i in range(100):
    x_ = {d: choice(X[d], size=sample, replace=False) for d in datasets}
    y_ = {d: f(x_[d], max_degree=5) for d in datasets}
    y_['Train'] += normal(loc=0,
                          scale=np.std(y_['Train']) * noise,
                          size=sample)
    trained_models = {
        fit: np.poly1d(np.polyfit(x=x_['Train'], y=y_['Train'], deg=deg))
        for fit, deg in models.items()
    }
    for fit, model in trained_models.items():
        for dataset in datasets:
            pred = model(x_[dataset])
            result.append(
                pd.DataFrame(
                    dict(x=x_[dataset],
                         Model=fit,
                         Data=dataset,
                         y=pred,
                         Error=pred - y_[dataset])))
result = pd.concat(result)
y = {d: f(X[d], max_degree=5) for d in datasets}
y['Train_noise'] = y['Train'] + normal(loc=0,
scale=np.std(y['Train']) * noise,
size=len(y['Train']))
colors = {'Underfit': 'darkblue', 'Right Fit': 'darkgreen', 'Overfit': 'darkred'}
test_data = result[result.Data == 'Test']
fig, axes = plt.subplots(ncols=2, figsize=(18, 7), sharey=True)
sns.boxplot(x='Model', y='Error', hue='Data', data=result, ax=axes[0], linewidth=2)
axes[0].set_title('In- vs Out-of-Sample Errors')
axes[0].axhline(0, ls='--', lw=1, color='k')
axes[0].set_ylabel('Symmetric Log Scale')
for model in colors.keys():
    (test_data[(test_data['Model'] == model)]
     .plot.scatter(x='x',
                   y='y',
                   ax=axes[1],
                   s=2,
                   color=colors[model],
                   alpha=.5,
                   label=model))
# pd.Series(y['Train'], index=X['Train']).sort_index().plot(ax=axes[1], title='Out-of-sample Predictions')
pd.DataFrame(dict(x=X['Train'], y=y['Train_noise'])).plot.scatter(x='x', y='y', ax=axes[1], c='k', s=1)
pd.Series(y['Test'], index=X['Test']).plot(color='black', lw=5, ls='--', ax=axes[1], label='Actuals')
axes[0].set_yscale('symlog')
axes[1].set_title('Out-of-Sample Predictions')
axes[1].legend()
axes[0].grid(False)
axes[1].grid(False)
sns.despine()
fig.tight_layout()
fig.suptitle('Bias - Variance Tradeoff: Under vs. Overfitting', fontsize=16)
fig.subplots_adjust(top=0.9)
def folds(train, test, nfolds):
    shuffle(train)
    shuffle(test)
    steps = (np.array([len(train), len(test)]) / nfolds).astype(int)
    for fold in range(nfolds):
        i, j = fold * steps
        yield train[i:i + steps[0]], test[j:j + steps[1]]

def rmse(y, x, model):
    return np.sqrt(mean_squared_error(y_true=y, y_pred=model.predict(x)))

def create_poly_data(data, degree):
    return np.hstack([data.reshape(-1, 1) ** i for i in range(degree + 1)])

train_set = X['Train'] + normal(scale=np.std(f(X['Train']))) * .2
test_set = X['Test'].copy()

sample_sizes = np.arange(.1, 1.0, .01)
indices = ([len(train_set), len(test_set)] *
           sample_sizes.reshape(-1, 1)).astype(int)
result = []
lr = LinearRegression()

for label, degree in models.items():
    model_train = create_poly_data(train_set, degree)
    model_test = create_poly_data(test_set, degree)
    for train_idx, test_idx in indices:
        train = model_train[:train_idx]
        test = model_test[:test_idx]
        train_rmse, test_rmse = [], []
        for x_train, x_test in folds(train, test, 5):
            y_train, y_test = f(x_train[:, 1]), f(x_test[:, 1])
            lr.fit(X=x_train, y=y_train)
            train_rmse.append(rmse(y=y_train, x=x_train, model=lr))
            test_rmse.append(rmse(y=y_test, x=x_test, model=lr))
        result.append([label, train_idx,
                       np.mean(train_rmse), np.std(train_rmse),
                       np.mean(test_rmse), np.std(test_rmse)])

result = (pd.DataFrame(result,
                       columns=['Model', 'Train Size',
                                'Train RMSE', 'Train RMSE STD',
                                'Test RMSE', 'Test RMSE STD'])
          .set_index(['Model', 'Train Size']))

fig, axes = plt.subplots(nrows=3, sharey=True, figsize=(16, 9))
for i, model in enumerate(models.keys()):
    result.loc[model, ['Train RMSE', 'Test RMSE']].plot(ax=axes[i], title=f'Model: {model}', logy=True, lw=2)
    axes[i].set_ylabel('Log RMSE')

fig.suptitle('Learning Curves', fontsize=16)
fig.tight_layout()
sns.despine()
fig.subplots_adjust(top=.92);
Cross Validation
To select a proper complexity for the model, cross validation is one way to estimate performance for a given set of parameters. While there are many variations, the ones I frequently use are:
KFold is the basic cross validation method, used here without shuffling: since data points close in time share the same information, shuffling would inflate the measured performance, so I usually do not shuffle when sampling.
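For example, a plain sketch with sklearn's KFold keeping shuffle=False, where X_feat and y_target are hypothetical stand-ins for whatever feature matrix and target you are evaluating:

from sklearn.model_selection import KFold, cross_val_score

# shuffle=False keeps temporally adjacent observations in the same fold,
# which limits leakage from nearby samples that share information.
cv = KFold(n_splits=5, shuffle=False)
scores = cross_val_score(KNeighborsRegressor(n_neighbors=50), X_feat, y_target,
                         cv=cv, scoring=rmse_score)
print(scores.mean(), scores.std())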
CPCV [7] (Combinatorial Purged CV): the differences from KFold are (a simplified sketch of the purging step follows this list)
- using combinations of folds to increase the number of train/test trials
- purging observations at the fold margins to avoid data leakage
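The sketch below illustrates only the purging step in a simplified form; the combinatorial part and the exact embargo rules of [7] are omitted:

def purged_kfold_indices(n_samples, n_splits=5, purge=5):
    # Drop `purge` observations on each side of the test fold from the training
    # indices so that overlapping information cannot leak across the split.
    fold_size = n_samples // n_splits
    indices = np.arange(n_samples)
    for k in range(n_splits):
        test_start, test_end = k * fold_size, (k + 1) * fold_size
        test_idx = indices[test_start:test_end]
        train_mask = (indices < test_start - purge) | (indices >= test_end + purge)
        yield indices[train_mask], test_idx

# Usage: for train_idx, test_idx in purged_kfold_indices(len(X_feat)): ...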
Walkforward is the most realistic option and closest to how the model is used in deployment. The disadvantage is that each split lets you use less data than the other cross validation methods.
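A walk-forward split can be sketched with sklearn's TimeSeriesSplit, which always trains on the past and tests on the following block (again with the hypothetical X_feat / y_target from above):

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X_feat):
    # refit on everything seen so far, evaluate on the next block, as in deployment
    model = KNeighborsRegressor(n_neighbors=50).fit(X_feat.iloc[train_idx], y_target.iloc[train_idx])
    pred = model.predict(X_feat.iloc[test_idx])
    print(np.sqrt(mean_squared_error(y_target.iloc[test_idx], pred)))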