Q18 Machine Learning on a Rolling Basis
This example shows how to make a submission to the stock contest using machine learning and retraining.
You can clone and edit this example on the Quantiacs platform (tab Examples).
In this example we predict whether the price will rise or fall by using supervised learning (Bayesian Ridge Regression). This template represents a starting point for developing a system which can take part in the Q18 NASDAQ-100 Stock Long-Short contest.
It consists of two parts.
In the first part we simply perform a global training using all the time series data: we disregard the sequential nature of the data and also use future data when training on past data.
In the second part we use the built-in backtester and perform training and prediction on a rolling basis in order to avoid forward looking. Please note that we are using a specialized version of the Quantiacs backtester which dramatically speeds up the backtesting process by retraining your model on a regular basis.
Features for learning: we will use several technical indicators trying to capture different features. You can have a look at Technical Indicators.
Please note that:
Your trading algorithm can open short and long positions.
At each point in time your algorithm can trade all or a subset of the stocks which at that point of time are or were part of the NASDAQ-100 stock index. Note that the composition of this set changes in time, and Quantiacs provides you with an appropriate filter function for selecting them.
The Sharpe ratio of your system since January 1st, 2006, has to be larger than 1.
Your system cannot be a copy of the current examples. We run a correlation filter on the submissions and detect duplicates.
For simplicity we will use only two assets. It pays off to use more assets, ideally uncorrelated ones, and to diversify your positions for a more solid Sharpe ratio.
More details on the rules can be found here.
Need help? Check the Documentation and find solutions/report problems in the Forum section.
More help with Jupyter? Check the official Jupyter page.
Once you are done, click on Submit to the contest and take part in our competitions.
API reference:
data: check how to work with data;
backtesting: read how to run the simulation and check the results.
Need to use the optimizer function to automate tedious tasks?
- optimization: read more in our article.
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) { return false; }
// disable widget scrolling
import logging
import xarray as xr # xarray for data manipulation
import qnt.data as qndata # functions for loading data
import qnt.backtester as qnbt # built-in backtester
import qnt.ta as qnta # technical analysis library
import qnt.stats as qnstats # statistical functions
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.seterr(divide = "ignore")
from qnt.ta.macd import macd
from qnt.ta.rsi import rsi
from qnt.ta.stochastic import stochastic_k, stochastic, slow_stochastic
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.metrics import explained_variance_score
from sklearn.metrics import mean_absolute_error
# loading nasdaq-100 stock data
stock_data = qndata.stocks.load_ndx_data(tail = 365 * 5, assets = ["NAS:AAPL", "NAS:AMZN"])
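The cell above loads only two assets; for the contest one would normally load the complete NASDAQ-100 universe. The following sketch (the names full_data and index_members are ours) shows how this could look; the "is_liquid" field marks the assets which belong to the index at each point in time and can be used as the selection filter mentioned in the introduction.
# A sketch (not part of the original template): load the complete NASDAQ-100 universe.
# Omitting the `assets` argument returns every stock which is or was part of the index;
# the "is_liquid" field can then restrict positions to the current index members.
full_data = qndata.stocks.load_ndx_data(tail = 365 * 5)
index_members = full_data.sel(field="is_liquid")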
def get_features(data):
"""Builds the features used for learning:
* a trend indicator;
* the moving average convergence divergence;
* a volatility measure;
* the stochastic oscillator;
* the relative strength index;
* the logarithm of the closing price.
These features can be modified and new ones can be added easily.
"""
# trend:
trend = qnta.roc(qnta.lwma(data.sel(field="close"), 60), 1)
# moving average convergence divergence (MACD):
macd = qnta.macd(data.sel(field="close"))
macd2_line, macd2_signal, macd2_hist = qnta.macd(data, 12, 26, 9)
# volatility:
volatility = qnta.tr(data.sel(field="high"), data.sel(field="low"), data.sel(field="close"))
volatility = volatility / data.sel(field="close")
volatility = qnta.lwma(volatility, 14)
# the stochastic oscillator:
k, d = qnta.stochastic(data.sel(field="high"), data.sel(field="low"), data.sel(field="close"), 14)
# the relative strength index:
rsi = qnta.rsi(data.sel(field="close"))
# the logarithm of the closing price:
price = data.sel(field="close").ffill("time").bfill("time").fillna(0) # fill NaN
price = np.log(price)
# combine the six features:
result = xr.concat(
[trend, macd2_signal.sel(field="close"), volatility, d, rsi, price],
pd.Index(
["trend", "macd", "volatility", "stochastic_d", "rsi", "price"],
name = "field"
)
)
return result.transpose("time", "field", "asset")
# displaying the features:
my_features = get_features(stock_data)
display(my_features.sel(field="trend").to_pandas())
def get_target_classes(data):
""" Target classes for predicting if price goes up or down."""
price_current = data.sel(field="close")
price_future = qnta.shift(price_current, -1)
    class_positive = 1 # price goes up
class_negative = 0 # price goes down
target_price_up = xr.where(price_future > price_current, class_positive, class_negative)
return target_price_up
# displaying the target classes:
my_targetclass = get_target_classes(stock_data)
display(my_targetclass.to_pandas())
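The binary up/down target is only one possible choice. As an illustration (not part of the original template), a regression-style alternative would be to predict the next-day logarithmic return directly; switching to such a target would also call for adapting the evaluation metrics used below.
# An alternative target (illustrative sketch): the next-day logarithmic return
# instead of binary up/down classes.
def get_target_log_return(data):
    """Next-day log return used as a regression target."""
    price = data.sel(field="close")
    return np.log(qnta.shift(price, -1) / price)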
def get_model():
"""This is a constructor for the ML model (Bayesian Ridge) which can be easily
modified for using different models.
"""
model = linear_model.BayesianRidge()
return model
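Any scikit-learn estimator exposing fit and predict can be plugged in here. As an example (an illustrative sketch, not the template's choice), one could use a ridge classifier, which fits the binary targets directly:
# An alternative constructor (illustrative sketch): a ridge classifier instead of
# Bayesian Ridge regression; its predictions are hard 0/1 calls rather than scores.
from sklearn.linear_model import RidgeClassifier

def get_model_alternative():
    model = RidgeClassifier(alpha=1.0)
    return model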
# Create and train the models working on an asset-by-asset basis.
asset_name_all = stock_data.coords["asset"].values
models = dict()
for asset_name in asset_name_all:
# drop missing values:
target_cur = my_targetclass.sel(asset=asset_name).dropna("time", "any")
features_cur = my_features.sel(asset=asset_name).dropna("time", "any")
# align features and targets:
target_for_learn_df, feature_for_learn_df = xr.align(target_cur, features_cur, join="inner")
if len(features_cur.time) < 10:
# not enough points for training
continue
model = get_model()
try:
model.fit(feature_for_learn_df.values, target_for_learn_df)
models[asset_name] = model
except:
logging.exception("model training failed")
print(models)
# Showing which features are more important in predicting:
importance = models["NAS:AAPL"].coef_
importance
for i,v in enumerate(importance):
print('Feature: %0d, Score: %.5f' % (i,v))
plt.bar([x for x in range(len(importance))], importance)
plt.show()
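Since the coefficients are reported by index, it is convenient to map them back to the feature names defined in get_features (a small convenience sketch):
# Map the regression coefficients back to the feature names used in get_features:
feature_names = my_features.coords["field"].values
for name, coef in zip(feature_names, importance):
    print("%s: %.5f" % (name, coef))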
# Performs prediction and generates output weights:
asset_name_all = stock_data.coords["asset"].values
weights = xr.zeros_like(stock_data.sel(field="close"))
for asset_name in asset_name_all:
if asset_name in models:
model = models[asset_name]
features_all = my_features
features_cur = features_all.sel(asset=asset_name).dropna("time", "any")
if len(features_cur.time) < 1:
continue
try:
weights.loc[dict(asset=asset_name, time=features_cur.time.values)] = model.predict(features_cur.values)
except KeyboardInterrupt as e:
raise e
except:
logging.exception("model prediction failed")
print(weights)
def get_sharpe(stock_data, weights):
"""Calculates the Sharpe ratio"""
rr = qnstats.calc_relative_return(stock_data, weights)
sharpe = qnstats.calc_sharpe_ratio_annualized(rr).values[-1]
return sharpe
sharpe = get_sharpe(stock_data, weights)
sharpe
The Sharpe ratio obtained with the method above is inflated by forward looking: predictions for (let us say) 2017 already use knowledge of the relation between features and targets in 2020. Let us visualize the results:
import qnt.graph as qngraph
statistics = qnstats.calc_stat(stock_data, weights)
display(statistics.to_pandas().tail())
performance = statistics.to_pandas()["equity"]
qngraph.make_plot_filled(performance.index, performance, name="PnL (Equity)", type="log")
display(statistics[-1:].sel(field = ["sharpe_ratio"]).transpose().to_pandas())
# check for correlations with existing strategies:
qnstats.print_correlation(weights,stock_data)
"""R2 (coefficient of determination) regression score function."""
r2_score(my_targetclass, weights, multioutput="variance_weighted")
"""The explained variance score explains the dispersion of errors of a given dataset"""
explained_variance_score(my_targetclass, weights, multioutput="uniform_average")
"""The explained variance score explains the dispersion of errors of a given dataset"""
mean_absolute_error(my_targetclass, weights)
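Note that the scores above compare the binary targets with the continuous regression output. As a rough classification-style check (an illustrative sketch; the 0.5 threshold is an arbitrary choice of ours), the predictions can be thresholded and scored with plain accuracy:
# An illustrative check: threshold the continuous predictions at 0.5 and
# measure the plain accuracy against the binary targets.
from sklearn.metrics import accuracy_score
binary_predictions = xr.where(weights > 0.5, 1, 0)
accuracy_score(my_targetclass.values.flatten(), binary_predictions.values.flatten())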
Let us now use the Quantiacs backtester to avoid forward looking.
The backtester trains the model on one slice of data (using only data from the past) and predicts the weights for the following slice on a rolling basis; a simplified sketch of this scheme is shown after the two functions below:
def train_model(data):
"""Create and train the model working on an asset-by-asset basis."""
asset_name_all = data.coords["asset"].values
features_all = get_features(data)
target_all = get_target_classes(data)
models = dict()
for asset_name in asset_name_all:
# drop missing values:
target_cur = target_all.sel(asset=asset_name).dropna("time", "any")
features_cur = features_all.sel(asset=asset_name).dropna("time", "any")
target_for_learn_df, feature_for_learn_df = xr.align(target_cur, features_cur, join="inner")
if len(features_cur.time) < 10:
continue
model = get_model()
try:
model.fit(feature_for_learn_df.values, target_for_learn_df)
models[asset_name] = model
except:
logging.exception("model training failed")
return models
def predict_weights(models, data):
"""The model predicts if the price is going up or down.
The prediction is performed for several days in order to speed up the evaluation."""
asset_name_all = data.coords["asset"].values
weights = xr.zeros_like(data.sel(field="close"))
for asset_name in asset_name_all:
if asset_name in models:
model = models[asset_name]
features_all = get_features(data)
features_cur = features_all.sel(asset=asset_name).dropna("time", "any")
if len(features_cur.time) < 1:
continue
try:
weights.loc[dict(asset=asset_name, time=features_cur.time.values)] = model.predict(features_cur.values)
except KeyboardInterrupt as e:
raise e
except:
logging.exception("model prediction failed")
return weights
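The cell below is a simplified illustration of the rolling scheme (our own sketch, not the backtester internals): on every retraining date the models are re-fitted on the preceding window of data and then used to produce weights until the next retraining date. The real backtester additionally prepends a lookback_period of history so that the indicators in get_features are well-defined, and measures the intervals in calendar days rather than trading days.
# A simplified sketch of the rolling scheme automated by qnbt.backtest_ml
# (illustration only; intervals are counted in trading days here for brevity).
def rolling_backtest_sketch(data, train_days=2 * 365, retrain_every=250):
    weights = xr.zeros_like(data.sel(field="close"))
    times = data.time.values
    for start in range(0, len(times), retrain_every):
        retrain_date = times[start]
        # fit the models using only data preceding the retraining date:
        train_slice = data.sel(time=slice(retrain_date - np.timedelta64(train_days, "D"), retrain_date))
        models = train_model(train_slice)
        # predict the weights until the next retraining date:
        predict_slice = data.sel(time=slice(retrain_date, None)).isel(time=slice(0, retrain_every))
        predicted = predict_weights(models, predict_slice)
        weights.loc[dict(time=predicted.time.values)] = predicted.values
    return weights

# Example usage on the data loaded above (uncomment to run):
# weights_sketch = rolling_backtest_sketch(stock_data)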
# Calculate weights using the backtester:
weights = qnbt.backtest_ml(
train = train_model,
predict = predict_weights,
train_period = 2 *365, # the data length for training in calendar days
retrain_interval = 10 *365, # how often we have to retrain models (calendar days)
retrain_interval_after_submit = 1, # how often retrain models after submission during evaluation (calendar days)
predict_each_day = False, # Is it necessary to call prediction for every day during backtesting?
# Set it to True if you suspect that get_features is looking forward.
competition_type = "stocks_nasdaq100", # competition type
lookback_period = 365, # how many calendar days are needed by the predict function to generate the output
start_date = "2005-01-01", # backtest start date
analyze = True,
build_plots = True # do you need the chart?
)
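After the run it can be useful to inspect the weights returned by the backtester, for example the time span covered and the daily gross exposure (a quick sanity check, assuming the cell above completed):
# A quick sanity check on the weights returned by backtest_ml:
print(weights.time.values[[0, -1]])                 # first and last backtested dates
print(abs(weights).sum("asset").to_pandas().tail()) # gross exposure on the last days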