strategy

Machine Learning - LSTM - Multiple Features¶

This example shows how to use neural networks for writing a trading system on stocks.

You can clone and edit this example there (tab Examples).

This example utilizes a Long Short Term Memory (LSTM) Neural Network to predict whether the price will go up or down.

Important! Before further development, you need to run the ./init.py file once to install the PyTorch dependency.

Strategy Idea: We will go long on 'NAS:AAPL' based on the predictions of the LSTM NN, depending on how confident the NN is that the price is moving up.

Feature for Learning - logarithm of prices (close, open, high).

In [1]:

import xarray as xr
import qnt.data as qndata
import qnt.backtester as qnbt
import qnt.ta as qnta
import qnt.stats as qns
import qnt.graph as qngraph
import qnt.output as qnout
import numpy as np
import pandas as pd
import torch
from torch import nn, optim
import random

asset_name_all = ['NAS:AAPL']
lookback_period = 155
train_period = 100


class LSTM(nn.Module):
    """
    Class to define our LSTM network.
    """

    def __init__(self, input_dim=3, hidden_layers=64):
        super(LSTM, self).__init__()
        self.hidden_layers = hidden_layers
        self.lstm1 = nn.LSTMCell(input_dim, self.hidden_layers)
        self.lstm2 = nn.LSTMCell(self.hidden_layers, self.hidden_layers)
        self.linear = nn.Linear(self.hidden_layers, 1)

    def forward(self, y):
        outputs = []
        n_samples = y.size(0)
        h_t = torch.zeros(n_samples, self.hidden_layers, dtype=torch.float32)
        c_t = torch.zeros(n_samples, self.hidden_layers, dtype=torch.float32)
        h_t2 = torch.zeros(n_samples, self.hidden_layers, dtype=torch.float32)
        c_t2 = torch.zeros(n_samples, self.hidden_layers, dtype=torch.float32)

        for time_step in range(y.size(1)):
            x_t = y[:, time_step, :]  # Ensure x_t is [batch, input_dim]

            h_t, c_t = self.lstm1(x_t, (h_t, c_t))
            h_t2, c_t2 = self.lstm2(h_t, (h_t2, c_t2))
            output = self.linear(h_t2)
            outputs.append(output.unsqueeze(1))

        outputs = torch.cat(outputs, dim=1).squeeze(-1)
        return outputs


def get_model():
    def set_seed(seed_value=42):
        """Set seed for reproducibility."""
        random.seed(seed_value)
        np.random.seed(seed_value)
        torch.manual_seed(seed_value)
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)  # if you are using multi-GPU.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

    set_seed(42)
    model = LSTM(input_dim=3)
    return model


def get_features(data):
    close_price = data.sel(field="close").ffill('time').bfill('time').fillna(1)
    open_price = data.sel(field="open").ffill('time').bfill('time').fillna(1)
    high_price = data.sel(field="high").ffill('time').bfill('time').fillna(1)
    log_close = np.log(close_price)
    log_open = np.log(open_price)
    features = xr.concat([log_close, log_open, high_price], "feature")
    return features


def get_target_classes(data):
    price_current = data.sel(field='open')
    price_future = qnta.shift(price_current, -1)

    class_positive = 1  # prices goes up
    class_negative = 0  # price goes down

    target_price_up = xr.where(price_future > price_current, class_positive, class_negative)
    return target_price_up


def load_data(period):
    return qndata.stocks.load_ndx_data(tail=period, assets=asset_name_all)


def train_model(data):
    features_all = get_features(data)
    target_all = get_target_classes(data)
    models = dict()

    for asset_name in asset_name_all:
        model = get_model()
        target_cur = target_all.sel(asset=asset_name).dropna('time', 'any')
        features_cur = features_all.sel(asset=asset_name).dropna('time', 'any')
        target_for_learn_df, feature_for_learn_df = xr.align(target_cur, features_cur, join='inner')
        criterion = nn.MSELoss()
        optimiser = optim.LBFGS(model.parameters(), lr=0.08)
        epochs = 1
        for i in range(epochs):
            def closure():
                optimiser.zero_grad()
                feature_data = feature_for_learn_df.transpose('time', 'feature').values
                in_ = torch.tensor(feature_data, dtype=torch.float32).unsqueeze(0)
                out = model(in_)
                target = torch.zeros(1, len(target_for_learn_df.values))
                target[0, :] = torch.tensor(np.array(target_for_learn_df.values))
                loss = criterion(out, target)
                loss.backward()
                return loss

            optimiser.step(closure)
        models[asset_name] = model
    return models


def predict(models, data):
    weights = xr.zeros_like(data.sel(field='close'))
    for asset_name in asset_name_all:
        features_all = get_features(data)
        features_cur = features_all.sel(asset=asset_name).dropna('time', 'any')
        if len(features_cur.time) < 1:
            continue
        feature_data = features_cur.transpose('time', 'feature').values
        in_ = torch.tensor(feature_data, dtype=torch.float32).unsqueeze(0)
        out = models[asset_name](in_)
        prediction = out.detach()[0]
        weights.loc[dict(asset=asset_name, time=features_cur.time.values)] = prediction
    return weights

Multi-pass Version for Development and Testing Strategy¶

In [2]:

weights = qnbt.backtest_ml(
    load_data=load_data,
    train=train_model,
    predict=predict,
    train_period=train_period,
    retrain_interval=360,
    retrain_interval_after_submit=1,
    predict_each_day=False,
    competition_type='stocks_nasdaq100',
    lookback_period=lookback_period,
    start_date='2006-01-01',
    build_plots=True
)

Run the last iteration...

100% (367973 of 367973) |################| Elapsed Time: 0:00:00 Time:  0:00:00
100% (39443 of 39443) |##################| Elapsed Time: 0:00:00 Time:  0:00:00
100% (7804 of 7804) |####################| Elapsed Time: 0:00:00 Time:  0:00:00

fetched chunk 1/1 0s
Data loaded 0s

100% (3384 of 3384) |####################| Elapsed Time: 0:00:00 Time:  0:00:00

fetched chunk 1/1 0s
Data loaded 0s
Output cleaning...
fix uniq
ffill if the current price is None...
Check liquidity...
Ok.
Check missed dates...
Ok.
Normalization...
Output cleaning is complete.
Write output: /root/fractions.nc.gz
State saved.
---
Run First Iteration...

100% (7872 of 7872) |####################| Elapsed Time: 0:00:00 Time:  0:00:00

fetched chunk 1/1 0s
Data loaded 0s
---
Run all iterations...
Load data...

100% (325636 of 325636) |################| Elapsed Time: 0:00:00 Time:  0:00:00

fetched chunk 1/1 0s
Data loaded 0s

100% (313600 of 313600) |################| Elapsed Time: 0:00:00 Time:  0:00:00

fetched chunk 1/1 0s
Data loaded 0s
Backtest...

100% (39443 of 39443) |##################| Elapsed Time: 0:00:00 Time:  0:00:00
100% (13091584 of 13091584) |############| Elapsed Time: 0:00:00 Time:  0:00:00

fetched chunk 1/6 1s

100% (13094400 of 13094400) |############| Elapsed Time: 0:00:00 Time:  0:00:00

fetched chunk 2/6 2s

100% (13091552 of 13091552) |############| Elapsed Time: 0:00:00 Time:  0:00:00

fetched chunk 3/6 3s

100% (13091464 of 13091464) |############| Elapsed Time: 0:00:00 Time:  0:00:00

fetched chunk 4/6 4s

100% (13091464 of 13091464) |############| Elapsed Time: 0:00:00 Time:  0:00:00

fetched chunk 5/6 5s

100% (8635052 of 8635052) |##############| Elapsed Time: 0:00:00 Time:  0:00:00

fetched chunk 6/6 6s
Data loaded 6s
Output cleaning...
fix uniq
ffill if the current price is None...
Check liquidity...
Ok.
Check missed dates...
Ok.
Normalization...
Output cleaning is complete.
Write output: /root/fractions.nc.gz
State saved.
---
Analyze results...
Check...
Check liquidity...
Ok.
Check missed dates...
Ok.
Check the sharpe ratio...
Period: 2006-01-01 - 2024-04-17
Sharpe Ratio = 0.7934054396905317

ERROR! The Sharpe Ratio is too low. 0.7934054396905317 < 1
Improve the strategy and make sure that the in-sample Sharpe Ratio more than 1.

---
Align...
Calc global stats...
---
Calc stats per asset...
Build plots...
---
Output:

asset	NAS:AAL	NAS:AAPL	NAS:ABNB	NAS:ADBE	NAS:ADI	NAS:ADP	NAS:ADSK	NAS:AEP	NAS:AKAM	NAS:ALGN
time
2024-04-04	0.0	0.597755	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2024-04-05	0.0	0.597685	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2024-04-08	0.0	0.597571	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2024-04-09	0.0	0.597559	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2024-04-10	0.0	0.597480	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2024-04-11	0.0	0.597837	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2024-04-12	0.0	0.598206	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2024-04-15	0.0	0.598328	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2024-04-16	0.0	0.598239	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2024-04-17	0.0	0.597997	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

Stats:

field	equity	relative_return	volatility	underwater	max_drawdown	sharpe_ratio	mean_return	bias	instruments	avg_turnover	avg_holding_time
time
2024-04-04	9.093603	-0.002932	0.161648	-0.090087	-0.359607	0.796188	0.128702	1.0	1.0	0.007607	813.761536
2024-04-05	9.118065	0.002690	0.161631	-0.087639	-0.359607	0.797115	0.128838	1.0	1.0	0.007606	813.761536
2024-04-08	9.081673	-0.003991	0.161617	-0.091281	-0.359607	0.795469	0.128561	1.0	1.0	0.007605	813.777464
2024-04-09	9.120946	0.004324	0.161602	-0.087351	-0.359607	0.797012	0.128798	1.0	1.0	0.007604	813.803695
2024-04-10	9.060197	-0.006660	0.161593	-0.093429	-0.359607	0.794314	0.128355	1.0	1.0	0.007603	813.806337
2024-04-11	9.294687	0.025881	0.161684	-0.069966	-0.359607	0.803458	0.129906	1.0	1.0	0.007605	813.824498
2024-04-12	9.341877	0.005077	0.161670	-0.065244	-0.359607	0.805281	0.130190	1.0	1.0	0.007607	813.824498
2024-04-15	9.220010	-0.013045	0.161684	-0.077438	-0.359607	0.800003	0.129347	1.0	1.0	0.007608	813.824498
2024-04-16	9.114859	-0.011405	0.161690	-0.087960	-0.359607	0.795401	0.128609	1.0	1.0	0.007608	813.824498
2024-04-17	9.070252	-0.004894	0.161678	-0.092423	-0.359607	0.793405	0.128276	1.0	1.0	0.007607	918.838410

---

100% (4603 of 4603) |####################| Elapsed Time: 0:02:46 Time:  0:02:46

Single-pass Version for Participation in the Contest¶

Comment the code above and uncomment the code below.

def print_stats(data, weights):
    stats = qns.calc_stat(data, weights)
    display(stats.to_pandas().tail())
    performance = stats.to_pandas()["equity"]
    qngraph.make_plot_filled(performance.index, performance, name="PnL (Equity)", type="log")


data_train = load_data(train_period)
models = train_model(data_train)

data_predict = load_data(lookback_period)
weights_predict = predict(models, data_predict)

print_stats(data_predict, weights_predict)

qnout.write(weights_predict) # To participate in the competition, save this code in a separate cell.

An Example of How to Evaluate the Performance of a Machine Learning Model Over a Specific Time Period¶

data = qndata.stocks.load_ndx_data(min_date="2023-07-20", assets=asset_name_all)

models = train_model(data.sel(time=slice("2023-09-25", "2024-01-02")))
weights_slice = predict(models, data.sel(time=slice("2023-09-25", "2024-01-02")))

print_stats(data, weights_slice.sel(time=slice("2023-09-25", "2024-01-02")))

Machine Learning Model Strategy for Competitive Submissions

To enhance your machine learning-based strategy for competitive submissions, consider the following guidelines tailored for efficiency and robustness:

Model Retraining Frequency¶

Your configuration to retrain the model daily (retrain_interval_after_submit=1) after competition submission is noted. For a more streamlined approach, adjust your strategy to a single-pass mode, conducive to the competition's environment. Utilize the available precheck feature for a preliminary quality assessment of your model.

Acceleration Techniques¶

To expedite the development process, you might explore:

Model Simplification: Opt for less complex machine learning models to reduce computational demands.
Local Development Enhancements: Utilize a high-performance computer locally or deploy your script on a potent server for accelerated computations.
Data Volume Reduction: Limit the dataset size to hasten model training and evaluation.
Condensed Testing Phases: Shorten the evaluation timeframe by focusing on recent performance metrics, such as examining the model's financial outcomes over the past year.

Data Preparation and Feature Engineering¶

Pre-calculated Indicators: Employ pre-calculated technical indicators like Exponential Moving Averages (EMA) to enrich your features without the risk of lookahead bias. Example: g_ema = qnta.ema(data_all.sel(field="high"), 15) ensures indicators are prepared ahead of the model training phase.