Training a Model
In the last notebook, we learned how to write stock indicators in PyBroker. Indicators are a good starting point for developing a trading strategy. But to create a successful strategy, it is likely that a more sophisticated approach using predictive modeling will be needed.
Fortunately, one of the main features of PyBroker is the ability to train and backtest machine learning models. These models can utilize indicators as features to make more accurate predictions about market movements. Once trained, these models can be backtested using a popular technique known as Walkforward Analysis, which simulates how a strategy would perform during actual trading.
We’ll explain Walkforward Analysis more in depth later in this notebook. But first, let’s get started with some needed imports!
[1]:
import numpy as np
import pandas as pd
import pybroker
from numba import njit
from pybroker import Strategy, StrategyConfig, YFinance
As with DataSource and Indicator data, PyBroker can also cache trained models to disk. You can enable caching for all three by calling pybroker.enable_caches:
[2]:
pybroker.enable_caches('walkforward_strategy')
In the last notebook, we implemented an indicator that calculates the close-minus-moving-average (CMMA) using NumPy and Numba. Here’s the code for the CMMA indicator again:
[3]:
def cmma(bar_data, lookback):
@njit # Enable Numba JIT.
def vec_cmma(values):
# Initialize the result array.
n = len(values)
out = np.array([np.nan for _ in range(n)])
# For all bars starting at lookback:
for i in range(lookback, n):
# Calculate the moving average for the lookback.
ma = 0
for j in range(i - lookback, i):
ma += values[j]
ma /= lookback
# Subtract the moving average from value.
out[i] = values[i] - ma
return out
# Calculate for close prices.
return vec_cmma(bar_data.close)
cmma_20 = pybroker.indicator('cmma_20', cmma, lookback=20)
Train and Backtest
Next, we want to build a model that predicts the next day’s return using the 20-day CMMA. Using simple linear regression is a good approach to begin experimenting with. Below we import a LinearRegression model from scikit-learn:
[4]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
We create a train_slr
function to train the LinearRegression
model:
[5]:
def train_slr(symbol, train_data, test_data):
# Train
# Previous day close prices.
train_prev_close = train_data['close'].shift(1)
# Calculate daily returns.
train_daily_returns = (train_data['close'] - train_prev_close) / train_prev_close
# Predict next day's return.
train_data['pred'] = train_daily_returns.shift(-1)
train_data = train_data.dropna()
# Train the LinearRegession model to predict the next day's return
# given the 20-day CMMA.
X_train = train_data[['cmma_20']]
y_train = train_data[['pred']]
model = LinearRegression()
model.fit(X_train, y_train)
# Test
test_prev_close = test_data['close'].shift(1)
test_daily_returns = (test_data['close'] - test_prev_close) / test_prev_close
test_data['pred'] = test_daily_returns.shift(-1)
test_data = test_data.dropna()
X_test = test_data[['cmma_20']]
y_test = test_data[['pred']]
# Make predictions from test data.
y_pred = model.predict(X_test)
# Print goodness of fit.
r2 = r2_score(y_test, np.squeeze(y_pred))
print(symbol, f'R^2={r2}')
# Return the trained model and columns to use as input data.
return model, ['cmma_20']
The train_slr
function uses the 20-day CMMA as the input feature, or predictor, for the LinearRegression
model. The function then fits the LinearRegression
model to the training data for that stock symbol.
After fitting the model, the function uses the testing data to evaluate the model’s accuracy, specifically by computing the R-squared score. The R-squared score provides a measure of how well the LinearRegression
model fits the testing data.
The final output of the train_slr
function is the trained LinearRegression
model specifically for that stock symbol, along with the cmma_20
column, which is to be used as input data when making predictions. PyBroker will use this model to predict the next day’s return of the stock during the backtest. The train_slr
function will be called for each stock symbol, and the trained models will be used to predict the next day’s return for each individual stock.
Once the function to train the model has been defined, it needs to be registered with PyBroker. This is done by creating a new ModelSource instance using the pybroker.model function. The arguments to this function are the name of the model ('slr'
in this case), the function that will train the model
(train_slr
), and a list of indicators to use as inputs for the model (in this case, cmma_20
).
[6]:
model_slr = pybroker.model('slr', train_slr, indicators=[cmma_20])
To create a trading strategy that uses the trained model, a new Strategy object is created using the YFinance data source, and specifying the start and end dates for the backtest period.
[7]:
config = StrategyConfig(bootstrap_sample_size=100)
strategy = Strategy(YFinance(), '3/1/2017', '3/1/2022', config)
strategy.add_execution(None, ['NVDA', 'AMD'], models=model_slr)
The add_execution method is then called on the Strategy object to specify the details of the trading execution. In this case, a None
value is passed as the first argument, which means that no trading function will be used during the backtest.
The last step is to run the backtest by calling the backtest method on the Strategy
object, with a train_size
of 0.5
to specify that the model should be trained on the first half of the backtest data, and tested on the second half.
[8]:
strategy.backtest(train_size=0.5)
Backtesting: 2017-03-01 00:00:00 to 2022-03-01 00:00:00
Loading bar data...
[*********************100%***********************] 2 of 2 completed
Loaded bar data: 0:00:00
Computing indicators...
100% (2 of 2) |##########################| Elapsed Time: 0:00:01 Time: 0:00:01
Train split: 2017-03-01 00:00:00 to 2019-08-28 00:00:00
AMD R^2=-0.006808549721842416
NVDA R^2=-0.004416132743176426
Finished training models: 0:00:00
Finished backtest: 0:00:01
Walkforward Analysis
PyBroker employs a powerful algorithm known as Walkforward Analysis to perform backtesting. The algorithm partitions the backtest data into a fixed number of time windows, each containing a train-test split of data.
The Walkforward Analysis algorithm then proceeds to “walk forward” in time, in the same manner that a trading strategy would be executed in the real world. The model is first trained on the earliest window and then evaluated on the test data in that window.
As the algorithm moves forward to evaluate the next window in time, the test data from the previous window is added to the training data. This process continues until all of the time windows are evaluated.
By using this approach, the Walkforward Analysis algorithm is able to simulate the real-world performance of a trading strategy, and produce more reliable and accurate backtesting results.
Let’s consider a trading strategy that generates buy and sell signals from the LinearRegression model that we trained earlier. The strategy is implemented as the hold_long
function:
[9]:
def hold_long(ctx):
if not ctx.long_pos():
# Buy if the next bar is predicted to have a positive return:
if ctx.preds('slr')[-1] > 0:
ctx.buy_shares = 100
else:
# Sell if the next bar is predicted to have a negative return:
if ctx.preds('slr')[-1] < 0:
ctx.sell_shares = 100
strategy.clear_executions()
strategy.add_execution(hold_long, ['NVDA', 'AMD'], models=model_slr)
The hold_long
function opens a long position when the model predicts a positive return for the next bar, and then closes the position when the model predicts a negative return.
The ctx.preds(‘slr’) method is used to access the predictions made by the 'slr'
model for the current stock symbol being executed in the function (NVDA or AMD). The predictions are stored in a NumPy array, and the most recent prediction for the current stock symbol is accessed using ctx.preds('slr')[-1]
, which
is the model’s prediction of the next bar’s return.
Now that we have defined a trading strategy and registered the 'slr'
model, we can run the backtest using the Walkforward Analysis algorithm.
The backtest is run by calling the walkforward method on the Strategy
object, with the desired number of time windows and train/test split ratio. In this case, we will use 3 time windows, each with a 50/50 train-test split.
Additionally, since our 'slr'
model makes a prediction for one bar in the future, we need to specify the lookahead
parameter as 1
. This is necessary to ensure that training data does not leak into the test boundary. The lookahead
parameter should always be set to the number of bars in the future being predicted.
[10]:
result = strategy.walkforward(
warmup=20,
windows=3,
train_size=0.5,
lookahead=1,
calc_bootstrap=True
)
Backtesting: 2017-03-01 00:00:00 to 2022-03-01 00:00:00
Loaded cached bar data.
Loaded cached indicator data.
Train split: 2017-03-06 00:00:00 to 2018-06-01 00:00:00
AMD R^2=-0.007950114729117885
NVDA R^2=-0.04203364470839133
Finished training models: 0:00:00
Test split: 2018-06-04 00:00:00 to 2019-08-30 00:00:00
100% (314 of 314) |######################| Elapsed Time: 0:00:00 Time: 0:00:00
Train split: 2018-06-04 00:00:00 to 2019-08-30 00:00:00
AMD R^2=0.0006422677593683757
NVDA R^2=-0.023591728578221893
Finished training models: 0:00:00
Test split: 2019-09-03 00:00:00 to 2020-11-27 00:00:00
100% (314 of 314) |######################| Elapsed Time: 0:00:00 Time: 0:00:00
Train split: 2019-09-03 00:00:00 to 2020-11-27 00:00:00
AMD R^2=-0.015508227883924253
NVDA R^2=-0.4567200095787838
Finished training models: 0:00:00
Test split: 2020-11-30 00:00:00 to 2022-02-28 00:00:00
100% (314 of 314) |######################| Elapsed Time: 0:00:00 Time: 0:00:00
Calculating bootstrap metrics: sample_size=100, samples=10000...
Calculated bootstrap metrics: 0:00:03
Finished backtest: 0:00:04
During the backtesting process using the Walkforward Analysis algorithm, the 'slr'
model is trained on a given window’s training data, and then the hold_long
function runs on the same window’s test data.
The model is trained on the training data to make predictions about the next day’s returns. The hold_long
function then uses these predictions to make buy or sell decisions for the current day’s trading session.
By evaluating the performance of the trading strategy on the test data for each window, we can see how well the strategy is likely to perform in real-world trading conditions. This process is repeated for each time window in the backtest, using the results to evaluate the overall performance of the trading strategy:
[11]:
result.metrics_df
[11]:
name | value | |
---|---|---|
0 | trade_count | 43.000000 |
1 | initial_market_value | 100000.000000 |
2 | end_market_value | 109831.000000 |
3 | total_pnl | 12645.000000 |
4 | unrealized_pnl | -2814.000000 |
5 | total_return_pct | 12.645000 |
6 | total_profit | 20566.000000 |
7 | total_loss | -7921.000000 |
8 | total_fees | 0.000000 |
9 | max_drawdown | -14177.000000 |
10 | max_drawdown_pct | -12.272121 |
11 | win_rate | 76.744186 |
12 | loss_rate | 23.255814 |
13 | winning_trades | 33.000000 |
14 | losing_trades | 10.000000 |
15 | avg_pnl | 294.069767 |
16 | avg_return_pct | 5.267674 |
17 | avg_trade_bars | 25.488372 |
18 | avg_profit | 623.212121 |
19 | avg_profit_pct | 9.237576 |
20 | avg_winning_trade_bars | 19.151515 |
21 | avg_loss | -792.100000 |
22 | avg_loss_pct | -7.833000 |
23 | avg_losing_trade_bars | 46.400000 |
24 | largest_win | 2715.000000 |
25 | largest_win_pct | 9.320000 |
26 | largest_win_bars | 2.000000 |
27 | largest_loss | -5054.000000 |
28 | largest_loss_pct | -16.140000 |
29 | largest_loss_bars | 43.000000 |
30 | max_wins | 13.000000 |
31 | max_losses | 2.000000 |
32 | sharpe | 0.023425 |
33 | profit_factor | 1.094471 |
34 | ulcer_index | 1.177116 |
35 | upi | 0.009193 |
36 | equity_r2 | 0.772082 |
37 | std_error | 4191.846954 |
[12]:
result.bootstrap.conf_intervals
[12]:
lower | upper | ||
---|---|---|---|
name | conf | ||
Profit Factor | 97.5% | 0.259819 | 1.296660 |
95% | 0.303435 | 1.151299 | |
90% | 0.373167 | 1.002514 | |
Sharpe Ratio | 97.5% | -0.359565 | 0.050383 |
95% | -0.332180 | 0.018154 | |
90% | -0.276757 | -0.018004 |
[13]:
result.bootstrap.drawdown_conf
[13]:
amount | percent | |
---|---|---|
conf | ||
99.9% | -13917.50 | -12.190522 |
99% | -11058.25 | -9.693729 |
95% | -8380.25 | -7.480589 |
90% | -7129.00 | -6.403027 |
In summary, we have now completed the process of training and backtesting a linear regression model using PyBroker, with the help of Walkforward Analysis. The metrics that we have seen are based on the test data from all of the time windows in the backtest. Although our trading strategy needs to be improved, we have gained a good understanding of how to train and evaluate a model in PyBroker.
Please keep in mind that before conducting regression analysis, it is important to verify certain assumptions such as homoscedasticity, normality of residuals, etc. I have not provided the details for these assumptions here for the sake of brevity and recommend that you perform this exercise on your own.
We are also not limited to just building linear regression models in PyBroker. We can train other model types such as gradient boosted machines, neural networks, or any other architecture that we choose. This flexibility allows us to explore and experiment with various models and approaches to find the best performing model for our specific trading goals.
PyBroker also offers customization options, such as the ability to specify an input_data_fn for our model in case we need to customize how its input data is built. This would be required when constructing input for autoregressive models (i.e. ARMA or RNN) that use multiple past values to make predictions. Similarly, we can specify our own
predict_fn to customize how predictions are made (by default, the model’s predict
function is called).
With this knowledge, you can start building and testing your own models and trading strategies in PyBroker, and begin exploring the vast possibilities that this framework offers!