Getting Started with Data Sources

Welcome to PyBroker! The best place to start is to learn about DataSources. A DataSource is a class that can fetch data from external sources, which you can then use to backtest your trading strategies.

Yahoo Finance

One of the built-in DataSources in PyBroker is Yahoo Finance. To use it, you can import YFinance:

[1]:
from pybroker import YFinance

yfinance = YFinance()
df = yfinance.query(['AAPL', 'MSFT'], start_date='3/1/2021', end_date='3/1/2022')
df
Loading bar data...
[*********************100%***********************]  2 of 2 completed
Loaded bar data: 0:00:00

[1]:
date symbol open high low close volume adj_close
0 2021-03-01 AAPL 123.750000 127.930000 122.790001 127.790001 116307900 126.095642
1 2021-03-01 MSFT 235.899994 237.470001 233.149994 236.940002 25324000 232.234650
2 2021-03-02 AAPL 128.410004 128.720001 125.010002 125.120003 102260900 123.461052
3 2021-03-02 MSFT 237.009995 237.300003 233.449997 233.869995 22812500 229.225616
4 2021-03-03 AAPL 124.809998 125.709999 121.839996 122.059998 112966300 120.441605
... ... ... ... ... ... ... ... ...
501 2022-02-24 MSFT 272.510010 295.160004 271.519989 294.589996 56989700 291.091736
502 2022-02-25 AAPL 163.839996 165.119995 160.869995 164.850006 91974200 163.631073
503 2022-02-25 MSFT 295.140015 297.630005 291.649994 297.309998 32546700 293.779388
504 2022-02-28 AAPL 163.059998 165.419998 162.429993 165.119995 95056600 163.899078
505 2022-02-28 MSFT 294.309998 299.140015 293.000000 298.790009 34627500 295.241852

506 rows × 8 columns

The code above queries data for the AAPL and MSFT stocks and returns the results as a pandas DataFrame, with one row per symbol per date.
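Because the result is an ordinary pandas DataFrame, the usual pandas operations apply to it. The sketch below does not call PyBroker at all: it rebuilds the first four rows of the output above by hand (only the date, symbol, and close columns, with values rounded) and shows one way to select a single symbol and to reshape the closing prices into one column per symbol.

```python
import pandas as pd

# First four rows of the output above, retyped by hand (values rounded).
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-03-01", "2021-03-01", "2021-03-02", "2021-03-02"]
    ),
    "symbol": ["AAPL", "MSFT", "AAPL", "MSFT"],
    "close": [127.79, 236.94, 125.12, 233.87],
})

# Keep only the rows for one symbol.
aapl = df[df["symbol"] == "AAPL"]

# Reshape to one column of closing prices per symbol, indexed by date.
closes = df.pivot(index="date", columns="symbol", values="close")
```

The same two lines should work on the full DataFrame returned by the query, since it has the same date, symbol, and close columns.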

Caching Data

To speed up data retrieval, you can cache your queries using PyBroker’s caching system. Enable caching by calling pybroker.enable_data_source_cache('name'), where name is the name of the cache you want to use:

[2]:
import pybroker

pybroker.enable_data_source_cache('yfinance')
[2]:
<diskcache.core.Cache at 0x7fb0b1aaf0d0>

The next call to query will cache the returned data to disk. Each unique combination of ticker symbol and date range will be cached separately:

[3]:
yfinance.query(['TSLA', 'IBM'], '3/1/2021', '3/1/2022')
Loading bar data...
[*********************100%***********************]  2 of 2 completed
Loaded bar data: 0:00:00

[3]:
date symbol open high low close volume adj_close
0 2021-03-01 IBM 115.057358 116.940727 114.588913 115.430206 5977367 103.409409
1 2021-03-01 TSLA 230.036667 239.666672 228.350006 239.476669 81408600 239.476669
2 2021-03-02 IBM 115.430206 116.539200 114.971321 115.038239 4732418 103.058266
3 2021-03-02 TSLA 239.426666 240.369995 228.333328 228.813339 71196600 228.813339
4 2021-03-03 IBM 115.200768 117.237091 114.703636 116.978966 7744898 104.796883
... ... ... ... ... ... ... ... ...
501 2022-02-24 TSLA 233.463333 267.493347 233.333328 266.923340 135322200 266.923340
502 2022-02-25 IBM 122.050003 124.260002 121.449997 124.180000 4460900 116.693390
503 2022-02-25 TSLA 269.743347 273.166656 260.799988 269.956665 76067700 269.956665
504 2022-02-28 IBM 122.209999 123.389999 121.040001 122.510002 6757300 115.124062
505 2022-02-28 TSLA 271.670013 292.286682 271.570007 290.143341 99006900 290.143341

506 rows × 8 columns

Calling query again with the same ticker symbols and date range returns the cached data:

[4]:
df = yfinance.query(['TSLA', 'IBM'], '3/1/2021', '3/1/2022')
df
Loaded cached bar data.

[4]:
date symbol open high low close volume adj_close
0 2021-03-01 TSLA 230.036667 239.666672 228.350006 239.476669 81408600 239.476669
1 2021-03-02 TSLA 239.426666 240.369995 228.333328 228.813339 71196600 228.813339
2 2021-03-03 TSLA 229.330002 233.566666 217.236664 217.733337 90624000 217.733337
3 2021-03-04 TSLA 218.600006 222.816666 200.000000 207.146667 197758500 207.146667
4 2021-03-05 TSLA 208.686661 209.279999 179.830002 199.316666 268189500 199.316666
... ... ... ... ... ... ... ... ...
248 2022-02-22 IBM 124.199997 125.000000 122.680000 123.919998 5349700 116.449051
249 2022-02-23 IBM 124.379997 124.699997 121.870003 122.070000 4086400 114.710587
250 2022-02-24 IBM 120.000000 122.099998 118.809998 121.970001 6563200 114.616615
251 2022-02-25 IBM 122.050003 124.260002 121.449997 124.180000 4460900 116.693390
252 2022-02-28 IBM 122.209999 123.389999 121.040001 122.510002 6757300 115.124062

506 rows × 8 columns

You can clear your cache using pybroker.clear_data_source_cache:

[5]:
pybroker.clear_data_source_cache()

Or disable caching altogether using pybroker.disable_data_source_cache:

[6]:
pybroker.disable_data_source_cache()

Note that both of these calls should be made only after pybroker.enable_data_source_cache has been called.

Alpaca

PyBroker also includes an Alpaca DataSource for fetching stock data. To use it, you can import Alpaca and provide your API key and secret:

[7]:
from pybroker import Alpaca
import os

alpaca = Alpaca(os.environ['ALPACA_API_KEY'], os.environ['ALPACA_API_SECRET'])
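The cell above reads the credentials with os.environ[...], which raises a bare KeyError when a variable is missing. If you prefer a more descriptive failure, one option is a small helper like the one sketched below. require_env is a hypothetical name and is not part of PyBroker; the environment variable names are the ones used above.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Intended use (commented out because it needs pybroker and real credentials):
# alpaca = Alpaca(
#     require_env('ALPACA_API_KEY'),
#     require_env('ALPACA_API_SECRET')
# )
```

Reading the key and secret from the environment, as the cell above does, also keeps them out of the notebook itself.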

You can query Alpaca for stock data using the same syntax as with Yahoo Finance, but Alpaca also supports querying data at different timeframes. For example, to query 1-minute data:

[8]:
df = alpaca.query(
    ['AAPL', 'MSFT'],
    start_date='3/1/2021',
    end_date='4/1/2021',
    timeframe='1m'
)
df
Loading bar data...
Loaded bar data: 0:00:06

[8]:
date symbol open high low close volume vwap
0 2021-03-01 04:00:00-05:00 AAPL 124.30 124.56 124.30 124.50 12267.0 124.433365
1 2021-03-01 04:00:00-05:00 MSFT 235.87 236.00 235.87 236.00 1429.0 235.938887
2 2021-03-01 04:01:00-05:00 AAPL 124.56 124.60 124.30 124.30 9439.0 124.481323
3 2021-03-01 04:01:00-05:00 MSFT 236.17 236.17 236.17 236.17 104.0 236.161538
4 2021-03-01 04:02:00-05:00 AAPL 124.00 124.05 123.78 123.78 4834.0 123.935583
... ... ... ... ... ... ... ... ...
33859 2021-03-31 19:57:00-04:00 MSFT 237.28 237.28 237.28 237.28 507.0 237.367870
33860 2021-03-31 19:58:00-04:00 AAPL 122.36 122.39 122.33 122.39 3403.0 122.360544
33861 2021-03-31 19:58:00-04:00 MSFT 237.40 237.40 237.35 237.35 636.0 237.378066
33862 2021-03-31 19:59:00-04:00 AAPL 122.39 122.45 122.38 122.45 5560.0 122.402606
33863 2021-03-31 19:59:00-04:00 MSFT 237.40 237.53 237.40 237.53 1163.0 237.473801

33864 rows × 8 columns

Alpaca Crypto

If you are interested in fetching cryptocurrency data, you can use AlpacaCrypto. Here’s an example of how to use it:

[9]:
from pybroker import AlpacaCrypto

crypto = AlpacaCrypto(
    os.environ['ALPACA_API_KEY'],
    os.environ['ALPACA_API_SECRET']
)
df = crypto.query('BTC/USD', start_date='1/1/2021', end_date='2/1/2021', timeframe='1h')
df
Loading bar data...
Loaded bar data: 0:00:01

[9]:
symbol date open high low close volume vwap trade_count
0 BTC/USD 2020-12-31 19:00:00-05:00 28973.0 29073.5 28775.0 29065.0 3.4437 28968.839097 72.0
1 BTC/USD 2020-12-31 20:00:00-05:00 29070.0 29481.0 29038.5 29404.5 4.6183 29359.399487 65.0
2 BTC/USD 2020-12-31 21:00:00-05:00 29528.0 29528.0 29218.0 29245.0 4.3423 29361.540923 42.0
3 BTC/USD 2020-12-31 22:00:00-05:00 29400.5 29400.5 29337.0 29367.5 0.3089 29400.447394 3.0
4 BTC/USD 2020-12-31 23:00:00-05:00 29449.0 29449.0 29136.5 29189.5 2.0245 29302.743369 34.0
... ... ... ... ... ... ... ... ... ...
736 BTC/USD 2021-01-31 15:00:00-05:00 32754.0 32939.0 32499.0 32893.0 153.3498 32622.897675 98.0
737 BTC/USD 2021-01-31 16:00:00-05:00 32887.0 32887.0 32570.0 32600.0 4.4939 32740.567859 106.0
738 BTC/USD 2021-01-31 17:00:00-05:00 32642.0 33100.0 32642.0 32993.0 52.4213 32717.239656 81.0
739 BTC/USD 2021-01-31 18:00:00-05:00 33059.0 33177.0 33030.0 33089.0 1.4816 33124.196207 59.0
740 BTC/USD 2021-01-31 19:00:00-05:00 32658.0 32658.0 32269.0 32557.0 251.3362 32442.383578 128.0

741 rows × 9 columns

The example above queries hourly data for the BTC/USD currency pair.
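The two Alpaca examples pass timeframe='1m' for 1-minute bars and timeframe='1h' for hourly bars. As a rough illustration of what those strings mean, the sketch below converts them to datetime.timedelta values. timeframe_to_timedelta is a hypothetical helper, not a PyBroker function, and it deliberately handles only the two suffixes that appear in this notebook.

```python
import re
from datetime import timedelta

# Assumed mapping, based only on the two suffixes used in this notebook.
_UNITS = {"m": "minutes", "h": "hours"}


def timeframe_to_timedelta(timeframe: str) -> timedelta:
    """Convert a string such as '1m' or '1h' to a timedelta."""
    match = re.fullmatch(r"(\d+)([mh])", timeframe)
    if match is None:
        raise ValueError(f"Unsupported timeframe: {timeframe!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


one_minute = timeframe_to_timedelta("1m")
one_hour = timeframe_to_timedelta("1h")
```

This matches the spacing of the date column in the two outputs above: consecutive AAPL rows are one minute apart, and consecutive BTC/USD rows are one hour apart.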

AKShare

PyBroker also includes an AKShare DataSource for fetching Chinese stock data. AKShare is a widely used, free, open-source package for obtaining financial data, with a focus on the Chinese market, where it provides higher-quality data than yfinance. To use it, you can import AKShare:

[10]:
from pybroker.ext.data import AKShare

akshare = AKShare()
# You can substitute 000001.SZ with 000001, and it will still work!
df = akshare.query(['000001.SZ', '000002.SZ'], start_date='3/1/2021', end_date='3/1/2022')
df
Loading bar data...
Loaded bar data: 0:00:04

[10]:
date symbol open high low close volume
0 2021-03-01 000001.SZ 3640.44 3663.19 3581.93 3625.81 1125387
1 2021-03-01 000002.SZ 5123.19 5203.04 5017.15 5155.91 1280834
2 2021-03-02 000001.SZ 3653.44 3739.58 3594.93 3658.31 1473425
3 2021-03-02 000002.SZ 5110.10 5319.55 5097.00 5148.06 1220150
4 2021-03-03 000001.SZ 3646.94 3890.73 3627.43 3879.35 1919635
... ... ... ... ... ... ... ...
483 2022-02-25 000002.SZ 3565.36 3575.83 3499.90 3510.38 893970
484 2022-02-28 000001.SZ 2753.04 2756.29 2707.53 2728.66 723990
485 2022-02-28 000002.SZ 3506.45 3506.45 3439.69 3467.18 861954
486 2022-03-01 000001.SZ 2735.16 2761.16 2707.53 2756.29 935040
487 2022-03-01 000002.SZ 3468.49 3497.29 3454.09 3490.74 815806

488 rows × 7 columns
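The comment in the cell above says that 000001.SZ can be written as 000001. If you want to use one form consistently in your own code, a helper along these lines is one option. strip_exchange_suffix is a hypothetical name; it is not part of PyBroker or AKShare, and it simply drops everything from the first dot onward.

```python
def strip_exchange_suffix(symbol: str) -> str:
    """Drop an exchange suffix such as '.SZ'; other symbols are unchanged."""
    return symbol.split(".", 1)[0]


symbols = [strip_exchange_suffix(s) for s in ["000001.SZ", "000002.SZ"]]
```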

NOTE: If the above causes a "Native library not available" error and you still want to use AKShare, then see this issue for details on how to resolve it.

In the next notebook, we’ll take a look at how to use DataSources to backtest a simple trading strategy.