Getting Started with Data Sources

Welcome to PyBroker! The best place to start is to learn about DataSources. A DataSource is a class that can fetch data from external sources, which you can then use to backtest your trading strategies.

Yahoo Finance

One of the built-in DataSources in PyBroker is Yahoo Finance. To use it, you can import YFinance:

[1]:
from pybroker import YFinance

yfinance = YFinance()
df = yfinance.query(['AAPL', 'MSFT'], start_date='3/1/2021', end_date='3/1/2022')
df
Loading bar data...
[*********************100%%**********************]  2 of 2 completed
Loaded bar data: 0:00:00


[1]:
date symbol open high low close volume adj_close
0 2021-03-01 AAPL 123.750000 127.930000 122.790001 127.790001 116307900 125.599655
1 2021-03-01 MSFT 235.899994 237.470001 233.149994 236.940002 25324000 230.847702
2 2021-03-02 AAPL 128.410004 128.720001 125.010002 125.120003 102260900 122.975403
3 2021-03-02 MSFT 237.009995 237.300003 233.449997 233.869995 22812500 227.856628
4 2021-03-03 AAPL 124.809998 125.709999 121.839996 122.059998 112966300 119.967857
... ... ... ... ... ... ... ... ...
501 2022-02-24 MSFT 272.510010 295.160004 271.519989 294.589996 56989700 289.353271
502 2022-02-25 AAPL 163.839996 165.119995 160.869995 164.850006 91974200 162.987427
503 2022-02-25 MSFT 295.140015 297.630005 291.649994 297.309998 32546700 292.024872
504 2022-02-28 AAPL 163.059998 165.419998 162.429993 165.119995 95056600 163.254364
505 2022-02-28 MSFT 294.309998 299.140015 293.000000 298.790009 34627500 293.478607

506 rows × 8 columns

The code above queries daily price bars for AAPL and MSFT and returns a pandas DataFrame with one row per symbol per trading day.
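
The date strings passed to query are read month-first, which the first returned row (2021-03-01) confirms. If that ordering feels ambiguous, one option is to build the dates explicitly. The sketch below only uses the standard library; the commented-out call assumes that query also accepts datetime objects, which is worth confirming against the PyBroker API reference before relying on it.

```python
from datetime import datetime

# '3/1/2021' in month/day/year order is 1 March 2021,
# which matches the first row of the result above.
start = datetime.strptime('3/1/2021', '%m/%d/%Y')
end = datetime.strptime('3/1/2022', '%m/%d/%Y')
print(start.date(), end.date())

# Assumption: query accepts datetime objects as well as strings.
# df = yfinance.query(['AAPL', 'MSFT'], start_date=start, end_date=end)
```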

Caching Data

If you want to speed up your data retrieval, you can cache your queries using PyBroker’s caching system. To enable caching, call pybroker.enable_data_source_cache('name'), where name is the name of the cache you want to use:

[2]:
import pybroker

pybroker.enable_data_source_cache('yfinance')
[2]:
<diskcache.core.Cache at 0x7f3884390d60>

The next call to query will cache the returned data to disk. Each unique combination of ticker symbol and date range will be cached separately:

[3]:
yfinance.query(['TSLA', 'IBM'], '3/1/2021', '3/1/2022')
Loading bar data...
[*********************100%%**********************]  2 of 2 completed
Loaded bar data: 0:00:00


[3]:
date symbol open high low close volume adj_close
0 2021-03-01 IBM 115.057358 116.940727 114.588913 115.430206 5977367 100.173241
1 2021-03-01 TSLA 230.036667 239.666672 228.350006 239.476669 81408600 239.476669
2 2021-03-02 IBM 115.430206 116.539200 114.971321 115.038239 4732418 99.833076
3 2021-03-02 TSLA 239.426666 240.369995 228.333328 228.813339 71196600 228.813339
4 2021-03-03 IBM 115.200768 117.237091 114.703636 116.978966 7744898 101.517288
... ... ... ... ... ... ... ... ...
501 2022-02-24 TSLA 233.463333 267.493347 233.333328 266.923340 135322200 266.923340
502 2022-02-25 IBM 122.050003 124.260002 121.449997 124.180000 4460900 113.041489
503 2022-02-25 TSLA 269.743347 273.166656 260.799988 269.956665 76067700 269.956665
504 2022-02-28 IBM 122.209999 123.389999 121.040001 122.510002 6757300 111.521271
505 2022-02-28 TSLA 271.670013 292.286682 271.570007 290.143341 99006900 290.143341

506 rows × 8 columns

Calling query again with the same ticker symbols and date range returns the cached data:

[4]:
df = yfinance.query(['TSLA', 'IBM'], '3/1/2021', '3/1/2022')
df
Loaded cached bar data.

[4]:
date symbol open high low close volume adj_close
0 2021-03-01 IBM 115.057358 116.940727 114.588913 115.430206 5977367 100.173241
1 2021-03-02 IBM 115.430206 116.539200 114.971321 115.038239 4732418 99.833076
2 2021-03-03 IBM 115.200768 117.237091 114.703636 116.978966 7744898 101.517288
3 2021-03-04 IBM 116.634796 117.801147 113.537285 114.827919 8439651 99.650551
4 2021-03-05 IBM 115.334610 118.307838 114.961761 117.428299 7268968 101.907227
... ... ... ... ... ... ... ... ...
248 2022-02-22 TSLA 278.043335 285.576660 267.033325 273.843323 83288100 273.843323
249 2022-02-23 TSLA 276.809998 278.433319 253.520004 254.679993 95256900 254.679993
250 2022-02-24 TSLA 233.463333 267.493347 233.333328 266.923340 135322200 266.923340
251 2022-02-25 TSLA 269.743347 273.166656 260.799988 269.956665 76067700 269.956665
252 2022-02-28 TSLA 271.670013 292.286682 271.570007 290.143341 99006900 290.143341

506 rows × 8 columns
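
To make the caching behavior concrete, here is a toy in-memory sketch of the idea that each combination of ticker symbol and date range gets its own cache entry. It is an illustration only and not PyBroker's implementation: the real cache is persisted to disk (the output of enable_data_source_cache above shows a diskcache Cache object), and the names ToyQueryCache and fake_fetch are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

CacheKey = Tuple[str, str, str]  # (symbol, start_date, end_date)


class ToyQueryCache:
    """Toy in-memory cache; hypothetical, not PyBroker's implementation."""

    def __init__(self, fetch: Callable[[str, str, str], List[dict]]) -> None:
        self._fetch = fetch
        self._store: Dict[CacheKey, List[dict]] = {}
        self.fetch_count = 0  # number of times the underlying fetch ran

    def query(self, symbols: List[str], start_date: str, end_date: str) -> List[dict]:
        rows: List[dict] = []
        for symbol in symbols:
            key = (symbol, start_date, end_date)
            if key not in self._store:
                # Cache miss: fetch once and remember the result under this key.
                self.fetch_count += 1
                self._store[key] = self._fetch(symbol, start_date, end_date)
            rows.extend(self._store[key])
        return rows


def fake_fetch(symbol: str, start_date: str, end_date: str) -> List[dict]:
    # Stand-in for a network call; returns a single placeholder row.
    return [{"symbol": symbol, "start": start_date, "end": end_date}]


cache = ToyQueryCache(fake_fetch)
cache.query(["TSLA", "IBM"], "3/1/2021", "3/1/2022")  # two fetches
cache.query(["TSLA", "IBM"], "3/1/2021", "3/1/2022")  # served from the cache
cache.query(["TSLA"], "3/1/2021", "4/1/2021")  # new date range: one more fetch
print(cache.fetch_count)  # 3
```

Because the key includes the date range, changing either date for a symbol that is already cached still triggers a fresh fetch.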

You can clear your cache using pybroker.clear_data_source_cache:

[5]:
pybroker.clear_data_source_cache()

Or disable caching altogether using pybroker.disable_data_source_cache:

[6]:
pybroker.disable_data_source_cache()

Note that both functions should be called only after pybroker.enable_data_source_cache has been called.

Alpaca

PyBroker also includes an Alpaca DataSource for fetching stock data. To use it, you can import Alpaca and provide your API key and secret:

[7]:
from pybroker import Alpaca
import os

alpaca = Alpaca(os.environ['ALPACA_API_KEY'], os.environ['ALPACA_API_SECRET'])
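
Indexing os.environ directly raises a bare KeyError when a variable is missing. As an optional convenience, a small helper can give a clearer message. The function name require_env is hypothetical and not part of PyBroker; the commented-out line shows how it might be used with the Alpaca constructor from the cell above.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Set the {name} environment variable before creating the data source."
        )
    return value


# Hypothetical usage with the constructor shown above:
# alpaca = Alpaca(require_env('ALPACA_API_KEY'), require_env('ALPACA_API_SECRET'))
```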

You can query Alpaca for stock data using the same syntax as with Yahoo Finance. Alpaca also supports a timeframe argument for querying bars at different intervals. For example, to query 1-minute data:

[8]:
df = alpaca.query(
    ['AAPL', 'MSFT'],
    start_date='3/1/2021',
    end_date='4/1/2021',
    timeframe='1m'
)
df
Loading bar data...
Loaded bar data: 0:00:05

[8]:
date symbol open high low close volume vwap
0 2021-03-01 04:00:00-05:00 AAPL 124.30 124.56 124.30 124.50 12267.0 124.433365
1 2021-03-01 04:00:00-05:00 MSFT 235.87 236.00 235.87 236.00 1429.0 235.938887
2 2021-03-01 04:01:00-05:00 AAPL 124.56 124.60 124.30 124.30 9439.0 124.481323
3 2021-03-01 04:01:00-05:00 MSFT 236.17 236.17 236.17 236.17 104.0 236.161538
4 2021-03-01 04:02:00-05:00 AAPL 124.00 124.05 123.78 123.78 4834.0 123.935583
... ... ... ... ... ... ... ... ...
33340 2021-03-31 19:57:00-04:00 MSFT 237.28 237.28 237.28 237.28 507.0 237.367870
33341 2021-03-31 19:58:00-04:00 AAPL 122.36 122.39 122.33 122.39 3403.0 122.360544
33342 2021-03-31 19:58:00-04:00 MSFT 237.40 237.40 237.35 237.35 636.0 237.378066
33343 2021-03-31 19:59:00-04:00 AAPL 122.39 122.45 122.38 122.45 5560.0 122.402606
33344 2021-03-31 19:59:00-04:00 MSFT 237.40 237.53 237.40 237.53 1163.0 237.473801

33345 rows × 8 columns
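
The timeframe argument is a short string made of an amount followed by a unit. As a rough illustration of that format, the hypothetical helper below converts the values that appear in this notebook ('1m', '1h', '1d', '1w') to seconds. It is not PyBroker's own parser, and the library may accept more units or combined values than this sketch handles.

```python
import re

# Only the unit suffixes that appear in this notebook; the library may support more.
_UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400, "w": 604800}


def timeframe_to_seconds(timeframe: str) -> int:
    """Convert a string such as '1m' or '1h' to a number of seconds."""
    match = re.fullmatch(r"(\d+)([mhdw])", timeframe.strip())
    if match is None:
        raise ValueError(f"Unrecognized timeframe: {timeframe!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


print(timeframe_to_seconds("1m"))  # 60
print(timeframe_to_seconds("1h"))  # 3600
print(timeframe_to_seconds("1d"))  # 86400
```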

Alpaca Crypto

If you are interested in fetching cryptocurrency data, you can use AlpacaCrypto. Here’s an example of how to use it:

[9]:
from pybroker import AlpacaCrypto

crypto = AlpacaCrypto(
    os.environ['ALPACA_API_KEY'],
    os.environ['ALPACA_API_SECRET']
)
df = crypto.query('BTC/USD', start_date='1/1/2021', end_date='2/1/2021', timeframe='1h')
df
Loading bar data...
Loaded bar data: 0:00:06

[9]:
symbol date open high low close volume vwap trade_count
0 BTC/USD 2021-01-01 01:00:00-05:00 29255.71 29338.25 29153.55 29234.15 42.244289 29237.240312 1243.0
1 BTC/USD 2021-01-01 02:00:00-05:00 29235.61 29236.95 28905.00 29162.50 34.506038 29078.423185 1070.0
2 BTC/USD 2021-01-01 03:00:00-05:00 29162.50 29248.52 28948.86 29076.77 27.596804 29091.465155 1110.0
3 BTC/USD 2021-01-01 04:00:00-05:00 29075.31 29372.32 29058.05 29284.92 20.694200 29248.730924 880.0
4 BTC/USD 2021-01-01 05:00:00-05:00 29291.54 29400.00 29232.16 29286.63 16.617646 29338.609132 742.0
... ... ... ... ... ... ... ... ... ...
735 BTC/USD 2021-01-31 15:00:00-05:00 32837.67 32964.87 32528.54 32882.87 40.631122 32818.132855 2197.0
736 BTC/USD 2021-01-31 16:00:00-05:00 32889.01 32935.98 32554.59 32586.68 26.673190 32737.975296 1625.0
737 BTC/USD 2021-01-31 17:00:00-05:00 32599.00 33126.32 32599.00 32998.35 25.422568 32923.438893 1770.0
738 BTC/USD 2021-01-31 18:00:00-05:00 33000.00 33263.94 32957.10 33134.86 31.072017 33147.086803 2203.0
739 BTC/USD 2021-01-31 19:00:00-05:00 33134.03 33134.03 32303.44 32572.03 60.460424 32552.937863 2665.0

740 rows × 9 columns

In the above example, we’re querying for hourly data for the BTC/USD currency pair.

AKShare

PyBroker also includes an AKShare DataSource for fetching Chinese stock data. AKShare is a widely used open-source package for obtaining financial data, with a focus on the Chinese market. It is free, and for Chinese stocks it can provide higher-quality data than yfinance. To use it, you can import AKShare:

[10]:
from pybroker.ext.data import AKShare

akshare = AKShare()
# The exchange suffix is optional: '000001' works the same as '000001.SZ'.
# Dates can also be given in '20210301' format.
# Set adjust to 'qfq' (forward-adjusted) or 'hfq' (backward-adjusted)
# to adjust the data, and timeframe to '1d' or '1w' for daily or weekly data.
df = akshare.query(
    symbols=['000001.SZ', '600000.SH'],
    start_date='3/1/2021',
    end_date='3/1/2023',
    adjust="",
    timeframe="1d",
)
df
Loading bar data...
Loaded bar data: 0:00:10

[10]:
date symbol open high low close volume
0 2021-03-01 000001.SZ 21.54 21.68 21.18 21.45 1125387
1 2021-03-01 600000.SH 10.59 10.64 10.50 10.58 547461
2 2021-03-02 000001.SZ 21.62 22.15 21.26 21.65 1473425
3 2021-03-02 600000.SH 10.61 10.70 10.36 10.47 747631
4 2021-03-03 000001.SZ 21.58 23.08 21.46 23.01 1919635
... ... ... ... ... ... ... ...
969 2023-02-27 600000.SH 7.16 7.20 7.16 7.16 158006
970 2023-02-28 000001.SZ 13.75 13.85 13.61 13.78 607936
971 2023-02-28 600000.SH 7.18 7.20 7.14 7.18 174481
972 2023-03-01 000001.SZ 13.80 14.19 13.74 14.17 1223452
973 2023-03-01 600000.SH 7.17 7.27 7.17 7.26 256613

974 rows × 7 columns

NOTE: If the above causes a "Native library not available" error and you still want to use AKShare, then see this issue for details on how to resolve it.
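
The comments in the AKShare cell mention two equivalent input forms: symbols with or without the exchange suffix, and dates in '20210301' format. The helpers below sketch how one form maps to the other using only the standard library. Both function names are hypothetical and not part of PyBroker or AKShare.

```python
from datetime import datetime


def to_compact_date(date_str: str) -> str:
    """Convert a month/day/year string such as '3/1/2021' to 'YYYYMMDD'."""
    return datetime.strptime(date_str, "%m/%d/%Y").strftime("%Y%m%d")


def strip_exchange_suffix(symbol: str) -> str:
    """Drop a trailing exchange suffix such as '.SZ' or '.SH', if present."""
    return symbol.split(".", 1)[0]


print(to_compact_date("3/1/2021"))         # 20210301
print(strip_exchange_suffix("000001.SZ"))  # 000001
print(strip_exchange_suffix("600000"))     # 600000
```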

In the next notebook, we’ll take a look at how to use DataSources to backtest a simple trading strategy.