Zipline Data Bundles

Data Bundles

A data bundle is a collection of pricing data, adjustment data, and an asset database. Bundles allow us to preload all of the data we will need to run backtests and store the data for future runs.

Discovering Available Bundles

Zipline comes with a few bundles by default as well as the ability to register new bundles. To see which bundles we have available, we may run the bundles command, for example:

$ zipline bundles
my-custom-bundle 2016-05-05 20:35:19.809398
my-custom-bundle 2016-05-05 20:34:53.654082
my-custom-bundle 2016-05-05 20:34:48.401767
quandl <no ingestions>
quantopian-quandl 2016-05-05 20:06:40.894956

The output here shows that there are 3 bundles available:

  • my-custom-bundle (added by the user)
  • quandl (provided by zipline)
  • quantopian-quandl (provided by zipline)

The dates and times next to the name show the times when the data for this bundle was ingested. We have run three different ingestions for my-custom-bundle. We have never ingested any data for the quandl bundle so it just shows <no ingestions> instead. Finally, there is only one ingestion for quantopian-quandl.

Ingesting Data

The first step to using a data bundle is to ingest the data. The ingestion process will invoke some custom bundle command and then write the data to a standard location that zipline can find. By default the location where ingested data will be written is $ZIPLINE_ROOT/data/<bundle> where by default ZIPLINE_ROOT=~/.zipline. The ingestion step may take some time as it could involve downloading and processing a lot of data. You'll need a Quandl API key to ingest the default bundle. Ingestion can be run with:

$ QUANDL_API_KEY=<yourkey> zipline ingest [-b <bundle>]

where <bundle> is the name of the bundle to ingest, defaulting to quandl.
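
The location rule above can be sketched in a few lines of Python. This is only an illustration of the described layout, not zipline's own code: bundle_data_dir and list_ingestions are hypothetical helper names, and sorting the subdirectories by name is an assumption about how ingestion directories are labelled.

```python
import os
from pathlib import Path


def bundle_data_dir(bundle, environ=os.environ):
    """Return $ZIPLINE_ROOT/data/<bundle>, with ZIPLINE_ROOT defaulting to ~/.zipline."""
    root = environ.get("ZIPLINE_ROOT") or os.path.join(
        os.path.expanduser("~"), ".zipline"
    )
    return Path(root) / "data" / bundle


def list_ingestions(bundle, environ=os.environ):
    """List the ingestion subdirectory names of a bundle (hypothetical helper).

    Returns an empty list when the bundle has never been ingested.
    Sorting by name is an assumption, not documented zipline behavior.
    """
    data_dir = bundle_data_dir(bundle, environ)
    if not data_dir.is_dir():
        return []
    return sorted(p.name for p in data_dir.iterdir() if p.is_dir())
```

Calling bundle_data_dir('quandl') with no ZIPLINE_ROOT set points at ~/.zipline/data/quandl.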

Old Data

When the ingest command is used it will write the new data to a subdirectory of $ZIPLINE_ROOT/data/<bundle> which is named with the current date. This makes it possible to look at older data or even run backtests with the older copies. Running a backtest with an old ingestion makes it easier to reproduce backtest results later.

One drawback of saving all of the data by default is that the data directory may grow quite large even if you do not want to use the data. As shown earlier, we can list all of the ingestions with the bundles command. To solve the problem of leaking old data there is another command: clean, which will clear data bundles based on some time constraints.

For example:

# clean everything older than <date>
$ zipline clean [-b <bundle>] --before <date>

# clean everything newer than <date>
$ zipline clean [-b <bundle>] --after <date>

# keep everything in the range of [before, after] and delete the rest
$ zipline clean [-b <bundle>] --before <date> --after <after>

# clean all but the last <int> runs
$ zipline clean [-b <bundle>] --keep-last <int>
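
To make the four variants concrete, here is a minimal Python sketch of which ingestions each option would select for deletion. It models only the rules stated in the comments above; select_for_removal is a hypothetical name, and rejecting keep_last together with before/after is an assumption rather than something this text specifies.

```python
from datetime import datetime


def select_for_removal(ingestions, before=None, after=None, keep_last=None):
    """Return the ingestion timestamps a clean with these options would delete.

    A sketch of the documented rules, not zipline's implementation.
    """
    if keep_last is not None:
        # Assumption: keep_last is not combined with the date options.
        if before is not None or after is not None:
            raise ValueError("keep_last cannot be combined with before/after")
        if keep_last < 0:
            raise ValueError("keep_last must be non-negative")
        ordered = sorted(ingestions)
        # Everything except the newest keep_last runs is removed.
        return ordered[: max(len(ordered) - keep_last, 0)]
    # With both dates given, only the range [before, after] survives.
    return sorted(
        ts
        for ts in ingestions
        if (before is not None and ts < before)
        or (after is not None and ts > after)
    )
```

With the three my-custom-bundle runs listed earlier, keep_last=1 would select the two older runs for deletion.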

Running Backtests with Data Bundles

Now that the data has been ingested we can use it to run backtests with the run command. The bundle to use can be specified with the --bundle option like:

$ zipline run --bundle <bundle> --algofile algo.py ...

We may also specify the date to use to look up the bundle data with the --bundle-timestamp option. Setting the --bundle-timestamp will cause run to use the most recent bundle ingestion that is less than or equal to the bundle-timestamp. This is how we can run backtests with older data. bundle-timestamp uses a less-than-or-equal-to relationship so that we can specify the date that we ran an old backtest and get the same data that would have been available to us on that date. The bundle-timestamp defaults to the current day to use the most recent data.
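
The less-than-or-equal-to lookup can be illustrated with a short Python sketch. It is not zipline's code: most_recent_ingestion is a hypothetical helper that works on plain datetime values, and raising ValueError when nothing qualifies is an assumption.

```python
from bisect import bisect_right
from datetime import datetime


def most_recent_ingestion(ingestions, bundle_timestamp):
    """Return the newest ingestion time that is <= bundle_timestamp."""
    ordered = sorted(ingestions)
    # bisect_right counts how many ingestions are <= bundle_timestamp.
    idx = bisect_right(ordered, bundle_timestamp)
    if idx == 0:
        # Assumption: having no ingestion old enough is treated as an error.
        raise ValueError("no ingestion at or before the requested timestamp")
    return ordered[idx - 1]
```

Passing the date of an old backtest as bundle_timestamp therefore selects the ingestion that was the latest one available on that date.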

Default Data Bundles

Quandl WIKI Bundle

By default zipline comes with the quandl data bundle which uses quandl’s WIKI dataset. The quandl data bundle includes daily pricing data, splits, cash dividends, and asset metadata. To ingest the quandl data bundle we recommend creating an account on quandl.com to get an API key to be able to make more API requests per day. Once we have an API key we may run:

$ QUANDL_API_KEY=<api-key> zipline ingest -b quandl

though we may still run ingest as an anonymous quandl user (with no API key). We may also set the QUANDL_DOWNLOAD_ATTEMPTS environment variable to an integer giving the number of attempts that should be made to download data from quandl's servers. By default QUANDL_DOWNLOAD_ATTEMPTS is 5, meaning that each request is attempted up to 5 times.

Note

QUANDL_DOWNLOAD_ATTEMPTS is not the total number of allowed failures, just the number of allowed failures per request. The quandl loader will make one request per 100 equities for the metadata followed by one request per equity.
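
The per-request meaning of the setting can be sketched as follows. This is an illustration in plain Python, not the quandl loader itself: attempts_from_environ and download_with_retries are hypothetical names, and fetch stands for any callable that performs a single request.

```python
def attempts_from_environ(environ):
    """Read QUANDL_DOWNLOAD_ATTEMPTS from a mapping, defaulting to 5."""
    return int(environ.get("QUANDL_DOWNLOAD_ATTEMPTS", 5))


def download_with_retries(fetch, attempts=5):
    """Call fetch() until it succeeds, giving up after `attempts` failures.

    The counter applies to this one request only; every other request
    starts again with a fresh budget.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # a real loader would catch narrower errors
            last_error = exc
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error
```

Because the budget is per request, an ingestion that issues many requests can tolerate far more than 5 failures in total.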

Writing a New Bundle

Data bundles exist to make it easy to use different data sources with zipline. To add a new bundle, one must implement an ingest function.

The ingest function is responsible for loading the data into memory and passing it to a set of writer objects provided by zipline to convert the data to zipline's internal format. The ingest function may work by downloading data from a remote location like the quandl bundle or it may just load files that are already on the machine. The function is provided with writers that will write the data to the correct location transactionally. If an ingestion fails part way through, the bundle will not be written in an incomplete state.

The signature of the ingest function should be:

ingest(environ,
       asset_db_writer,
       minute_bar_writer,
       daily_bar_writer,
       adjustment_writer,
       calendar,
       start_session,
       end_session,
       cache,
       show_progress,
       output_dir)

environ

environ is a mapping representing the environment variables to use. This is where any custom arguments needed for the ingestion should be passed; for example, the quandl bundle uses the environment to pass the API key and the download retry attempt count.

asset_db_writer

asset_db_writer is an instance of AssetDBWriter. This is the writer for the asset metadata which provides the asset lifetimes and the symbol to asset id (sid) mapping. This may also contain the asset name, exchange and a few other columns. To write data, invoke write() with dataframes for the various pieces of metadata. More information about the format of the data exists in the docs for write.

minute_bar_writer

minute_bar_writer is an instance of BcolzMinuteBarWriter. This writer is used to convert data to zipline's internal bcolz format to later be read by a BcolzMinuteBarReader. If minute data is provided, users should call write() with an iterable of (sid, dataframe) tuples. The show_progress argument should also be forwarded to this method. If the data source does not provide minute level data, then there is no need to call the write method. It is also acceptable to pass an empty iterator to write() to signal that there is no minute data.

Note

The data passed to write() may be a lazy iterator or generator to avoid loading all of the minute data into memory at a single time. A given sid may also appear multiple times in the data as long as the dates are strictly increasing.
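
A generator is one simple way to produce such a lazy iterable. The sketch below is illustrative only: iter_minute_bars and load_frame are hypothetical names, with load_frame standing for whatever function returns the minute-bar dataframe of one sid.

```python
def iter_minute_bars(sids, load_frame):
    """Yield (sid, dataframe) pairs one at a time.

    load_frame(sid) runs only when the consumer asks for the next pair,
    so at most one frame is built per step.
    """
    for sid in sids:
        yield sid, load_frame(sid)


# Inside an ingest function the generator would be handed to the writer:
# minute_bar_writer.write(
#     iter_minute_bars(sids, load_frame),
#     show_progress=show_progress,
# )
```

Nothing is loaded when the generator is created; each frame is produced as write() iterates.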

daily_bar_writer

daily_bar_writer is an instance of BcolzDailyBarWriter. This writer is used to convert data into zipline's internal bcolz format to later be read by a BcolzDailyBarReader. If daily data is provided, users should call write() with an iterable of (sid, dataframe) tuples. The show_progress argument should also be forwarded to this method. If the data source does not provide daily data, then there is no need to call the write method. It is also acceptable to pass an empty iterable to write() to signal that there is no daily data. If no daily data is provided but minute data is provided, a daily rollup will happen to service daily history requests.

Note

Like the minute_bar_writer, the data passed to write() may be a lazy iterable or generator to avoid loading all of the data into memory at once. Unlike the minute_bar_writer, a sid may only appear once in the data iterable.
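
Since a repeated sid is not allowed here, it can be useful to guard the iterable before handing it to the writer. The following is a hedged sketch, not part of zipline: unique_sids is a hypothetical wrapper that stays lazy and raises as soon as a sid shows up a second time.

```python
def unique_sids(pairs):
    """Pass (sid, dataframe) pairs through lazily, raising on a repeated sid."""
    seen = set()
    for sid, frame in pairs:
        if sid in seen:
            raise ValueError(f"sid {sid} appears more than once in the daily data")
        seen.add(sid)
        yield sid, frame


# Possible use inside an ingest function (daily_pairs is hypothetical):
# daily_bar_writer.write(unique_sids(daily_pairs), show_progress=show_progress)
```

Only the set of sids seen so far is kept in memory, so the wrapper does not defeat the purpose of a lazy iterable.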

adjustment_writer

adjustment_writer is an instance of SQLiteAdjustmentWriter. This writer is used to store splits, mergers, dividends, and stock dividends. The data should be provided as dataframes and passed to write(). Each of these fields is optional, but the writer can accept as much of the data as you have.

calendar

calendar is an instance of zipline.utils.calendars.TradingCalendar. The calendar is provided to help some bundles generate queries for the days needed.

start_session

start_session is a pandas.Timestamp object indicating the first day that the bundle should load data for.

end_session

end_session is a pandas.Timestamp object indicating the last day that the bundle should load data for.

cache

cache is an instance of dataframe_cache. This object is a mapping from strings to dataframes. This object is provided in case an ingestion crashes part way through. The idea is that the ingest function should check the cache for raw data; if the data is not in the cache, the function should acquire it and then store it in the cache. Then it can parse and write the data. The cache is cleared only after a successful load, which prevents the ingest function from having to redownload all the data if there is a bug in the parsing. If it is very fast to get the data, for example if it is coming from another local file, then there is no need to use this cache.
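
The check-then-store pattern described above might look like the sketch below. It relies only on cache behaving like a mapping, as stated; load_raw and download are hypothetical names, and a plain dict can stand in for the cache when trying it out.

```python
def load_raw(cache, key, download):
    """Return the raw data for `key`, calling download() only on a cache miss."""
    try:
        return cache[key]
    except KeyError:
        # Store the freshly acquired data so a later, restarted ingestion
        # can skip the download.
        raw = cache[key] = download()
        return raw
```

Parsing and writing happen after this step, so a parsing bug does not force the download to be repeated on the next attempt.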

show_progress

show_progress is a boolean indicating that the user would like to receive feedback about the ingest function's progress fetching and writing the data. Examples include showing how many files have been downloaded out of the total needed, or how far along a data conversion is. One tool that may help with implementing show_progress for a loop is maybe_show_progress. This argument should always be forwarded to minute_bar_writer.write and daily_bar_writer.write.

output_dir

output_dir is a string representing the file path where all the data will be written. output_dir will be some subdirectory of $ZIPLINE_ROOT and will contain the time of the start of the current ingestion. This can be used to move resources here directly if, for some reason, your ingest function can produce its own outputs without the writers. For example, the quantopian-quandl bundle uses this to untar the bundle directly into the output_dir.

Ingesting Data from .csv Files

Zipline provides a bundle called csvdir, which allows users to ingest data from .csv files. The format of the files should be in OHLCV format, with dates, dividends, and splits. A sample is provided below. There are other samples for testing purposes in zipline/tests/resources/csvdir_samples.

date,open,high,low,close,volume,dividend,split
2012-01-03,58.485714,58.92857,58.42857,58.747143,75555200,0.0,1.0
2012-01-04,58.57143,59.240002,58.468571,59.062859,65005500,0.0,1.0
2012-01-05,59.278572,59.792858,58.952858,59.718571,67817400,0.0,1.0
2012-01-06,59.967144,60.392857,59.888573,60.342857,79573200,0.0,1.0
2012-01-09,60.785713,61.107143,60.192856,60.247143,98506100,0.0,1.0
2012-01-10,60.844284,60.857143,60.214287,60.462856,64549100,0.0,1.0
2012-01-11,60.382858,60.407143,59.901428,60.364285,53771200,0.0,1.0
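
Before ingesting, it can help to sanity-check a file against the sample layout. The sketch below uses only the Python standard library and is not part of csvdir: read_ohlcv is a hypothetical helper, and the low/high consistency check is an extra assumption about well-formed OHLCV rows rather than a documented requirement.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "date", "open", "high", "low", "close", "volume", "dividend", "split",
]


def read_ohlcv(text):
    """Parse csvdir-style text into a list of row dicts with numeric fields."""
    reader = csv.DictReader(io.StringIO(text))
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for record in reader:
        row = {"date": record["date"]}
        for name in EXPECTED_COLUMNS[1:]:
            row[name] = float(record[name])
        # Assumed sanity check: open and close must lie within [low, high].
        lowest = min(row["open"], row["close"])
        highest = max(row["open"], row["close"])
        if not (row["low"] <= lowest and highest <= row["high"]):
            raise ValueError(f"inconsistent OHLC values on {row['date']}")
        rows.append(row)
    return rows
```

Running it over each file before ingestion surfaces a missing column or a malformed row early.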

Once you have your data in the correct format, you can edit your extension.py file in ~/.zipline/extension.py and import the csvdir bundle, along with pandas.

import pandas as pd

from zipline.data.bundles import register
from zipline.data.bundles.csvdir import csvdir_equities

We’ll then want to specify the start and end sessions of our bundle data:

start_session = pd.Timestamp('2016-1-1', tz='utc')
end_session = pd.Timestamp('2018-1-1', tz='utc')

And then we can register() our bundle, and pass the location of the directory in which our .csv files exist:

register(
    'custom-csvdir-bundle',
    csvdir_equities(
        ['daily'],
        '/path/to/your/csvs',
    ),
    calendar_name='NYSE', # US equities
    start_session=start_session,
    end_session=end_session
)

To finally ingest our data, we can run:

$ zipline ingest -b custom-csvdir-bundle
Loading custom pricing data:   [############------------------------]   33% | FAKE: sid 0
Loading custom pricing data:   [########################------------]   66% | FAKE1: sid 1
Loading custom pricing data:   [####################################]  100% | FAKE2: sid 2
Loading custom pricing data:   [####################################]  100%
Merging daily equity files:  [####################################]

# optionally, we can pass the location of our csvs via the command line
$ CSVDIR=/path/to/your/csvs zipline ingest -b custom-csvdir-bundle
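
The two ways of supplying the directory (the path given at registration, or the CSVDIR environment variable) could be combined as in this sketch. It is an illustration, not csvdir's code: resolve_csvdir is a hypothetical name, and preferring the explicit path over the environment variable is an assumption.

```python
import os


def resolve_csvdir(csvdir=None, environ=os.environ):
    """Pick the CSV directory: an explicit path first, then $CSVDIR."""
    path = csvdir or environ.get("CSVDIR")
    if not path:
        raise ValueError("no CSV directory given and CSVDIR is not set")
    return path
```

Under this assumed precedence, CSVDIR is consulted only when no path was passed explicitly.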

If your equities do not trade on the NYSE calendar or any other existing zipline calendar, see the Trading Calendar Tutorial to build a custom trading calendar, then pass its name to register().

Original article: https://www.cnblogs.com/bitquant/p/Zipline-Data-Bundles.html