Building two indica rice reference genomes with PacBio long-read and Illumina paired-end sequencing data

Jianwei Zhang,
Ling-Ling Chen,
[…]
Qifa Zhang

Scientific Data volume 3, Article number: 160076 (2016)

Abstract

Over the past 30 years, we have performed many fundamental studies on two Oryza sativa subsp. indica varieties, Zhenshan 97 (ZS97) and Minghui 63 (MH63). To improve the resolution of many of these investigations, we generated two reference-quality genome assemblies using the most advanced sequencing technologies. Using PacBio SMRT technology, we produced over 108 (ZS97) and 174 (MH63) Gb of raw sequence data from 166 (ZS97) and 209 (MH63) pools of BAC clones, and generated ~97 (ZS97) and ~74 (MH63) Gb of paired-end whole-genome shotgun (WGS) sequence data with Illumina sequencing technology. With these data, we successfully assembled two platinum standard reference genomes that have been publicly released. Here we provide the full sets of raw data used to generate these two reference genome assemblies. These data sets can be used to test new programs for better genome assembly and annotation, to aid the discovery of new insights into genome structure, function, and evolution, and to provide essential support for biological research in general.

Design Type(s)	strain comparison design • genome assembly
Measurement Type(s)	reference genome data • whole genome sequencing
Technology Type(s)	DNA sequencing
Factor Type(s)	selectively maintained organism
Sample Characteristic(s)	Oryza sativa Indica Group

Machine-accessible metadata file describing the reported data (ISA-Tab format)

Background & Summary

Rice is the leading staple crop for mankind and has been recognized as an important model organism for biological research, especially for monocot plants. Asian cultivated rice (Oryza sativa) is composed of two subspecies: O. sativa subsp. japonica and subsp. indica; indica rice accounts for over 70% of rice production worldwide [1] and is genetically much more diverse than japonica [2]. The indica varieties Zhenshan 97 (ZS97) and Minghui 63 (MH63) represent two major varietal groups of indica rice [3], contain a number of important agronomic traits and are the parents of Shanyou 63 (SY63), the most widely cultivated hybrid rice in China. The ZS97, MH63 and SY63 hybrid system has been used as a model [4–9] over the past 30 years, and concomitantly our lab has made a series of attempts to gain a fundamental understanding of the genetic basis of heterosis, a biological mystery that has puzzled the scientific community for more than 100 years. Hence, we initiated a joint collaborative project to generate two reference-quality genome assemblies for ZS97 and MH63 to be used as a fundamental tool to help us understand the underlying molecular genetic basis of heterosis [10]. In this descriptor, we report the resources and data sets that were generated and used to assemble the ZS97 and MH63 reference genomes: (1) two BAC libraries, (2) two improved physical maps and minimum tiling paths (MTP), (3) raw PacBio sequencing data of BAC pools and full clone sequence assemblies, (4) Illumina WGS sequence and assembly data, and (5) the first release of reference genome assemblies for ZS97 and MH63. With the resources and data generated in this study, we were able not only to assemble two reference-quality genome sequences de novo, but also to provide the scientific community with data to advance biological research at the genomic level, especially for further understanding of the genetic basis of heterosis.
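
To give a sense of scale for the raw data described in the abstract, the quoted volumes can be put on a per-pool basis. The sketch below uses only the figures stated there (PacBio Gb and BAC pool counts, Illumina Gb). The dictionary layout and function names are illustrative assumptions, not part of the released data sets, and because the abstract gives the PacBio totals as "over" the quoted values, the per-pool means are lower bounds.

```python
# Figures as quoted in the abstract; names and layout are hypothetical.
PACBIO = {
    "ZS97": {"raw_gb": 108, "bac_pools": 166},  # "over 108 Gb" from 166 BAC pools
    "MH63": {"raw_gb": 174, "bac_pools": 209},  # "174 Gb" from 209 BAC pools
}
ILLUMINA_GB = {"ZS97": 97, "MH63": 74}  # approximate ("~") paired-end WGS data


def gb_per_pool(variety):
    """Mean PacBio raw data per BAC pool, in Gb."""
    entry = PACBIO[variety]
    return entry["raw_gb"] / entry["bac_pools"]


def summarize():
    """Per-variety summary built only from the quoted figures."""
    return {
        variety: {
            "pacbio_gb_per_pool": round(gb_per_pool(variety), 2),
            "illumina_gb": ILLUMINA_GB[variety],
        }
        for variety in PACBIO
    }


if __name__ == "__main__":
    for variety, stats in summarize().items():
        print(variety, stats)
```

On these figures, each BAC pool received on average roughly 0.65 Gb (ZS97) and 0.83 Gb (MH63) of raw PacBio sequence.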
