Benchmarking of long-read assemblers for prokaryote whole genome sequencing


Authors: Ryan R. Wick (a,1) and Kathryn E. Holt (1,2)

Author roles: Ryan R. Wick: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Software, Writing – Original Draft Preparation. Kathryn E. Holt: Conceptualization, Supervision, Writing – Review & Editing.

Version Changes

Updated. Changes from Version 1

This version contains updated results for new versions of Flye (v2.7), Raven (v0.0.8) and Shasta (v0.4.0), and it adds a new assembler (NECAT v20200119) to the comparison. It also contains various small improvements made in response to the peer reviews.

Peer Review Summary

Review date | Reviewer name(s) | Version reviewed | Review status
2020 Jan 30 | Olin Silander | Version 1 | Approved
2020 Jan 22 | Mikhail Kolmogorov | Version 1 | Approved
2020 Jan 16 | Mile Šikić and Robert Vaser | Version 1 | Approved
2020 Jan 9 | Steven L Salzberg and Aleksey Zimin | Version 1 | Approved

Abstract

Background: Data sets from long-read sequencing platforms (Oxford Nanopore Technologies and Pacific Biosciences) allow for most prokaryote genomes to be completely assembled, with one contig per chromosome or plasmid. However, the high per-read error rate of long-read sequencing necessitates different approaches to assembly than those used for short-read sequencing. Multiple assembly tools (assemblers) exist, which use a variety of algorithms for long-read assembly.

Methods: We used 500 simulated read sets and 120 real read sets to assess the performance of seven long-read assemblers (Canu, Flye, Miniasm/Minipolish, NECAT, Raven, Redbean and Shasta) across a wide variety of genomes and read parameters. Assemblies were assessed on their structural accuracy/completeness, sequence identity, contig circularisation and computational resources used.

Results: Canu v1.9 produced moderately reliable assemblies but had the longest runtimes of all assemblers tested. Flye v2.7 was more reliable and did particularly well with plasmid assembly. Miniasm/Minipolish v0.3 and NECAT v20200119 were the most likely to produce clean contig circularisation. Raven v0.0.8 was the most reliable for chromosome assembly, though it did not perform well on small plasmids and had circularisation issues. Redbean v2.5 and Shasta v0.4.0 were computationally efficient but more likely to produce incomplete assemblies.

Conclusions: Of the assemblers tested, Flye, Miniasm/Minipolish and Raven performed best overall. However, no single tool performed well on all metrics, highlighting the need for continued development on long-read assembly algorithms.

Keywords: Assembly, long-read sequencing, Oxford Nanopore Technologies, Pacific Biosciences, microbial genomics, benchmarking

Introduction

Genome assembly is the computational process of using shotgun whole-genome sequencing data (reads) to reconstruct an organism’s true genomic sequence to the greatest extent possible 1. Software tools which carry out assembly (assemblers) take sequencing reads as input and produce reconstructed contiguous pieces of the genome (contigs) as output.

If a genome contains repetitive sequences (repeats) which are longer than the sequencing reads, then the underlying genome cannot be fully reconstructed without additional information; i.e. if no read spans a repeat in the genome, then that repeat cannot be resolved, limiting contig length 2. Short-read sequencing platforms (e.g. those made by Illumina) produce reads hundreds of bases in length and tend to result in shorter contigs. In contrast, long-read platforms from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) can generate reads tens of thousands of bases in length which span more repeats and thus result in longer contigs 3.

Prokaryote genomes are simpler than eukaryote genomes in a few aspects relevant to assembly. First, they are smaller, most being less than 10 Mbp in size 4. Second, they contain less repetitive content and their longest repeat sequences are often less than 10 kbp in length 5. Third, prokaryote genomes are haploid and thus avoid assembly-related complications from diploidy/polyploidy 6. These facts make prokaryote genome assembly a more tractable problem than eukaryote genome assembly, and in most cases a long-read set of sufficient depth should contain enough information to generate a complete assembly – each replicon in the genome being fully assembled into a single contig 7. Prokaryote genomes also have two other features relevant to assembly: they may contain plasmids that differ from the chromosome in copy number and therefore read depth, and most prokaryote replicons are circular with no defined start/end point.

In this study, we examine the performance of various long-read assemblers in the context of prokaryote whole genomes. We assessed each tool on its ability to generate complete assemblies using both simulated and real read sets. We also investigated prokaryote-specific aspects of assembly, such as performance on plasmids and the circularisation of contigs.

Methods

Simulated read sets

Simulated read sets (read sequences generated in silico from reference genomes) offer some advantages over real read sets when assessing assemblers. They allow for a confident ground truth – i.e. the true underlying genome is known with certainty. They allow for large sample sizes, in practice limited only by computational resources. Also, a variety of genomes and read set parameters can be used to examine assembler performance over a wide range of scenarios. For this study, we simulated 500 read sets to test the assemblers, each using different parameters and a different prokaryote genome.

To select reference genomes for the simulated read sets, we first downloaded all bacterial and archaeal RefSeq genomes using ncbi-genome-download v0.2.10 (14333 genomes at the time of download) 8. We then applied quality control filters, excluding any genome with a chromosome larger than 10 Mbp or smaller than 500 kbp, any plasmid larger than 300 kbp, any plasmid larger than 25% of the chromosome size, or more than 9 plasmids (Extended data, Figure S1) 9. We then ran Assembly Dereplicator v0.1.0 with a threshold of 0.1, resulting in 3153 unique genomes 10.
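
The exclusion criteria above can be sketched as a simple filter. This is a minimal illustration rather than the authors' script: the `Genome` container and the `passes_qc` name are hypothetical, and it assumes that the chromosome and plasmid lengths of each genome have already been identified.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Genome:
    """Replicon sizes (in bp) for one genome. Hypothetical container for illustration."""
    name: str
    chromosome_bp: int
    plasmid_bps: List[int]


def passes_qc(genome: Genome) -> bool:
    """Return True if the genome survives the size and plasmid-count filters."""
    if genome.chromosome_bp > 10_000_000:   # chromosome larger than 10 Mbp
        return False
    if genome.chromosome_bp < 500_000:      # chromosome smaller than 500 kbp
        return False
    if len(genome.plasmid_bps) > 9:         # more than 9 plasmids
        return False
    for plasmid_bp in genome.plasmid_bps:
        if plasmid_bp > 300_000:            # plasmid larger than 300 kbp
            return False
        if plasmid_bp > 0.25 * genome.chromosome_bp:  # plasmid > 25% of chromosome
            return False
    return True


if __name__ == "__main__":
    example = Genome("example", 4_600_000, [90_000, 5_000])
    print(example.name, passes_qc(example))
```

Dereplication (the Assembly Dereplicator step) is not modelled here; the sketch covers only the size-based exclusions.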

To produce a final set of 500 genomes with 500 plasmids, we randomly selected 250 genomes from those containing plasmids, repeating this selection until the genomes contained exactly 500 plasmids. We then added 250 genomes randomly selected from those without plasmids. Any ambiguous bases in the assemblies were replaced with ‘A’ to ensure that sequences contained only the four canonical DNA bases.
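
The selection and base-replacement steps can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the resampling loop is one plausible reading of "repeating this selection", and upper-casing the sequence before replacement is an assumption not stated in the text.

```python
import random
import re
from typing import Dict, List


def pick_genomes_with_plasmids(plasmid_counts: Dict[str, int],
                               n_genomes: int = 250,
                               n_plasmids: int = 500,
                               seed: int = 0,
                               max_tries: int = 100_000) -> List[str]:
    """Randomly sample n_genomes names, retrying until they hold exactly n_plasmids."""
    rng = random.Random(seed)
    names = sorted(plasmid_counts)
    for _ in range(max_tries):
        chosen = rng.sample(names, n_genomes)
        if sum(plasmid_counts[name] for name in chosen) == n_plasmids:
            return chosen
    raise RuntimeError("no selection with the requested plasmid total was found")


def replace_ambiguous(seq: str) -> str:
    """Replace every non-ACGT character with 'A' (sequence is upper-cased first)."""
    return re.sub(r"[^ACGT]", "A", seq.upper())


if __name__ == "__main__":
    counts = {"a": 1, "b": 2, "c": 3, "d": 2, "e": 2, "f": 4}
    print(pick_genomes_with_plasmids(counts, n_genomes=3, n_plasmids=6))
    print(replace_ambiguous("ACGTNRY"))
```

Replacing ambiguous bases with a fixed base keeps the read simulator's input strictly four-letter, at the cost of introducing a small number of arbitrary bases into the reference.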

We then used Badread v0.1.5 to generate one read set for each input genome 11. The parameters for each set (controlling read depth, length, identity and errors) were randomly chosen to ensure a large amount of variability (Extended data, Figure S2) 9. Note that not all of these read sets were sufficient to reconstruct the original genome (due to low depth or short read length), so even an ideal assembler would be incapable of completing an assembly for all 500 test sets.

For genomes containing plasmids, the read depth of plasmids relative to the chromosome was also set randomly, with limits based on the plasmid size (Extended data, Figure S3) 9. Large plasmids were simulated at depths close to that of the chromosome while small plasmids spanned a wider range of depth. This was done to model the observed pattern that small plasmids often have a high per-cell copy number (i.e. may be high read depth) but can be biased against in library preparations (i.e. may be low read depth) 12. All replicons (chromosomes and plasmids) were treated as circular sequences in Badread, so the simulated read sets do not test assembler performance on linear sequences.

Real read sets

Despite the advantages of simulated read sets, they can be unrealistic because read simulation tools (such as Badread) may not accurately model all relevant features: error profiles, read lengths, quality scores, etc. Real read sets are therefore also valuable when assessing assemblers. The challenge with real read sets is obtaining a ground truth genome against which assemblies can be checked. Since many reference genome sequences are produced using long-read assemblies, there is the risk of circular reasoning – if we use an assembly as our ground truth reference, our results will be biased in favour of whichever assembler produced the reference.

To avoid this issue, we used the datasets produced in a recent study comparing ONT (MinION R9.4) and PacBio (RSII CLR) data which also included Illumina reads for each isolate 13. For each of the 20 bacterial isolates in that study, we conducted two hybrid assemblies using Unicycler v0.4.7: Illumina+ONT and Illumina+PacBio 14. Unicycler works by first generating an assembly graph using the Illumina reads, then using long-read alignments to scaffold the graph’s contigs into a completed genome, which is a distinct approach from any of the long-read assemblers tested in this study. We ran the assemblies using Unicycler’s --no_miniasm option so it skipped its Miniasm-based step which could bias the results in favour of Miniasm/Minipolish. We then excluded any isolate where either hybrid assembly failed to reach completion or where there were >50 nucleotide differences between the two assemblies as determined by a Minimap2 alignment 15. That is, the Illumina+ONT and Illumina+PacBio hybrid assemblies needed to be in near-perfect agreement with each other. This left six isolates for inclusion. The above process may have biased these isolates in favour of easier-to-assemble genomes, as more complex genomes would be more likely to encounter inconsistencies between the two Unicycler assemblies.

The ONT and PacBio read sets for these isolates were quite deep (156× to 535×) so to increase the number of assembly tests, we produced ten random read subsets of each, ranging from 40× to 100× read depth. This resulted in 120 total read sets for testing the assemblers (6 genomes × 2 platforms × 10 read subsets). The Illumina+ONT hybrid assembly was used as ground truth for each isolate.
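
The subsetting step can be sketched as below. This is a minimal illustration, not the procedure actually used: the text does not say how target depths were chosen within the 40× to 100× range or how reads were drawn, so uniform random depths and read-by-read random sampling are assumptions, and both function names are hypothetical.

```python
import random
from typing import List, Tuple


def subsample_to_depth(read_lengths: List[int], genome_size: int,
                       target_depth: float, rng: random.Random) -> List[int]:
    """Pick reads in random order until their total length reaches the target depth.

    Returns the indices of the chosen reads.
    """
    order = list(range(len(read_lengths)))
    rng.shuffle(order)
    target_bases = target_depth * genome_size
    chosen, total = [], 0
    for index in order:
        if total >= target_bases:
            break
        chosen.append(index)
        total += read_lengths[index]
    return chosen


def make_subsets(read_lengths: List[int], genome_size: int, n_subsets: int = 10,
                 min_depth: float = 40.0, max_depth: float = 100.0,
                 seed: int = 0) -> List[Tuple[float, List[int]]]:
    """Make n_subsets random subsets, each at a random depth in [min_depth, max_depth]."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        depth = rng.uniform(min_depth, max_depth)
        subsets.append((depth, subsample_to_depth(read_lengths, genome_size, depth, rng)))
    return subsets


if __name__ == "__main__":
    lengths = [1000] * 2000          # toy read set: 200x depth of a 10 kbp genome
    for depth, indices in make_subsets(lengths, genome_size=10_000):
        print(f"target {depth:.1f}x -> {len(indices)} reads")
    print("total read sets:", 6 * 2 * 10)
```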

All real and simulated read sets 16 and reference genomes 17 are available as Underlying data.

Assemblers tested

We assembled each of the read sets using the current versions of seven long-read assemblers: Canu v1.9, Flye v2.7, Miniasm/Minipolish v0.3, NECAT v20200119, Raven v0.0.8, Redbean v2.5 and Shasta v0.4.0. Default parameters were used except where stated, and exact commands for each tool are given in the Extended data, Figure S4 9. Assemblers that only work on PacBio reads (i.e. not on ONT reads) were excluded (HGAP 18, FALCON 19, HINGE 20 and Dazzler 21), as were hybrid assemblers which also require short read input (Unicycler 14 and MaSuRCA 22).

Canu has the longest history of all the assemblers tested, with its first release dating back to 2015. It performs assembly by first correcting reads, then trimming reads (removing adapters and breaking chimeras) and finally assembling reads into contigs 23. Its assembly strategy uses a modified version of the string graph algorithm 24, sometimes referred to as the overlap-layout-consensus (OLC) approach.

Flye takes a different approach to assembly: first combining reads into error-prone disjointigs, then collapsing repetitive sequences to make a repeat graph and finally resolving the graph’s repeats to make the final contigs 25. Of particular note to prokaryote assemblies, Flye has options for recovery of small plasmids (--plasmids) and uneven depth of coverage (--meta), both of which we used in this analysis.

Miniasm builds a string graph from a set of read overlaps, i.e. it performs only the layout step of OLC. It does not perform read overlapping, which must be done separately with Minimap2, and it does not have a consensus step, so its assembly error rates are comparable to raw read error rates. A separate polishing tool such as Racon is therefore required to achieve high sequence identity 26. For this study, we developed a tool called Minipolish to simplify this process by conducting Racon polishing (two rounds by default) on a Miniasm assembly graph 27. To ensure clean circularisation of prokaryote replicons, circular contigs are ‘rotated’ (have their starting position adjusted) between polishing rounds. Minipolish also comes with a script (miniasm_and_minipolish.sh) which carries out all assembly steps (Minimap2 overlapping, Miniasm assembly and Minipolish consensus) in a single command, and subsequent references to ‘Miniasm/Minipolish’ refer to this entire pipeline.
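
The rotation of a circular contig mentioned above amounts to choosing a new starting position without changing the circular sequence. The sketch below is illustrative only: `rotate_circular` is a hypothetical helper, and rotating by half the contig length is an assumption rather than Minipolish's documented behaviour. A plausible motivation is that the bases near the old start/end, which a linear polisher covers poorly, move into the interior for the next round.

```python
def rotate_circular(seq: str, new_start: int) -> str:
    """Return the same circular sequence written from a different starting position."""
    if not seq:
        return seq
    new_start %= len(seq)
    return seq[new_start:] + seq[:new_start]


if __name__ == "__main__":
    contig = "ACGTTGCA"
    # Assumption for illustration: rotate by half the length between polishing rounds.
    rotated = rotate_circular(contig, len(contig) // 2)
    print(contig, "->", rotated)
```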

NECAT follows an approach similar to Canu: first correcting the input reads, then building an assembly from the corrected reads 28. Both the correction and assembly steps are progressive, using multiple processing steps to achieve better accuracy/completeness.

Raven (previously known as Ra) is another tool which takes an OLC approach to assembly 29. Its overlapping step shares algorithms with Minimap2, and its consensus step is based on Racon, making it similar to Miniasm/Minipolish. It differs in its layout step which includes novel approaches to remove spurious overlaps from the graph, helping to improve assembly contiguity.

Redbean (previously known as Wtdbg2) uses an approach to long-read assembly called a fuzzy Bruijn graph 30. This is modelled on the De Bruijn graph concept widely used for short-read assembly 31 but modified to work with the inexact sequence matches present in noisy long reads.

Shasta is an assembler designed for computational efficiency 32. To achieve this, much of its assembly pipeline is performed not directly on read sequences but rather on a reduced representation of marker k-mers. These markers are used to find overlaps and build an assembly graph from which a consensus sequence is derived.

Computational environment

All assemblies were run on Ubuntu 18.04 instances of Australia’s Nectar Research Cloud which contained 32 vCPUs and 128 GB of RAM (r3.xxlarge flavour). To guard against performance variation caused by vCPU overcommit, the assemblers were limited to 16 threads (half the number of available vCPUs) in their options. Any assembly which exceeded 24 hours of runtime or 128 GB of memory usage was terminated.

Assembly assessment

Our primary metric of assembly quality was contiguity, defined here as the longest single Minimap2 alignment between the assembly and the reference replicon, relative to the reference replicon length. This provides a simpler picture of assembly quality than is created by QUAST (which quantifies misassemblies and other metrics such as NG50) but is appropriate for cases where complete assembly is likely 2. Contiguity of exactly 100% indicates that the replicon was assembled completely with no missing or extra sequence (Extended data, Figure S5A) 9. Contiguity of slightly less than 100% (e.g. 99.9%) indicates that the assembly was complete, but some bases were lost at the start/end of the contig (Extended data, Figure S5B) 9. Contiguity of more than 100% (e.g. 101%) indicates that the contig contains duplicated sequence via start-end overlap (Extended data, Figure S5C) 9. Much lower contiguity (e.g. 70%) indicates that the assembly was not complete due to fragmentation (Extended data, Figure S5D) 9, missing sequence (Extended data, Figure S5E) 9 or misassembly (Extended data, Figure S5F) 9. Contiguity values were determined by aligning the contigs to a tripled version of the reference replicon, necessary to ensure that contigs can fully align even with start-end overlap and regardless of their starting position relative to that of the linearised reference sequence (Extended data, Figure S6) 9. To encourage longer alignments, Minimap2 was run with the asm20 preset and chain elongation and banding thresholds of 10 kbp. The script for conducting this analysis (assess_assembly.py) is available in Extended data.
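
The contiguity calculation can be illustrated with a short sketch. It is not the authors' assess_assembly.py: the function names are hypothetical, alignments are supplied as precomputed (start, end) intervals on the tripled reference rather than produced by Minimap2, and measuring alignment length in reference coordinates is an assumption.

```python
from typing import List, Tuple


def tripled_reference(ref_seq: str) -> str:
    """Concatenate three copies of a circular replicon so that a contig with any
    starting position, or with start-end overlap, can align in one piece."""
    return ref_seq * 3


def contiguity(alignment_spans: List[Tuple[int, int]], replicon_length: int) -> float:
    """Longest single alignment (half-open spans on the tripled reference)
    divided by the length of the original, untripled replicon."""
    if not alignment_spans:
        return 0.0
    longest = max(end - start for start, end in alignment_spans)
    return longest / replicon_length


if __name__ == "__main__":
    length = 5_000_000
    cases = {
        "complete": [(1_200_000, 6_200_000)],
        "start-end overlap": [(1_200_000, 6_250_000)],
        "fragmented": [(0, 3_500_000), (3_500_000, 5_000_000)],
    }
    for label, spans in cases.items():
        print(f"{label}: {contiguity(spans, length):.1%}")
```

Because only the single longest alignment counts, a chromosome split into two contigs scores the fraction covered by the larger piece, which is how fragmentation shows up as a value well below 100%.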

Contiguity values were determined for each replicon in the assemblies, e.g. if a genome contained two plasmids, then the assemblies of that genome have three contiguity values: one for the chromosome and one for each plasmid. A status of ‘fully complete’ was assigned to assemblies where all replicons (the chromosome and any plasmids if present) achieved a contiguity of ≥99%. If an assembly had a chromosome with a contiguity of ≥99% but incomplete plasmids, it was given a status of ‘complete chromosome’. If the chromosome had a contiguity of <99%, the assembly was deemed ‘incomplete’. If the assembly was empty or missing (possibly due to the assembler prematurely terminating with an error), it was given a status of ‘empty’. Computational metrics were also observed for each assembly: time to complete and maximum RAM usage.
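
The status rules can be written as a small decision function. This sketch assumes a completeness threshold of at least 99% contiguity (the complement of the <99% rule for incomplete chromosomes) and represents an empty or missing assembly with `None`; the function name and that representation are illustrative choices, not part of the study's code.

```python
from typing import List, Optional


def assembly_status(chromosome_contiguity: Optional[float],
                    plasmid_contiguities: List[float],
                    threshold: float = 0.99) -> str:
    """Classify one assembly from its per-replicon contiguity values (as fractions)."""
    if chromosome_contiguity is None:          # assembly empty or missing
        return "empty"
    if chromosome_contiguity < threshold:      # chromosome not complete
        return "incomplete"
    if all(c >= threshold for c in plasmid_contiguities):
        return "fully complete"                # every replicon complete
    return "complete chromosome"               # chromosome complete, some plasmid not


if __name__ == "__main__":
    print(assembly_status(1.0, [0.999, 1.02]))
    print(assembly_status(1.0, [0.4]))
    print(assembly_status(0.7, []))
    print(assembly_status(None, []))
```

Note that a genome without plasmids is ‘fully complete’ as soon as its chromosome passes the threshold, since there are no further replicons to check.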

Results and discussion

Figure 1 and Figure 2 summarise the assembly results for the simulated and real read sets, respectively. Full tabulated results can be found in the Extended data 9. The assemblies, times and terminal outputs generated by each assembler are available as Underlying data 33.

Figure 1. Assembly results for the simulated read sets, which cover a wide variety of parameters for length, depth and quality. (A) Proportion of each possible assembly outcome. (B) Relative contiguity of the chromosome for each assembly, showing cleanliness of circularisation. (C) Sequence identity of each assembly’s longest alignment to the chromosome. (D) Total time taken (wall time) for each assembly. (E) Maximum RAM usage for each assembly. ‘Miniasm+’ here refers to the entire Miniasm/Minipolish assembly pipeline.

Figure 2. Assembly results for the real read sets, half containing ONT MinION reads (circles) and half PacBio RSII reads (X shapes). (A) Proportion of each possible assembly outcome. (B) Relative contiguity of the chromosome for each assembly, showing cleanliness of circularisation. (C) Sequence identity of each assembly’s longest alignment to the chromosome. (D) Total time taken (wall time) for each assembly. (E) Maximum RAM usage for each assembly. ‘Miniasm+’ here refers to the entire Miniasm/Minipolish assembly pipeline.

Figure 1A and Figure 2A show the proportion of read sets with each assembly status. For the real read sets, a higher proportion of completed assemblies indicates a more reliable assembler: one which is likely to make a completed assembly given a typical set of input reads. For the simulated read sets, a higher proportion of completed assemblies indicates a more robust assembler: one which is able to tolerate a wide range of input read parameters, including adverse conditions such as low read accuracy and low read depth (conditions present in some of the simulated read sets but not in the real read sets). Extended data, Figure S7 9 plots assembly contiguity against specific read set parameters to give a more detailed assessment of robustness. Plasmid assembly status, plotted with plasmid length and read depth, is shown in Extended data, Figure S8 and Figure S9 9 for the simulated and real read sets, respectively.

Figure 1B and Figure 2B show the chromosome contiguity values for each assembly, focusing on the range near 100%. These plots show how well assemblers can circularise contigs, i.e. whether sequence is duplicated or missing at the contig start/end (Extended data, Figure S5) 9. The closer contiguity is to 100% the better, with exactly 100% indicating perfect circularisation. Plasmid contiguity values are shown in Extended data, Figure S10 9.

Assembly identity (consensus identity) is a measure of the base-level accuracy of an assembled contig relative to the reference sequence (how few substitution and small indel errors are present) and is shown in Figure 1C and Figure 2C. The identity of assembled sequences is almost always higher than the identity of individual reads because errors can be ‘averaged out’ using read depth, producing more accurate consensus base calls. However, systematic read errors (e.g. mistakes in homopolymer length) can make perfect sequence identity difficult to achieve, regardless of assembly strategy 34.
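
The ‘averaging out’ of random errors can be demonstrated with a toy simulation. This is a deliberately simplified sketch and not how any of the tested assemblers compute consensus: it assumes substitution-only errors that are independent between reads, reads that are already perfectly aligned to the full sequence, and a plain per-column majority vote. All function names are hypothetical.

```python
import random
from collections import Counter
from typing import List


def noisy_copy(seq: str, error_rate: float, rng: random.Random) -> str:
    """Copy seq, replacing each base with a different base with probability error_rate."""
    out = []
    for base in seq:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def majority_consensus(reads: List[str]) -> str:
    """Take the most common base in each column of equal-length, pre-aligned reads."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    rng = random.Random(1)
    truth = "".join(rng.choice("ACGT") for _ in range(2000))
    reads = [noisy_copy(truth, 0.10, rng) for _ in range(25)]
    mean_read_identity = sum(identity(r, truth) for r in reads) / len(reads)
    print(f"mean read identity: {mean_read_identity:.4f}")
    print(f"consensus identity: {identity(majority_consensus(reads), truth):.4f}")
```

Systematic errors break the independence assumption: if most reads make the same mistake at a position, the majority vote reproduces it, which is why such errors limit achievable identity.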

Assembler resource usage is shown in terms of total runtime (Figure 1D and Figure 2D) and the maximum RAM usage during assembly (Figure 1E and Figure 2E).

Reliability

When considering only the chromosome, Raven was the most reliable assembler, followed by Flye: both were able to complete the chromosome in over three-quarters of the real read sets (Figure 2A). If plasmids are also considered, then Flye was the most reliable assembler. NECAT, Miniasm/Minipolish and Canu were moderately reliable, completing over half of the real read set chromosomes. Redbean and Shasta were the least reliable and completed less than half of the chromosomes.

Robustness

Flye, Miniasm/Minipolish and Raven were the most robust assemblers, able to complete over half of the assemblies attempted with the simulated read sets (Figure 1A). Flye and Redbean performed best in cases of low read depth, able to complete assemblies down to ~10× depth (Extended data, Figure S7A) 9. Raven performed the best with low-identity read sets (Extended data, Figure S7B) 9. The assemblers performed similarly with regard to read length, except for Shasta which required longer reads (Extended data, Figure S7C) 9. The assemblers were similarly unaffected by random reads, junk reads, chimeric reads or adapter sequences (Extended data, Figure S7D–F) 9. Read glitches (local breaks in continuity) were more likely to cause assembly problems for NECAT, Redbean and Shasta (Extended data, Figure S7G) 9.

Identity

In our real read tests, Flye achieved the highest overall assembled sequence identity (Figure 2C). Canu achieved high sequence identity on PacBio reads, while Miniasm/Minipolish and Raven did well on ONT reads. For each assembler, real PacBio reads resulted in higher identities than real ONT reads. For the simulated reads (which contain artificial error profiles), results were more erratic, with Canu and Flye performing best (Figure 1C).

The nature of read errors depends on the sequencing platform and basecalling software used, so these results may not hold true for all read sets. Platform-specific post-assembly polishing tools (including Nanopolish 7, Medaka 35 and Arrow 36) are routinely used to improve the accuracy of long-read assemblies 37, and these can often achieve assembly identities of >99.9% for ONT read sets and >99.999% for PacBio read sets (i.e. better than any of the assemblers were able to achieve on their own). Identity can be further increased by polishing with Illumina reads where available (e.g. with Pilon 38). Therefore, the sequence identity produced by the assembler itself is potentially unimportant for many users.

Resource usage

Canu was the slowest assembler tested on both real (Figure 2D) and simulated (Figure 1D) read sets, sometimes taking hours to complete. Its runtime was correlated with read accuracy and read set size, with low-accuracy and large read sets being more likely to result in a long runtime.

Flye was typically faster than Canu, taking less than 15 minutes for the real read sets and usually less than an hour for the simulated read sets. It sometimes took multiple hours to assemble simulated read sets, and this was correlated with the amount of junk (low-complexity) reads, suggesting that removal of such reads via pre-assembly QC may be beneficial. Flye had the highest RAM usage of the tested assemblers and its RAM usage was correlated with read N50 and read set size, with long and large read sets being more likely to result in high RAM usage.

Miniasm/Minipolish, NECAT, Raven and Redbean were comparable in performance, typically completing assemblies in less than 15 minutes and with less than 16 GB of RAM. While not tested in this study, Racon (which is used in Minipolish) and Raven can be run with GPU acceleration to further improve speed performance. Shasta was the fastest assembler and had the lowest memory usage.

Circularisation

Of all assemblers tested, Miniasm/Minipolish and NECAT most regularly achieved exact circularisation (contiguity = 100%) (Figure 1B and Figure 2B). Flye often excluded a small amount of sequence (tens of bases) from the start/end of circular contigs (contiguity <100%), and Raven typically excluded moderate amounts of sequence (hundreds of bases). Canu’s contiguities usually exceeded 100%, indicating a large amount (thousands of bases) of start/end overlap. The amount of overlap in a Canu assembly was correlated with the read N50 length (Extended data, Figure S7C) 9. Redbean and Shasta were both erratic in their circularisation, often producing some sequence duplication (contiguity >100%) but occasionally dropping sequence (contiguity <100%).

In addition to cleanly circularising contig sequences, it is valuable for a prokaryote genome assembler to clearly distinguish between circular and linear contigs. This can provide users with a clue as to whether or not the genome was assembled to completion. Flye, Miniasm/Minipolish and Shasta produce graph files of their final assembly which can indicate circularity. Canu indicates circularity via the ‘suggestCircular’ text in its contig headers. NECAT, Raven and Redbean do not signal to users whether a contig is circular.

Plasmids

Canu and Flye were the two assemblers most able to assemble plasmids across a broad range of sizes and depths (Extended data, Figures S8, S9) [9]. Miniasm/Minipolish also performed well, though it failed to assemble plasmids if they were very small or had a very high read depth. Raven was able to assemble most large plasmids but not small plasmids. NECAT, Redbean and Shasta were least successful at plasmid assembly.

Circularisation of plasmids followed the same pattern as for chromosomes, with only Miniasm/Minipolish consistently achieving clean circularisation (Extended data, Figure S10) [9]. For smaller plasmids, start/end overlap could sometimes result in contiguities of 200%, i.e. the plasmid sequence was duplicated in a single contig. This was most common with Canu, though it occurred with other assemblers as well.
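
To make the start/end overlap problem concrete, here is a minimal sketch that trims a contig whose end repeats its own start, including the 200% case where the whole plasmid appears twice. It is not the method used in this study: it assumes the overlap is an exact string match (real long-read contigs contain errors and would need alignment instead), and `trim_end_overlap` and `min_overlap` are hypothetical names.

```python
def trim_end_overlap(seq, min_overlap=10):
    """Remove a suffix that exactly repeats the contig's own prefix.

    Returns (trimmed_sequence, overlap_length). Overlaps are tried from the
    longest possible (half the contig, i.e. 200% contiguity) down to
    min_overlap, so shorter chance matches are not preferred.
    """
    for size in range(len(seq) // 2, min_overlap - 1, -1):
        if seq[:size] == seq[-size:]:
            return seq[:-size], size
    return seq, 0


if __name__ == "__main__":
    plasmid = "ACGTTGCAAGGCTTAACCGG"  # toy 20 bp circular sequence
    print(trim_end_overlap(plasmid + plasmid))       # fully duplicated contig
    print(trim_end_overlap(plasmid + plasmid[:12]))  # partial start/end overlap
    print(trim_end_overlap(plasmid))                 # already clean
```

The quadratic scan is acceptable for plasmid-sized sequences; genuinely repetitive sequences could produce false overlaps, so results would need checking against the reads.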

Ease of use

All assemblers tested were relatively easy to use, either running with a single command (Canu, Flye, NECAT, Raven and Shasta) or providing a convenience script to bundle the commands together (Miniasm/Minipolish and Redbean). All were able to take long reads in FASTQ format as input (Extended data, Figure S4) [9]. We encountered no difficulty installing any of the tools by following the instructions provided.

Some of the assemblers needed a predicted genome size as input (Canu, Flye, NECAT and Redbean) while others (Miniasm/Minipolish, Raven and Shasta) did not. This requirement could be a nuisance when assembling unknown isolates, as it may be hard to specify a genome size before the species is known.

Configurability

While we ran our assemblies using default and/or recommended commands (Extended data, Figure S4) [9], some of the assemblers have parameters which can be used to alter their behaviour. Raven was the least configurable assembler tested, with few options available to users. Flye offers some parameters, including overlap and coverage thresholds. Miniasm/Minipolish, NECAT, Redbean and Shasta all offer more options, and Canu is the most configurable with hundreds of adjustable parameters. Many of the available parameters are arcane (e.g. Miniasm’s ‘max and min overlap drop ratio’ or Shasta’s ‘pruneIterationCount’), and only experienced power users are likely to adjust them; most users will likely stick with default settings or only adjust easier-to-understand options. However, low-level parameters provide an opportunity to experiment and gain greater control over assemblies, and are therefore appreciated even when unlikely to be used.

Another aspect worth noting is whether an assembler produces useful files other than its final assembly. Canu and NECAT stand out in this respect, as they create corrected and trimmed reads in their pipelines which have low error rates and are mostly free of adapters and chimeric sequences. Canu and NECAT can therefore be considered not just assemblers but also long-read correction tools suitable for use in other analyses.

Assembler summaries

Canu v1.9 was the slowest assembler and not the most reliable or robust. Its strength is in its configurability, so power users who are willing to learn Canu’s nuances may find that they can tune it to fit their needs. However, it is probably not the best choice for users wanting a quick and simple prokaryote genome assembly.

Flye v2.7 was an overall strong performer in our tests: reliable, robust and good with plasmids. However, it requires a genome size parameter, tended to delete some sequence (usually on the order of tens of bases) when circularising contigs and could be excessive in its RAM usage when assembling simulated read sets.

Miniasm/Minipolish v0.3 was not the most reliable assembler but was fairly robust to read set parameters. Its main strength is that it was the assembler most likely to achieve perfect contig circularisation (as this is a specific goal of its polishing step). Also, it does not require a genome size parameter, which makes it easier to run than Canu, Flye, NECAT or Redbean for unknown genomes.

NECAT v20200119 performed reliably with chromosome assembly in the real read sets and was second only to Miniasm/Minipolish for contig circularisation. However, it performed worse on simulated reads (i.e. was less robust) and failed to assemble many plasmids.

Raven v0.0.8 was the most reliable and robust assembler for chromosome assembly. However, it suffered from worse circularisation problems than Flye (often deleting hundreds of bases) and did not perform well with small plasmids. Like Miniasm/Minipolish, it does not require a genome size parameter.

Redbean v2.5 assemblies tended to have glitches in the sequence which caused breaks in contiguity, making it perform poorly in both reliability and robustness. This, combined with its erratic circularisation performance and its requirement to specify a genome size, makes it a less-than-ideal choice for assembling long-read prokaryote read sets.

Shasta v0.4.0 was the fastest assembler tested and used the least RAM, but it had the worst reliability and robustness. It is therefore more suited to assembly of large genomes in resource-limited settings (the use case for which it was designed) than it is for prokaryote genome assembly.

Conclusions

Each of the different assemblers has pros and cons, and while no single assembler emerged as an ideal choice for prokaryote genome long-read assembly, the overall best performers were Flye, Miniasm/Minipolish and Raven. Flye was very reliable, especially for plasmid assembly, and was the best performing assembler at low read depths. Miniasm/Minipolish was the only assembler to reliably achieve clean contig circularisation. Raven was the most reliable for chromosome assembly and the most tolerant of low-identity read sets.

For users looking to achieve an optimal assembly, we recommend trying multiple different tools and comparing the results. This will provide the opportunity for validation – confidence in an assembly is greater when it is in agreement with other independent assemblies. It also offers a chance to detect and repair circularisation issues, as different assemblers are likely to give different contig start/end positions for a circular replicon.
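
Because different assemblers start a circular contig at different positions (and possibly on different strands), two assemblies of the same replicon cannot be compared as plain strings. The sketch below shows one simple way to handle this for an exact comparison. It is an illustration only: real assemblies differ by small errors and would need alignment, and all function names here are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_rotation(a, b):
    """True if circular sequence a equals b read from a different start."""
    return len(a) == len(b) and a in (b + b)


def circular_contigs_agree(a, b):
    """True if a and b are the same circular sequence on either strand."""
    return is_rotation(a, b) or is_rotation(a, reverse_complement(b))


def rotate_to_anchor(seq, anchor):
    """Rotate a circular sequence so that it begins with `anchor`."""
    # Doubling lets the anchor be found even if it spans the start/end junction.
    pos = (seq + seq).find(anchor)
    if pos == -1:
        raise ValueError("anchor not found in sequence")
    return seq[pos:] + seq[:pos]


if __name__ == "__main__":
    contig_a = "ACGTTGCAAGGCTTAACCGG"
    contig_b = contig_a[7:] + contig_a[:7]  # same replicon, different start
    print(circular_contigs_agree(contig_a, contig_b))
    print(rotate_to_anchor(contig_b, contig_a[:8]) == contig_a)
```

Rotating every assembly to a shared anchor first also makes it easier to inspect the start/end junctions side by side when looking for circularisation errors.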

An ideal prokaryotic long-read assembler would reliably complete assemblies, be robust against read set problems, be easy to use, have low computational requirements, cleanly circularise contigs and assemble plasmids of any size. The importance of long-read assembly will continue to grow as long-read sequencing becomes more commonplace in microbial genomics, and so development of assemblers towards this ideal is crucial.

Data availability

Underlying data

Figshare: Read sets. https://doi.org/10.26180/5df6f5d06cf04 [16].

These files contain the input read sets (both simulated and real) for assembly.

Figshare: Reference genomes. https://doi.org/10.26180/5df6e99ff3eed [17].

This file contains the reference genomes against which the long-read assemblies were compared. For the simulated read sets, these genomes were the source sequence from which the reads were generated.

Figshare: Assemblies. https://doi.org/10.26180/5df6e2864a658 [33].

These files contain assemblies (in FASTA format), times and terminal outputs for each of the assemblers.

Extended data

Zenodo: Long-read-assembler-comparison. https://doi.org/10.5281/zenodo.2702442 [9].

This project contains the following extended data:

  • Results (tables of results data, including information on each reference genome, read set parameters and metrics for each assembly).

  • Scripts (scripts used to assess assemblies and generate plots).

  • Figure S1. Distributions of chromosome sizes (A), plasmid sizes (B) and per-genome plasmid counts (C) for the reference genomes used to make the simulated read sets.

  • Figure S2. Badread parameter histograms for the simulated read sets. (A) Mean read depths were sampled from a uniform distribution ranging from 5× to 200×. (B) Mean read lengths were sampled from a uniform distribution ranging from 100 to 20000 bp. (C) Read length standard deviations were sampled from a uniform distribution ranging from 100 to twice that set’s mean length (up to 40000 bp). (D) Mean read identities were sampled from a uniform distribution ranging from 80% to 99%. (E) Max read identities were sampled from a uniform distribution ranging from that set’s mean identity plus 1% to 100%. (F) Read identity standard deviations were sampled from a uniform distribution ranging from 1% to the max identity minus the mean identity. (G, H and I) Junk, random and chimera rates were all sampled from an exponential distribution with a mean of 2%. (J) Glitch sizes/skips were sampled from a uniform distribution ranging from 0 to 100. (K) Glitch rates for each set were calculated from the size/skip (s) according to this formula: 100000 / 1.6986^(s/10). (L) Adapter lengths were sampled from an exponential distribution with a mean of 50.

  • Figure S3. Top: the target simulated depth of each replicon relative to the chromosome. The smaller the plasmid, the wider the range of possible depths. Bottom: the absolute read depth of each replicon after read simulation.

  • Figure S4. Commands used for each of the seven assemblers tested.

  • Figure S5. Possible states for the assembly of a circular replicon. Reference sequences are shown in the inner circles in black and aligned contig sequences are shown in the outer circles in colour (red at the contig start to violet at the contig end). (A) Complete assembly with perfect circularisation. (B) Complete assembly but with missing bases leading to a gapped circularisation. (C) Complete assembly but with duplicated bases leading to overlapping circularisation. (D) Incomplete assembly due to fragmentation (multiple contigs per replicon). (E) Incomplete assembly due to missing sequence. (F) Incomplete assembly due to misassembly (noncontiguous sequence in the contig).

  • Figure S6. Reference triplication for assembly assessment. (A) Due to the ambiguous starting position of a circular replicon, a completely-assembled contig will typically not align to the reference in a single unbroken alignment. (B) Doubling the reference sequence will allow for a single alignment, regardless of starting position. (C) However, if the contig contains start/end overlap (i.e. contiguity >100%) then even a doubled reference may not be sufficient to achieve a single alignment, depending on the starting position. (D) A tripled reference allows for an unbroken alignment, regardless of starting position, even in cases of >100% contiguity.

  • Figure S7. Contiguity of the simulated read set assemblies plotted against Badread parameters for each of the tested assemblers. These plots show how well the assemblers tolerate different problems in the read sets. (A) Mean read depth (higher is better). (B) Max read identity (higher is better). (C) N50 read length (higher is better). (D) The sum of random read rate and junk read rate (lower is better). (E) Chimeric read rate (lower is better). (F) Adapter sequence length (lower is better). (G) Glitch size/skip (lower is better).

  • Figure S8. Plasmid completion for the simulated read set assemblies for each of the tested assemblers, plotted with plasmid length and read depth. Solid dots indicate completely assembled plasmids (contiguity ≥99%) while open dots indicate incomplete plasmids (contiguity <99%). Percentages in the plot titles give the proportion of plasmids which were completely assembled.

  • Figure S9. Plasmid completion for the real read set assemblies for each of the tested assemblers, plotted with plasmid length and read depth. Solid dots indicate completely assembled plasmids (contiguity ≥99%) while open dots indicate incomplete plasmids (contiguity <99%). Percentages in the plot titles give the proportion of plasmids which were completely assembled.

  • Figure S10. The relative contiguity of the plasmids for each real read set assembly (A) and simulated read set assembly (B).

  • Figure S11. The maximum indel size in the longest alignment to the chromosome for each real read set assembly (A) and simulated read set assembly (B).
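
The reasoning behind the reference triplication described for Figure S6 can be checked with a toy example. The sketch below uses exact substring matching as a stand-in for alignment, which is a simplification of the actual assessment; `fits_in_copies` and the 20 bp sequence are hypothetical.

```python
def fits_in_copies(contig, reference, copies):
    """True if the contig matches one unbroken stretch of the reference
    concatenated `copies` times (exact matching stands in for alignment)."""
    return contig in reference * copies


if __name__ == "__main__":
    reference = "ACGTTGCAAGGCTTAACCGG"  # toy 20 bp circular replicon

    # Complete contig with a different start position and no overlap.
    rotated = reference[7:] + reference[:7]
    # Contig starting late in the reference with start/end overlap
    # (26 bp for a 20 bp replicon, i.e. contiguity above 100%).
    overlapping = reference[15:] + reference + reference[:1]

    for name, contig in (("rotated", rotated), ("overlapping", overlapping)):
        print(name, [fits_in_copies(contig, reference, n) for n in (1, 2, 3)])
```

In this toy case the rotated contig needs a doubled reference, while the overlapping contig only fits once the reference is tripled, which mirrors panels B to D of Figure S6.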

Extended data are also available on GitHub.

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Acknowledgements

This research was supported by use of the Nectar Research Cloud, a collaborative Australian research platform supported by the National Collaborative Research Infrastructure Strategy (NCRIS).

Notes

[version 2; peer review: 4 approved]

Funding Statement

This work was supported by the Bill & Melinda Gates Foundation, Seattle (grant number OPP1175797) and an Australian Government Research Training Program Scholarship. KEH is supported by a Senior Medical Research Fellowship from the Viertel Foundation of Victoria.

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Original post: https://www.cnblogs.com/wangprince2017/p/13756163.html