The advantages and disadvantages of short- and long-read metagenomics to infer bacterial and eukaryotic community composition

Abstract

Background The first step in understanding ecological community diversity and dynamics is quantifying community membership, and metagenomics is an increasingly common way to do so. Because of the rapidly increasing popularity of this approach, a large number of computational tools and pipelines are available for analysing metagenomic data. However, most of these tools have been designed and benchmarked using highly accurate short-read data (i.e. Illumina), and few studies have benchmarked classification accuracy for long, error-prone reads (PacBio or Oxford Nanopore). In addition, few tools have been benchmarked for non-microbial communities.

Results Here we use simulated error-prone Oxford Nanopore and high-accuracy Illumina read sets to systematically investigate the effects of sequence length and taxon type on classification accuracy for metagenomic data from both microbial and non-microbial communities. We show that, in general, classification accuracy is far lower for non-microbial communities, even at low taxonomic resolution (e.g. family rather than genus).

Conclusions We then show that, for two popular taxonomic classifiers, long error-prone reads can significantly increase classification accuracy, and that this increase is most pronounced for non-microbial communities. This work provides insight into the expected accuracy of metagenomic analyses for different taxonomic groups, and establishes the point at which read length becomes more important than error rate for assigning the correct taxon.
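
As an illustration of what classification accuracy at a given taxonomic resolution can mean for simulated reads, the following minimal sketch scores each read by comparing its known source taxon with the classifier's assignment at a chosen rank. It is not the scoring procedure used in this study: the lineage table, the read list, and the function name `accuracy_at_rank` are hypothetical, and counting unclassified reads as incorrect is an assumption.

```python
# Hypothetical lineage table: taxon -> name at each rank of interest.
# In practice this would come from a taxonomy database, not a literal.
LINEAGES = {
    "Escherichia coli": {
        "species": "Escherichia coli",
        "genus": "Escherichia",
        "family": "Enterobacteriaceae",
    },
    "Salmonella enterica": {
        "species": "Salmonella enterica",
        "genus": "Salmonella",
        "family": "Enterobacteriaceae",
    },
    "Bacillus subtilis": {
        "species": "Bacillus subtilis",
        "genus": "Bacillus",
        "family": "Bacillaceae",
    },
}


def accuracy_at_rank(pairs, rank):
    """Return the fraction of reads assigned to the correct taxon at `rank`.

    `pairs` is a list of (true_taxon, predicted_taxon) tuples, one per
    simulated read. A prediction of None means the read was left
    unclassified; such reads are counted as incorrect (an assumption).
    """
    if not pairs:
        return 0.0
    correct = 0
    for truth, predicted in pairs:
        if predicted is None:
            continue  # unclassified read: contributes to the denominator only
        if LINEAGES[truth][rank] == LINEAGES[predicted][rank]:
            correct += 1
    return correct / len(pairs)


# Illustrative (made-up) simulated reads: (source taxon, classifier output).
READS = [
    ("Escherichia coli", "Escherichia coli"),
    ("Escherichia coli", "Salmonella enterica"),  # wrong genus, right family
    ("Bacillus subtilis", "Bacillus subtilis"),
    ("Salmonella enterica", None),                # unclassified
]

for rank in ("species", "genus", "family"):
    print(rank, accuracy_at_rank(READS, rank))
```

In this toy example the read misassigned to a sister genus is still scored as correct at the family level, which is why accuracy is normally expected to rise as taxonomic resolution is lowered.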
