A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases


Chirag Jain
Alexander Dilthey
Sergey Koren
Srinivas Aluru
Adam M. Phillippy

Conference paper

First Online: 12 April 2017

11 Citations
17 Mentions
2.6k Downloads

Part of the Lecture Notes in Computer Science book series (LNCS, volume 10229)

Abstract

Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy, but face limited scalability, while faster alignment-free methods typically trade decreased precision for efficiency. In this paper, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290x faster than BWA-MEM with a lower memory footprint and recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each ≥5 kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and 60,000 genomes.
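
As a rough illustration of the MinHash identity estimation idea named in the abstract, the Python sketch below estimates the Jaccard similarity of two sequences from bottom-s k-mer sketches and converts it to an approximate identity. It is not the authors' implementation: the function names, the BLAKE2b hash, the default-free parameters k and s, the use of all k-mers rather than winnowed minimizers, and the Mash-style Jaccard-to-identity conversion are all assumptions made for illustration.

```python
import hashlib
import math


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(seq, k, s):
    """Return the s smallest distinct k-mer hashes of seq, sorted."""
    hashes = {hash_kmer(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:s]


def jaccard_estimate(sketch_a, sketch_b, s):
    """Bottom-s MinHash estimate of the Jaccard similarity.

    Take the s smallest hashes of the merged sketches and count how
    many of them occur in both input sketches.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0  # sequences shorter than k yield empty sketches
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def identity_from_jaccard(j, k):
    """Convert a Jaccard estimate to an approximate identity.

    Assumes a Poisson model of independent errors, giving the
    per-base error rate eps = -(1/k) * ln(2j / (1 + j)).
    """
    if j <= 0.0:
        return 0.0
    eps = -math.log(2.0 * j / (1.0 + j)) / k
    return min(1.0, max(0.0, 1.0 - eps))
```

A caller would sketch the read and a candidate reference window with the same k and s, pass both sketches to `jaccard_estimate`, and feed the result to `identity_from_jaccard`. Larger s lowers the variance of the estimate at the cost of more hashing and memory.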

Keywords

Long read mapping; Jaccard; MinHash; Winnowing; Minimizers; Sketching; Nanopore; PacBio

The rights of this work are transferred to the extent transferable according to title 17 § 105 U.S.C.
