Improving the sensitivity of long read overlap detection using grouped short k-mer matches

Nan Du,
Jiao Chen &
Yanni Sun

BMC Genomics volume 20, Article number: 190 (2019) Cite this article

1645 Accesses
7 Altmetric
Metricsdetails

Abstract

Background

Single-molecule, real-time sequencing (SMRT) developed by Pacific BioSciences produces longer reads than second-generation sequencing technologies such as Illumina. The increased read length enables PacBio sequencing to close gaps in genome assembly, reveal structural variations, and characterize the intra-species variations. It also holds the promise to decipher the community structure in complex microbial communities because long reads help metagenomic assembly. One key step in genome assembly using long reads is to quickly identify reads forming overlaps. Because PacBio data has higher sequencing error rate and lower coverage than popular short read sequencing technologies (such as Illumina), efficient detection of true overlaps requires specially designed algorithms. In particular, there is still a need to improve the sensitivity of detecting small overlaps or overlaps with high error rates in both reads. Addressing this need will enable better assembly for metagenomic data produced by third-generation sequencing technologies.

Results

In this work, we designed and implemented an overlap detection program named GroupK, for third-generation sequencing reads based on grouped k-mer hits. While using k-mer hits for detecting reads’ overlaps has been adopted by several existing programs, our method uses a group of short k-mer hits satisfying statistically derived distance constraints to increase the sensitivity of small overlap detection. Grouped k-mer hit was originally designed for homology search. We are the first to apply group hit for long read overlap detection. The experimental results of applying our pipeline to both simulated and real third-generation sequencing data showed that GroupK enables more sensitive overlap detection, especially for datasets of low sequencing coverage.

Conclusions

GroupK is best used for detecting small overlaps for third-generation sequencing data. It provides a useful supplementary tool to existing ones for more sensitive and accurate overlap detection. The source code is freely available at https://github.com/Strideradu/GroupK.

Background

The increased read length enables third-generation sequencing to close gaps in genome assembly [1, 2], reveal structural variations [3], and quantify gene isoforms with higher accuracy [4] in transcriptomic sequencing. In addition, using long reads holds promise in revealing the microbial community structure and deciphering intra-species variation for microbial communities [5, 6].

Genome assembly using third-generation sequencing data requires dedicated methods and tools. Existing genome assembly tools mainly utilize two types of graph models: overlap graph and de Bruijn graph. When the error rate is low, de Bruijn graph has the theoretical advantage that the graph size does not increase significantly with the sequencing coverage, which is usually high for Illumina datasets. For third-generation sequencing data, the high error rate and low coverage make the overlap graph a sensible choice for genome assembly [7]. A key step in constructing the overlap graph is to identify read pairs that share overlaps, which indicates that these reads are sequenced from the same loci in the underlying genome. Although there are a number of sequence alignment programs available for conducting overlap alignment [8, 9], a majority of them rely on dynamic programming and are too computationally expensive for high throughput sequencing data. Due to high error rates, existing short read overlap detection software using BWT (Burrows-Wheeler transform) or hash table [10, 11] cannot be directly applied to long reads.

Related work

Two strategies are currently being employed to detect overlaps for error-prone long reads. One strategy tries to correct sequencing errors in PacBio (Pacific Biosciences) and ONT (Oxford Nanopore) data before overlap detection. There exist a number of sequencing error correction tools [7, 12]. Some of them rely on hybrid sequencing, which requires preparation of at least two sequencing libraries and several types of sequencing runs and thus is not cost-effective for many applications. Others conduct error correction using long reads only. One representative method is described in Chin et al.’s hierarchical genome-assembly process HGAP [12], whose performance improves with higher read coverage. It is worth noting that for alignment-based error correction methods such as the one in HGAP, an important step is to identify reads that can be aligned quickly. Essentially, techniques used for overlap detection can be used for alignment detection as well.

The second category bypasses the difficulty of error correction and identifies overlaps using raw reads. Various approximate similarity search methods have been applied on PacBio and ONT data [13]. They generally follow seed-chain-align procedure [14]. Seed-based filtration step plays an essential role in controlling the trade-off between sensitivity and computational efficiency. Usually, these methods use short string matches as the filtration step. A short string or k-mer match requires exact matches of k consecutive characters between two sequences. Intuitively, overlapping reads tend to share more common k-mers than non-overlapping reads. Strategies that can quickly find the number of shared k-mers can thus be applied. In this section, we summarize the main strategies of several state-of-the-art overlap detection tools. We highlight the differences between our method and the existing ones in the following section.

MHAP [15], Minimap [14, 16], and DALIGNER [17] all use k-mer matches for identifying candidate overlapping pairs. Due to the high error rate, usually only short k-mers will be applied in order to achieve high sensitivity. However, identifying short k-mer matches between all pairs of reads is computationally expensive. Thus, the leading tools employed different data structures and algorithms for estimating k-mer-based similarity. MHAP converts long reads into sets of k-mers and sketches using minHash. Then the similarity between reads is estimated using the compact sketches. Minimap also uses a compact representation of the original reads by keeping minimizers rather than all possible k-mers of a read. Then collinear k-mers will be clustered and used for checking possible overlaps. DALIGNER directly sorts k-mers based on their positions and then utilizes merge sort to identify the number of shared k-mers. As the sorting is cache efficient, DALIGNER is practically very efficient.

BLASR [18] was initially designed for mapping PacBio reads to a reference genome. It is also widely used as an overlap detection tool for PacBio data. BLASR uses BWT and FM index to identify short k-mer matches and then clusters k-mer matches within a given distance range. The clustered k-mers are ranked based on a function of the k-mer frequency. Only highly ranked clusters will be kept for downstream analysis.

Being different from the above tools, GraphMap [19] uses spaced seeds that allow matches of non-consecutive characters. Spaced seeds were initially used in homology search for improving the trade-off between sensitivity and filtration efficiency [20–22]. In particular, spaced seeds containing the pattern “11*” have high sensitivity in capturing homologous protein-coding genes because of the codon structure. However, designing optimal spaced seeds (i.e., deciding the positions of the wildcard characters) is NP-hard [23, 24]. GraphMap empirically chooses two spaced seeds. Ideally, different sets of seeds may be designed for input data of different error profiles.

Overview of our work

The high error rates and also the different error profiles of PacBio and ONT data motivate us to use a more flexible seeding strategy called group hit criteria [25], which define a group of possibly overlapping k-mers satisfying statistically derived distance constrains. For brevity, we will call the k-mers set satisfying the group hit criteria as a “group seed”. A group seed was initially proposed and used for homology search. Given the error profiles, such as the estimated indels and mismatch probabilities, thresholds for grouping short k-mers can be computed using the waiting time distribution and the one-dimensional random walk [25]. A group seed can effectively handle all types of errors and is ideal to detect small overlaps. With group seeds, we can achieve high sensitivity using short k-mers (e.g., 9-mer) while still maintaining a desirable specificity.

In this work, we employ group seeds for detecting overlapping long reads for genome assembly. Our implementation, named GroupK, provides a complementary tool to existing methods for detecting small overlaps or overlaps compounded by high error rates of both reads. This ability enables our tool a sensible choice for genome assembly in metagenomic data sequenced by third-generation sequencing platforms. As these community samples usually contain microorganisms with heterogeneous coverage, being able to identify small overlaps will be very important for reconstructing genomes of rare species.

Methods

GroupK is designed for improving the sensitivity of detecting small overlaps or overlaps with low identity. Currently, third-generation sequencing data still has high error rates. The overlapping regions formed by two error-prone long reads can have lower sequence identity than mapping a long read against a reference genome. Figure 1 presents the histogram of the overlap size and the corresponding ratio of overlap size to the read length between two adjacent reads. The reads are simulated using PBSIM [26] from E. coli with three different coverages. As we know the position of each simulated read in the genome, the overlap size can be easily decided. Note that a read can form overlaps with multiple reads sequenced from the same region. However, the two figures are generated using overlaps between two adjacent reads, which define an “irreducible” edge [27] in an overlap graph. Thus, these overlaps can decide the continuity of the final genome assembly. The figures show that there are still substantial regions with small overlaps. For example, there are 45.46%, 34.76%, and 31.19% of the overlaps shorter than the 50% of the read length for data with coverage of 8X, 15X, and 30X, respectively. It will be ideal to detect relatively small overlaps to fully take advantage of the long reads for generating more complete assemblies.