Comparative assessment of long-read error correction software applied to Nanopore RNA-sequencing data

Abstract

Motivation

Nanopore long-read sequencing technology offers promising alternatives to high-throughput short-read sequencing, especially in the context of RNA-sequencing. However, this technology is currently hindered by high error rates in the output data that affect analyses such as the identification of isoforms, exon boundaries and open reading frames, and the creation of gene catalogues. Due to the novelty of such data, computational methods are still actively being developed and options for the error correction of Nanopore RNA-sequencing long reads remain limited.

Results

In this article, we evaluate the extent to which existing long-read DNA error correction methods are capable of correcting cDNA Nanopore reads. We provide an automatic and extensive benchmark tool that reports not only classical error correction metrics but also the effect of correction on gene families, isoform diversity, bias toward the major isoform and splice site detection. We find that long-read error correction tools that were originally developed for DNA are also suitable for the correction of Nanopore RNA-sequencing data, especially in terms of increasing base pair accuracy. Yet investigators should be warned that the correction process perturbs gene family sizes and isoform diversity. This work provides guidelines on which (or whether) error correction tools should be used, depending on the application type.

Benchmarking software

https://gitlab.com/leoisl/LR_EC_analyser

Keywords: long reads, RNA-sequencing, Nanopore, error correction, benchmark

Issue Section:

Review Article

Introduction

The most commonly used technique to study transcriptomes is RNA-sequencing. As such, many tools were developed to process Illumina or short RNA-seq reads. Assembling a transcriptome from short reads is a central task for which many methods are available. When a reference genome or reference transcriptome is available, reference-based assemblers can be used (such as Cufflinks [1], Scallop [2], Scripture [3] and StringTie [4]). When no references are available, de novo transcriptome assembly can be performed (using tools such as Oases [5], SOAPdenovo-Trans [6], Trans-ABySS [7] and Trinity [8]). Potential disadvantages of reference-based strategies include the following: (i) the resulting assemblies might be biased toward the used reference, and true variations might be discarded in favour of known isoforms; (ii) they are unsuitable for samples with a partial or missing reference genome [8]; (iii) such methods depend on correct read-to-reference alignment, a task that is complicated by splicing, sequencing errors, polyploidism, multiple read mapping, mismatches caused by genome variation and the lack or incompleteness of many reference genomes [7, 8]; and (iv) sometimes, the model being studied is sufficiently different from the reference, because it comes from a different strain or line, that the mappings are not altogether reliable [5]. On the other hand, some of the shortcomings of de novo transcriptome assemblers are as follows: (i) low-abundance transcripts are likely to not be fully assembled [9]; (ii) reconstruction heuristics are usually employed, which may lead to missing alternative transcripts, and highly similar transcripts are likely to be assembled into a single transcript [10]; (iii) homologous or repetitive regions may result in incomplete assemblies [11]; and (iv) the accuracy of transcript assembly is called into question when a gene exhibits complex isoform expression [11].

Recent advances in long-read sequencing technology have enabled longer, up to full-length, sequencing of RNA molecules. This new approach has the potential to eliminate the need for transcriptome assembly and thus also eliminate from transcriptome analysis pipelines all the biases caused by the assembly step. Long-read sequencing can be done using either cDNA-based or direct RNA protocols from Oxford Nanopore (referred to as ‘ONT’ or ‘Nanopore’) and Pacific Biosciences (PacBio). The Iso-Seq protocol from PacBio consists of a size selection step, sequencing of cDNAs and finally a set of computational steps that produce sequences of full-length transcripts. ONT has three different experimental protocols for sequencing RNA molecules: cDNA transformation with amplification, direct cDNA (with or without amplification) and direct RNA.

Long-read sequencing is increasingly used in transcriptome studies, not only to avoid the problems caused by short-read transcriptome assembly but also for several other reasons. Chief among them, long reads can better describe exon/intron combinations [12]. The Iso-Seq protocol has been used for isoform identification, including transcript identification [13], de novo isoform discovery [14] and fusion transcript detection [15]. Nanopore has recently been used for isoform identification [16] and quantification [17].

The sequencing throughput of long-read technologies has increased significantly over the years. It is now conceivable to sequence a full eukaryote transcriptome using either only long reads, or a combination of high-coverage long and short (Illumina) reads. Unlike the Iso-Seq protocol, which requires extensive in silico processing prior to primary analysis [18], raw Nanopore reads can in principle be readily analysed. Direct RNA reads also permit the analysis of base modifications [19], unlike cDNA-based sequencing technologies. There also exist circular sequencing techniques for Nanopore such as INC-Seq [20] that aim at reducing error rates, at the expense of a special library preparation. With raw long reads, it is up to the primary analysis software (typically a mapping algorithm) to deal with sequences that have a significant per-base error rate, currently around 13% [21].

In principle, a high error rate in the data complicates the analysis of transcriptomes, especially the accurate detection of exon boundaries and the quantification of similar isoforms and paralogous genes. Reads need to be aligned unambiguously and with high base pair accuracy to either a reference genome or transcriptome. Indels (i.e. insertions/deletions) are the main type of errors produced by long-read technologies, and they confuse aligners more than substitution errors [22]. Many methods have been developed to correct errors in RNA-seq reads, mainly in the short-read era [23, 24]. They no longer apply to long reads because they were developed to deal with low error rates, and principally substitutions. However, a new set of methods has been proposed to correct genomic long reads. There exist two types of long-read error correction algorithms: those using information from long reads only (‘self’ or ‘non-hybrid’ correction), and those using short reads to correct long reads (‘hybrid’ correction). In this article, we report on the extent to which state-of-the-art tools are able to correct long noisy RNA-seq reads produced by Nanopore sequencers.
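To make the distinction between substitution and indel errors concrete, the following minimal Python sketch counts each error type in a minimal-edit global alignment of a read against its true sequence. It is an illustration only: the function name `error_profile` is hypothetical, the unit-cost edit distance is a simplification of what real aligners do, and the error rate is defined here (by assumption) as errors divided by alignment columns, whereas other studies may normalise by read or reference length.

```python
def error_profile(reference: str, read: str) -> dict:
    """Count matches, substitutions, insertions and deletions in one
    minimal-edit global alignment of `read` against `reference`.

    Assumption: unit costs for all edits; error rate = errors / alignment columns.
    """
    n, m = len(reference), len(read)
    # dp[i][j] = edit distance between reference[:i] and read[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == read[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match or substitution
                           dp[i - 1][j] + 1,         # deletion from the read
                           dp[i][j - 1] + 1)         # insertion in the read

    # Trace back one optimal alignment, preferring the diagonal move.
    counts = {"match": 0, "substitution": 0, "insertion": 0, "deletion": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if reference[i - 1] == read[j - 1] else 1
            if dp[i][j] == dp[i - 1][j - 1] + cost:
                counts["match" if cost == 0 else "substitution"] += 1
                i, j = i - 1, j - 1
                continue
        if i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            counts["deletion"] += 1   # reference base absent from the read
            i -= 1
        else:
            counts["insertion"] += 1  # extra base present in the read
            j -= 1

    columns = sum(counts.values())
    errors = columns - counts["match"]
    counts["error_rate"] = errors / columns if columns else 0.0
    return counts
```

For instance, `error_profile("ACGTACGT", "ACGACGT")` reports a single deletion over eight alignment columns, i.e. an error rate of 0.125, which is close to the per-base error rate quoted above for raw Nanopore reads.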

Several tools exist for error-correcting long reads, including ONT reads. Although the error profiles of Nanopore and PacBio reads differ, their error rates are quite similar, and it is reasonable to expect tools originally designed for PacBio data to also perform well on recent Nanopore data. There is, to the best of our knowledge, very little prior work that specifically addresses error correction of RNA-seq long reads. Notable exceptions include the following: (i) LSC [25], which is designed to error correct PacBio RNA-seq long reads using Illumina RNA-seq short reads; (ii) PBcR [26] and (iii) HALC [27], which are mainly designed for genomes but are also evaluated on transcriptomic data. Here we take the standpoint of evaluating long-read error correction tools on RNA-seq data, most of which were designed to process DNA sequencing data only.

We evaluate the following DNA hybrid correction tools: HALC [27], LoRDEC [28], NaS [29], PBcR [26] and proovread [30]; and the following DNA self-correction tools: Canu [31], daccord [32], LoRMA [33], MECAT [34] and pbdagcon [35]. We also evaluate an additional hybrid tool, LSC [25], the only one specifically designed to error correct (PacBio) RNA-seq long reads. A majority of hybrid correction methods employ mapping strategies to place short fragments on long reads and correct long-read regions using the related short-read sequences. Some of them, however, rely on graphs to create a consensus that is used for correction. These graphs are either k-mer graphs (de Bruijn graphs) or nucleotide graphs resulting from multiple alignments of sequences (partial order alignment). For self-correction methods, strategies using the aforementioned graphs are the most common. We also considered evaluating nanocorrect [36], nanopolish [36], Falcon_sense [37] and LSCPlus [38], but these tools were deprecated, not suitable for read correction or unavailable. Our detailed justifications can be found in Section S1.12 of the Supplementary Data. We have selected what we believe is a representative set of tools, but there also exist other tools that were not considered in this study, e.g. HG-Color [39], HECIL [40], MIRCA [41], Jabba [42], nanocorr [43] and Racon [44].
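As a toy illustration of the k-mer graph idea behind some hybrid correctors, the Python sketch below collects 'solid' k-mers from short reads, treats them as nodes of an implicit de Bruijn graph, and replaces each weak region of a long read by a graph walk between the flanking solid k-mers. It is not the algorithm of any tool evaluated here: the function names, the `min_count` and `slack` parameters and the rule of choosing the walk whose length is closest to the original region are all assumptions made for illustration, and the exhaustive walk enumeration would not scale to real data.

```python
from collections import Counter


def solid_kmers(short_reads, k, min_count=2):
    """Return the k-mers seen at least `min_count` times in the short reads."""
    counts = Counter(r[i:i + k] for r in short_reads
                     for i in range(len(r) - k + 1))
    return {kmer for kmer, c in counts.items() if c >= min_count}


def bridge(solid, source, target, gap, slack=2):
    """Enumerate walks from `source` to `target` in the implicit de Bruijn
    graph (edge u -> v when u[1:] == v[:-1]) and return the spelled sequence
    whose number of steps is closest to `gap`, or None if no walk exists."""
    best = None
    stack = [(source, source)]
    while stack:
        node, spelled = stack.pop()
        steps = len(spelled) - len(source)
        if node == target and steps > 0:
            if best is None or abs(steps - gap) < abs(len(best) - len(source) - gap):
                best = spelled
        if steps < gap + slack:  # bounded depth guarantees termination
            for base in "ACGT":
                nxt = node[1:] + base
                if nxt in solid:
                    stack.append((nxt, spelled + base))
    return best


def correct_long_read(long_read, solid, k, slack=2):
    """Rewrite regions between consecutive solid k-mers using graph walks;
    regions that cannot be bridged are left unchanged."""
    positions = [i for i in range(len(long_read) - k + 1)
                 if long_read[i:i + k] in solid]
    if not positions:
        return long_read  # nothing to anchor a correction on
    out = long_read[:positions[0] + k]
    for p, q in zip(positions, positions[1:]):
        original = long_read[p + k:q + k]  # bases after source, up to end of target
        path = None
        if q - p > 1:  # at least one weak k-mer lies between the two anchors
            path = bridge(solid, long_read[p:p + k], long_read[q:q + k], q - p, slack)
        out += path[k:] if path else original
    return out + long_read[positions[-1] + k:]
```

With short reads covering the hypothetical transcript `ATGGCGTACGTTAGC` at least twice and `k = 4`, the long read `ATGGCGTCCGTTAGC` (one substitution) is restored to the transcript sequence. Note that the shortest walk between the two anchors would delete four bases here, which is why the sketch prefers the walk whose length best matches the original region.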