Canu Tutorial

Link: Canu Tutorial


Canu assembles reads from PacBio RS II or Oxford Nanopore MinION instruments into uniquely-assemblable contigs, called unitigs. Canu owes much of its design and code to celera-assembler.

Canu can be run using hardware of nearly any shape or size, anywhere from laptops to computational grids with thousands of nodes. Obviously, larger assemblies will take a long time to compute on laptops, and smaller assemblies cannot take advantage of hundreds of nodes, so what is being assembled plays some part in determining what hardware can be effectively used.

Most algorithms in canu have been multi-threaded (to use all the cores on a single node), parallelized (to use all the nodes in a grid), or both (all the cores on all the nodes).

Canu, the command

The canu command is the 'executive' program that runs all modules of the assembler. It oversees each of the three top-level tasks (correction, trimming, unitig construction), each of which consists of many steps. Canu ensures that input files for each step exist, that each step successfully finished, and that the output for each step exists. It does minor bits of processing, such as reformatting files, but generally just executes other programs.

canu 
   -d <working-directory> 
   -p <file-prefix> 
  [-s specifications] 
  [-correct | -trim | -assemble] 
   errorRate=<fraction-error> 
   genomeSize=<genome-size>
  [parameters] 
  [-pacbio-raw         <read-file>]
  [-pacbio-corrected   <read-file>]
  [-nanopore-raw       <read-file>]
  [-nanopore-corrected <read-file>]

Two options are mandatory: -d, to set the working directory, and -p, to set the file name prefix. All work is performed and all output appears in the working directory; the directory need not exist before starting. Most files in this directory have names beginning with the file name prefix. Note, however, that running two canu commands in the same directory will probably lead to confusion.

The -s option will import a list of parameters from the supplied specification ('spec') file. These parameters are applied before any from the command line, providing a method for setting commonly used parameters while still overriding them for specific assemblies.

By default, all three top-level tasks are performed. It is possible to run exactly one task by using the -correct, -trim or -assemble options. These options can be useful if you want to correct reads once and try many different assemblies. We do exactly that in the Canu Quick Start.

Parameters are key=value pairs that configure the assembler. They set run time parameters (e.g., memory, threads, grid), algorithmic parameters (e.g., error rates, trimming aggressiveness), and enable or disable entire processing steps (e.g., don't correct errors, don't search for subreads). They are described later. One parameter is required: genomeSize (in bases, with common SI prefixes allowed, for example, 4.7m or 2.8g; see Global Options). Parameters are listed in the Canu Parameter Reference, but the common ones are described in this document.
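
To illustrate how a genomeSize value such as "4.7m" or "2.8g" maps to a number of bases, here is a minimal Python sketch; parse_genome_size is a hypothetical helper written for this document, not a function canu provides:

```python
# Hypothetical illustration of SI-suffixed genome sizes; not canu code.
def parse_genome_size(value: str) -> int:
    """Convert a genome size with an optional SI suffix to bases."""
    multipliers = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}
    value = value.strip().lower()
    if value and value[-1] in multipliers:
        return round(float(value[:-1]) * multipliers[value[-1]])
    return round(float(value))

print(parse_genome_size("4.7m"))  # 4700000
print(parse_genome_size("2.8g"))  # 2800000000
```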

Reads are supplied to canu by options that describe how the reads were generated and what level of quality they are; for example, -pacbio-raw indicates the reads were generated on a PacBio RS II instrument and have had no processing done to them. Each file of reads supplied this way becomes a 'library' of reads. The reads should have been (physically) generated all at the same time using the same steps, but perhaps sequenced in multiple batches. In canu, each library has a set of options setting various algorithmic parameters, for example, how aggressively to trim. To explicitly set library parameters, a text 'gkp' file describing the library and the input files must be created. Don't worry too much about this yet; it's an advanced feature, fully described in Section gkp-files.

The read files contain sequence data in either FASTA or FASTQ format (or both! A quirk of the implementation allows files that contain both FASTA and FASTQ format reads). The files can be uncompressed, or gzip, bzip2 or xz compressed. We've found that "gzip -1" provides good compression that is fast to both compress and decompress. For 'archival' purposes, we use "xz -9".

Canu, the pipeline

The canu pipeline, that is, what it actually computes, consists of computing overlaps and processing the overlaps to some result. Each of the three tasks (read correction, read trimming and unitig construction) follows the same pattern:

  • Load reads into the read database, gkpStore.
  • Compute k-mer counts in preparation for the overlap computation.
  • Compute overlaps.
  • Load overlaps into the overlap database, ovlStore.
  • Do something interesting with the reads and overlaps.
    • The read correction task will replace the original noisy read sequences with consensus sequences computed from overlapping reads.
    • The read trimming task will use overlapping reads to decide what regions of each read are high-quality sequence, and what regions should be trimmed. After trimming, the single largest high-quality chunk of sequence is retained.
    • The unitig construction task finds sets of overlaps that are consistent, and uses those to place reads into a multialignment layout (the 'layout' step of overlap-layout-consensus). The layout is then used to generate a consensus sequence (the 'consensus' step) for the unitig.

 

Module Tags


Because each of the three tasks shares common algorithms (all compute overlaps, two compute consensus sequences, etc.), parameters are differentiated by a short prefix 'tag' string. This lets canu have one generic parameter that can be set to different values for each stage in each task.

For example, corOvlMemory will set memory usage for overlaps being generated for read correction; obtOvlMemory for overlaps generated for Overlap Based Trimming; utgOvlMemory for overlaps generated for unitig construction.

The tags are:

We'll get to the details eventually.

Execution Configuration

There are two modes that canu runs in: locally, using just one machine, or grid-enabled, using multiple hosts managed by a grid engine. At present, Sun Grid Engine (also known as Open Grid Engine or Univa Grid Engine), PBS, and Slurm are supported. LSF support is also included but has limited testing. Section Grid Engine Configuration has a few hints on how to set up a new grid engine.

To enable execution of the parallel steps on the grid, set useGrid=1. The canu pipeline will immediately submit itself to the grid, and run entirely under grid control. This is the default if a grid engine is detected on your system. If you prefer to run locally, set useGrid=0.

In both cases, local or grid, Canu will auto-detect available resources and scale the jobs to run based on the resources and the genome size you're assembling. Thus, most users should be able to run the command without modifying the defaults. Some advanced options are outlined below. Each stage has the same five configuration options, and tags are used to specialize the option to a specific stage. The options are:

useGrid<tag>=boolean
Run this stage on the grid, usually in parallel.

gridOptions<tag>=string
Supply this string to the grid submit command.

<tag>Memory=integer
Use this many gigabytes of memory, per process.

<tag>Threads=integer
Use this many compute threads per process.

<tag>Concurrency=integer
If not on the grid, run this many jobs at the same time.

Global grid options, applied to every job submitted to the grid, can be set with 'gridOptions'. This can be used to add accounting information or access credentials.

A name can be associated with this compute using 'gridOptionsJobName'. Canu will work just fine with no name set, but if multiple canu assemblies are running at the same time, they will tend to wait for each other's jobs to finish. For example, if two assemblies are running, at some point both will have overlap jobs running, and each assembly will be waiting for all jobs named 'ovl_asm' to finish. Had the assemblies specified job names, gridOptionsJobName=apple and gridOptionsJobName=orange, then one would be waiting for jobs named 'ovl_asm_apple', and the other for jobs named 'ovl_asm_orange'.

Error Rates

Canu expects all error rates to be reported as fraction error (e.g., 0.01), not as percent error (1%). We're not sure exactly why this is so. Previously, it used a mix of fraction error and percent error (or both!), and was a little confusing. Here's a handy table that converts between fraction error and percent error. Not all values are shown (it'd be quite a large table) but we have every confidence you can figure out the missing values:

Fraction Error   Percent Error
0.01             1%
0.02             2%
0.03             3%
...              ...
0.12             12%
...              ...

Eventually, we want to have Canu take a single error rate, the error rate of a single input read, and derive all other rates from there. This is the parameter errorRate. Currently, the defaults are 0.025 for PacBio sequences and 0.05 for Oxford Nanopore sequences. Typically, you should not need to modify this setting. However, the error rate does affect runtime, and lowering it can significantly speed up your assembly. Thus, for low-coverage datasets (<=30X) we recommend increasing the error rate slightly (by 1%, so errorRate=0.035 for PacBio) and for high-coverage (>=60X) datasets, we recommend decreasing it (by 1%, so errorRate=0.015 for PacBio).
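
The coverage advice above amounts to a simple rule of thumb. The sketch below encodes it in Python purely for illustration; suggest_error_rate is a hypothetical helper, not a canu option or program:

```python
# Hypothetical helper encoding the coverage-based rule of thumb above;
# it is not part of canu.
def suggest_error_rate(coverage: float, technology: str = "pacbio") -> float:
    """Suggest an errorRate value from read coverage and technology."""
    base = {"pacbio": 0.025, "nanopore": 0.05}[technology]  # canu defaults
    if coverage <= 30:       # low coverage: loosen by 1% to keep overlaps
        return round(base + 0.01, 3)
    if coverage >= 60:       # high coverage: tighten by 1%; also faster
        return round(base - 0.01, 3)
    return base              # otherwise keep the default

print(suggest_error_rate(25))   # 0.035
print(suggest_error_rate(80))   # 0.015
```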

The following error rates are defined:

errorRate
The expected error rate in a corrected single read. This will set the remaining error rates implicitly.

Recall that three sets of overlaps are generated: one for read correction, one for read trimming, and one for unitig construction. The read correction overlaps are typically generated by the 'mhap' overlapper, which doesn't need an error rate as it doesn't generate alignment-based overlaps. The other two sets of overlaps are generated by overlapInCore, which is alignment based, and so requires a maximum error rate. Thus, out of the six possible tags, only two are currently useful for overlap generation:

obtOvlErrorRate
Do not compute overlaps used for trimming above this error rate. Applies to the standard overlapper, and realigning mhap overlaps.

utgOvlErrorRate
Do not compute overlaps used for unitig construction above this error rate. Applies to the standard overlapper, and realigning mhap overlaps.

(ADVANCED) It is possible to convert the mhap or minimap overlaps to alignment-based overlaps using obtReAlign=true or ovlReAlign=true. If so, the overlaps will be computed using either obtOvlErrorRate or utgOvlErrorRate, depending on which overlaps are being generated.

Be sure not to confuse obtOvlErrorRate with obtErrorRate:

obtErrorRate
Filter overlaps during OBT's read trimming and read splitting.

In celera-assembler, consensus generation required an overlap. In canu, this is no longer used, but the cnsErrorRate option still exists.

Minimum Lengths

Two minimum sizes are known:

minReadLength
Discard reads shorter than this when loading into the assembler, and when trimming reads.

minOverlapLength
Do not save overlaps shorter than this.

Overlap Configuration

The largest compute of the assembler is also the most complicated to configure. As shown in the 'module tags' section, there are up to eight (!) different overlapper configurations. For each overlapper ('ovl' or 'mhap') there is a global configuration, and three specializations that apply to each stage in the pipeline (correction, trimming or assembly).

As with 'grid configuration', overlap configuration uses a 'tag' prefix applied to each option. The tags in this instance are 'cor', 'obt' and 'utg'.

For example:

  • To change the k-mer size for all instances of the ovl overlapper, 'merSize=23' would be used.
  • To change the k-mer size for just the ovl overlapper used during correction, 'corMerSize=16' would be used.
  • To change the mhap k-mer size for all instances, 'mhapMerSize=18' would be used.
  • To change the mhap k-mer size just during correction, 'corMhapMerSize=15' would be used.
  • To use minimap for overlap computation just during correction, 'corOverlapper=minimap' would be used.

Ovl Overlapper Configuration

<tag>Overlapper
Select the overlap algorithm to use, 'ovl' or 'mhap'.

Ovl Overlapper Parameters

<tag>ovlHashBlockLength
How many bases of reads to include in the hash table; directly controls process size.

<tag>ovlRefBlockSize
How many reads to compute overlaps for in one process; directly controls process time.

<tag>ovlRefBlockLength
Same, but uses 'bases in reads' instead of 'number of reads'.

<tag>ovlHashBits
Size of the hash table (SHOULD BE REMOVED AND COMPUTED, MAYBE TWO PASS).

<tag>ovlHashLoad
How much to fill the hash table before computing overlaps (SHOULD BE REMOVED).

<tag>ovlMerSize
Size of the k-mer seed; smaller is more sensitive, but slower.

The overlapper will not use frequent k-mers to seed overlaps. These are computed by the 'meryl' program, and can be selected in one of three ways.

Terminology: a k-mer is a contiguous sequence of k bases. The read ACTTA has two 4-mers: ACTT and CTTA. To account for reverse-complement sequence, a 'canonical' k-mer is the lexicographically smaller of the forward and reverse-complemented k-mer sequence. K-mer ACTT, with reverse complement AAGT, has canonical k-mer AAGT. K-mer CTTA, with reverse complement TAAG, has canonical k-mer CTTA.
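
The canonical k-mer rule is easy to state in code. A minimal Python sketch of the definition (illustrative only; canu computes this internally):

```python
# Canonical k-mer: the lexicographically smaller of a k-mer and its
# reverse complement, per the definition above. Illustrative, not canu code.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer: str) -> str:
    return kmer.translate(COMPLEMENT)[::-1]

def canonical(kmer: str) -> str:
    return min(kmer, reverse_complement(kmer))

print(canonical("ACTT"))  # AAGT (reverse complement wins)
print(canonical("CTTA"))  # CTTA (forward wins over TAAG)
```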

A 'distinct' k-mer is the k-mer sequence with no count associated with it. A 'total' k-mer (for lack of a better term) is the k-mer with its count. The sequence TCGTTTTTTTCGTCG has (forward) 12 'total' 4-mers and 8 'distinct' k-mers.

    TCGTTTTTTTCGTCG  count
    TCGT             2 distinct-1
     CGTT            1 distinct-2
      GTTT           1 distinct-3
       TTTT          4 distinct-4
        TTTT         4 copy of distinct-4
         TTTT        4 copy of distinct-4
          TTTT       4 copy of distinct-4
           TTTC      1 distinct-5
            TTCG     1 distinct-6
             TCGT    2 copy of distinct-1
              CGTC   1 distinct-7
               GTCG  1 distinct-8
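
The counts and threshold fractions in this example can be reproduced with a short Python sketch (forward-strand 4-mers only, matching the example; illustrative, not canu code):

```python
# Count forward-strand 4-mers of the example sequence and compute the
# fraction at or below a count threshold of 2, as used by the
# MerDistinct and MerTotal options described below.
from collections import Counter

seq, k = "TCGTTTTTTTCGTCG", 4
counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

total = sum(counts.values())   # 'total' k-mers
distinct = len(counts)         # 'distinct' k-mers

threshold = 2
frac_distinct = sum(1 for c in counts.values() if c <= threshold) / distinct
frac_total = sum(c for c in counts.values() if c <= threshold) / total

print(total, distinct)       # 12 8
print(frac_distinct)         # 0.875
print(round(frac_total, 4))  # 0.6667
```
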
<tag>MerThreshold
Any k-mer with count higher than N is not used.

<tag>MerDistinct
Pick a threshold so as to seed overlaps using this fraction of all distinct k-mers in the input. In the example above, a threshold of 2 covers fraction 0.875 of the distinct k-mers (7/8).

<tag>MerTotal
Pick a threshold so as to seed overlaps using this fraction of all k-mers in the input. In the example above, a threshold of 2 covers fraction 0.6667 of all k-mers (8/12).

<tag>FrequentMers
Don't compute frequent k-mers; use those listed in this fasta file.

Mhap Overlapper Parameters

<tag>MhapBlockSize
Chunk of reads that can fit into 1GB of memory. Combined with memory to compute the size of the chunks the reads are split into.

<tag>MhapMerSize
Use k-mers of this size for detecting overlaps.

<tag>ReAlign
After computing overlaps with mhap, compute a sequence alignment for each overlap.

<tag>MhapSensitivity
Either 'normal', 'high', or 'fast'.

Mhap will also down-weight frequent k-mers (using tf-idf), but its selection of frequent k-mers is not exposed.

Minimap Overlapper Parameters

<tag>MMapBlockSize
Chunk of reads that can fit into 1GB of memory. Combined with memory to compute the size of the chunks the reads are split into.

<tag>MMapMerSize
Use k-mers of this size for detecting overlaps.

Minimap will also ignore high-frequency minimizers, but its selection of frequent minimizers is not exposed.

Original post: https://www.cnblogs.com/leezx/p/5713171.html