Variation calling and annotation

Resequencing 302 wild and cultivated accessions identifies genes related to domestication and improvement in soybean

This text is excerpted from "Resequencing 302 wild and cultivated accessions identifies genes related to domestication and improvement in soybean".

Variation calling and annotation.

Mapping.

SAMtools (Version: 0.1.18) was used to convert the mapping results into BAM format and to filter out unmapped and non-uniquely mapped reads.
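
The filtering criteria can be pictured on plain SAM text records. The following is a minimal sketch, not the authors' command. It assumes that "unmapped" corresponds to SAM flag bit 0x4 and, as an assumption not stated in the source, treats reads with mapping quality 0 as non-unique. The function name keep_read, the min_mapq parameter, and the three toy records are hypothetical.

```python
def keep_read(sam_line, min_mapq=1):
    """Return True if a SAM record is mapped and passes the MAPQ cutoff."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2: bitwise FLAG
    mapq = int(fields[4])  # column 5: mapping quality
    if flag & 0x4:         # 0x4 = segment unmapped
        return False
    # Assumption: a MAPQ below the cutoff is treated as a non-unique hit.
    return mapq >= min_mapq


# Toy records: a confidently mapped read, an unmapped read, a MAPQ-0 read.
records = [
    "r1\t0\tGm01\t100\t37\t50M\t*\t0\t0\t*\t*",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r3\t16\tGm01\t200\t0\t50M\t*\t0\t0\t*\t*",
]
kept = [r.split("\t")[0] for r in records if keep_read(r)]
```

Only the first record survives here. Other definitions of uniqueness are possible, so the cutoff should be read as a placeholder.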

Duplicated reads were filtered with the Picard package (picard.sourceforge.net, Version: 1.87).

The BEDtools (Version: 2.17.0) coverageBed program was used to compute the coverage of sequence alignments. (A sequence was defined as absent if coverage was lower than 90% and present if coverage was greater than 90%.)
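
The presence/absence rule can be written as a small function. This is a sketch under assumptions: coverage is taken as covered bases divided by sequence length, and a sequence at exactly 90% (a case the text leaves undefined) is reported as absent. The name classify_presence and its parameters are hypothetical.

```python
def classify_presence(covered_bases, sequence_length, threshold=0.90):
    """Label a sequence 'present' or 'absent' from its covered fraction."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    fraction = covered_bases / sequence_length
    # Assumption: coverage exactly equal to the threshold counts as absent.
    return "present" if fraction > threshold else "absent"


labels = [classify_presence(c, 1000) for c in (950, 500)]
```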

SNP calling.

SNP detection was performed using the Genome Analysis Toolkit (GATK, version 2.4-7-g5e89f01) and SAMtools. Only the SNPs detected by both methods were analyzed further.
The detailed processes were as follows:
(1) After BWA alignment, the reads around indels were realigned.
Realignment was performed with GATK in two steps.
The first step used the RealignerTargetCreator tool to identify regions where realignment was needed.
The second step used IndelRealigner to realign the regions found in the first step, which produced a realigned BAM file for each accession.
(2) SNPs were called at the population level with GATK and SAMtools. For GATK, SNPs were required to have a confidence score greater than 30, and the parameter -stand_call_conf was set to 30. The same realigned BAM files were used for SNP calling with the SAMtools mpileup command.
(3) In the filter step, we selected the common sites identified by both GATK and SAMtools with the SelectVariants tool; SNPs with allele frequencies lower than 1% in the population were discarded.
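
The logic of step (3) can be sketched in plain Python. This is not the SelectVariants implementation, only an illustration of intersecting two call sets and applying a 1% frequency cutoff. It assumes that sites are keyed by (chromosome, position), that genotypes are diploid with 0 for the reference allele and 1 for the alternative allele, and that "allele frequency" means minor allele frequency. All names and the toy data are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """genotypes: list of (allele1, allele2) tuples, 0 = ref, 1 = alt; None = missing."""
    alleles = [a for gt in genotypes if gt is not None for a in gt]
    if not alleles:
        return 0.0
    alt_freq = sum(alleles) / len(alleles)
    return min(alt_freq, 1.0 - alt_freq)


def consensus_snps(gatk_sites, samtools_sites, genotypes_by_site, min_maf=0.01):
    """Keep sites called by both tools whose MAF is at least min_maf."""
    common = set(gatk_sites) & set(samtools_sites)
    return sorted(
        site for site in common
        if minor_allele_frequency(genotypes_by_site[site]) >= min_maf
    )


# Toy call sets: two sites are shared, one of them is monomorphic.
gatk = {("Gm01", 100), ("Gm01", 250), ("Gm02", 40)}
samtools = {("Gm01", 100), ("Gm02", 40), ("Gm02", 900)}
genotypes = {
    ("Gm01", 100): [(0, 1), (1, 1), (0, 0)],
    ("Gm02", 40): [(0, 0)] * 200,
}
kept_sites = consensus_snps(gatk, samtools, genotypes)
```

In this toy input only ("Gm01", 100) is retained: the other shared site has no alternative alleles and falls below the cutoff.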

Indel calling.

Indel calling was similar to SNP calling but with the UnifiedGenotyper parameter -glm INDEL for the indel report only. Only insertions and deletions shorter than or equal to 6 bp were taken into account.
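
The length filter can be illustrated on VCF-style REF and ALT alleles. This is a minimal sketch that assumes the usual VCF convention in which an indel is anchored on one shared leading base, so the indel length is the difference between the two allele lengths. The function names are hypothetical.

```python
def indel_length(ref, alt):
    """Number of inserted or deleted bases for a REF/ALT allele pair."""
    return abs(len(alt) - len(ref))


def keep_indel(ref, alt, max_len=6):
    """True for insertions or deletions of 1 to max_len bp."""
    length = indel_length(ref, alt)
    return 0 < length <= max_len


examples = [
    keep_indel("A", "ATTT"),      # 3 bp insertion
    keep_indel("ACGTACGT", "A"),  # 7 bp deletion
    keep_indel("A", "G"),         # substitution, not an indel
]
```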

Annotation.

SNP annotation was performed according to the genome using the package ANNOVAR (Version: 2013-08-23).
Based on the genome annotation, SNPs were categorized in exonic regions (overlapping with a coding exon), splicing sites (within 2 bp of a splicing junction), 5′UTRs and 3′UTRs, intronic regions (overlapping with an intron), upstream and downstream regions (within a 1 kb region upstream or downstream from the transcription start site), and intergenic regions.

SNPs in coding exons were further grouped into synonymous SNPs (did not cause amino acid changes) or nonsynonymous SNPs (caused amino acid changes; mutations causing stop gain and stop loss were also classified into this group).
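
The grouping of coding SNPs can be sketched by translating the reference and mutated codons with the standard genetic code. This is an illustration of the classification rule, not ANNOVAR's code. It assumes the codons are given in frame on the coding strand, and the function name and return labels are hypothetical.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code; '*' marks a stop codon.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_coding_snp(ref_codon, alt_codon):
    """Return (group, effect) for a single-codon change."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return ("synonymous", "no amino acid change")
    # Stop gain and stop loss are counted within the nonsynonymous group.
    if alt_aa == "*":
        return ("nonsynonymous", "stop gain")
    if ref_aa == "*":
        return ("nonsynonymous", "stop loss")
    return ("nonsynonymous", "amino acid change")


results = [
    classify_coding_snp("GAA", "GAG"),  # Glu -> Glu
    classify_coding_snp("GAA", "GAT"),  # Glu -> Asp
    classify_coding_snp("TGG", "TGA"),  # Trp -> stop
    classify_coding_snp("TAA", "CAA"),  # stop -> Gln
]
```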

Indels in the exonic regions were classified by whether they caused frame-shift mutations: an insertion or deletion whose length is a multiple of 3 bp preserves the reading frame, whereas any other length shifts it.
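
The frame-shift test reduces to a remainder check on the indel length, since an exonic indel keeps the reading frame only when its length is a multiple of 3. A minimal sketch, assuming VCF-style REF and ALT alleles and a hypothetical function name:

```python
def is_frameshift(ref, alt):
    """True if an exonic indel changes the reading frame."""
    length = abs(len(alt) - len(ref))
    if length == 0:
        raise ValueError("REF and ALT have equal length; not an indel")
    return length % 3 != 0


calls = [
    is_frameshift("A", "AT"),       # 1 bp insertion
    is_frameshift("A", "ATTT"),     # 3 bp insertion
    is_frameshift("ACGTACG", "A"),  # 6 bp deletion
]
```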