转录组分析---Hisat2+StringTie+Ballgown使用

(2016-10-10 08:14:45)

转载▼

标签：

生物信息学

转录组

1.Hisat2建立基因组索引：

First, using the python scripts included in the HISAT2 package, extract splice-site and exon information from the gene

annotation file:

$ extract_splice_sites.py gemome.gtf >genome.ss#得到剪接位点信息

$ extract_exons.py genome.gtf >genome.exon#得到外显子信息

Second, build a HISAT2 index:

$ hisat2-build --ss genome.ss --exon genome.exon genome.fa genome

备注extract_splice_sites.py 和 extract_exons.py 在hisat2软件包中涵盖了，这两步不是必须的，只是为了发现剪切位点，也可以直接：

$ hisat2-build genome.fa genome

2. 利用hisat2比对到基因组：

hisat2 -p 8 --dta -x genome -1 file1_1.fastq.gz -2 file1_2.fastq.gz -S file1.sam

hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 file2_1.fastq.gz -2 file2_2.fastq.gz -S file2.sam

备注：--dta:输出转录组型的报告文件

-x：基因组索引

-S : 输出sam文件

-p: 线程数

其他参数：

Input:

-q query input files are FASTQ .fq/.fastq (default)

--qseq query input files are in Illumina's qseq format

-f query input files are (multi-)FASTA .fa/.mfa

-r query input files are raw one-sequence-per-line

-c , , are sequences themselves, not files

-s/--skip skip the first reads/pairs in the input (none)

-u/--upto stop after first reads/pairs (no limit)

-5/--trim5 trim bases from 5'/left end of reads (0)

-3/--trim3 trim bases from 3'/right end of reads (0)

--phred33 qualities are Phred+33 (default)

--phred64 qualities are Phred+64

--int-quals qualities encoded as space-delimited integers

Alignment:

-N max # mismatches in seed alignment; can be 0 or 1 (0)

-L length of seed substrings; must be >3, <32 (22)

-i interval between seed substrings w/r/t read len (S,1,1.15)

--n-ceil func for max # non-A/C/G/Ts permitted in aln (L,0,0.15)

--dpad include extra ref chars on sides of DP table (15)

--gbar disallow gaps within nucs of read extremes (4)

--ignore-quals treat all quality values as 30 on Phred scale (off)

--nofw do not align forward (original) version of read (off)

--norc do not align reverse-complement version of read (off)

Spliced Alignment:

--pen-cansplice penalty for a canonical splice site (0)

--pen-noncansplice penalty for a non-canonical splice site (12)

--pen-canintronlen penalty for long introns (G,-8,1) with canonical splice sites

--pen-noncanintronlen penalty for long introns (G,-8,1) with noncanonical splice sites

--min-intronlen minimum intron length (20)

--max-intronlen maximum intron length (500000)

--known-splicesite-infile provide a list of known splice sites

--novel-splicesite-outfile report a list of splice sites

--novel-splicesite-infile provide a list of novel splice sites

--no-temp-splicesite disable the use of splice sites found

--no-spliced-alignment disable spliced alignment

--rna-strandness Specify strand-specific information (unstranded)

--tmo Reports only those alignments within known transcriptome

--dta Reports alignments tailored for transcript assemblers

--dta-cufflinks Reports alignments tailored specifically for cufflinks

Scoring:

--ma match bonus (0 for --end-to-end, 2 for --local)

--mp , max and min penalties for mismatch; lower qual = lower penalty <2,6>

--sp , max and min penalties for soft-clipping; lower qual = lower penalty <1,2>

--np penalty for non-A/C/G/Ts in read/ref (1)

--rdg , read gap open, extend penalties (5,3)

--rfg , reference gap open, extend penalties (5,3)

--score-min min acceptable alignment score w/r/t read length

(L,0.0,-0.2)

Reporting:

(default) look for multiple alignments, report best, with MAPQ

-k report up to alns per read; MAPQ not meaningful

-a/--all report all alignments; very slow, MAPQ not meaningful

Effort:

-D give up extending after failed extends in a row (15)

-R for reads w/ repetitive seeds, try sets of seeds (2)

Paired-end:

--fr/--rf/--ff -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr)

--no-mixed suppress unpaired alignments for paired reads

--no-discordant suppress discordant alignments for paired reads

Output:

-t/--time print wall-clock time taken by search phases

--un write unpaired reads that didn't align to

--al write unpaired reads that aligned at least once to

--un-conc write pairs that didn't align concordantly to

--al-conc write pairs that aligned concordantly at least once to

(Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g.

--un-gz , to gzip compress output, or add '-bz2' to bzip2 compress output.)

--quiet print nothing to stderr except serious errors

--met-file send metrics to file at (off)

--met-stderr send metrics to stderr (off)

--met report internal counters & metrics every secs (1)

--no-head supppress header lines, i.e. lines starting with @

--no-sq supppress @SQ header lines

--rg-id set read group id, reflected in @RG line and RG:Z: opt field

--rg add ("lab:value") to @RG line of SAM header.

Note: @RG line only printed when --rg-id is set.

--omit-sec-seq put '*' in SEQ and QUAL fields for secondary alignments.

Performance:

-o/--offrate override offrate of index; must be >= index's offrate

-p/--threads number of alignment threads to launch (1)

--reorder force SAM output order to match order of input reads

--mm use memory-mapped I/O for index; many 'bowtie's can share

Other:

--qc-filter filter out reads that are bad according to QSEQ filter

--seed seed for random number generator (0)

--non-deterministic seed rand. gen. arbitrarily instead of using read attributes

--version print version information and quit

-h/--help print this usage message

3. 将sam文件sort并转化成bam:

$ samtools sort -@ 8 -o file1.bam file1.sam

$ samtools sort -@ 8 -o file2.bam file2.sam

4. 组装转录本：

$ stringtie -p 8 -G genome.gtf -o file1.gtf –l file1 file1.bam

$ stringtie -p 8 -G genome.gtf -o file2.gtf –l file2 file2.bam

lncRNA (-f 0.01 -a 10 -j 1 -c 0.01)

其中：

-G reference annotation to use for guiding the assembly process (GTF/GFF3)

-l name prefix for output transcripts (default: STRG)

-f minimum isoform fraction (default: 0.1)

-m minimum assembled transcript length (default: 200)

-o output path/file name for the assembled transcripts GTF (default: stdout)

-a minimum anchor length for junctions (default: 10)

-j minimum junction coverage (default: 1)

-t disable trimming of predicted transcripts based on coverage

(default: coverage trimming is enabled)

-c minimum reads per bp coverage to consider for transcript assembly

(default: 2.5)

-v verbose (log bundle processing details)

-g gap between read mappings triggering a new bundle (default: 50)

-C output a file with reference transcripts that are covered by reads

-M fraction of bundle allowed to be covered by multi-hit reads (default:0.95)

-p number of threads (CPUs) to use (default: 1)

-A gene abundance estimation output file

-B enable output of Ballgown table files which will be created in the

same directory as the output GTF (requires -G, -o recommended)

-b enable output of Ballgown table files but these files will be

created under the directory path given as

-e only estimate the abundance of given reference transcripts (requires -G)

-x do not assemble any transcripts on the given reference sequence(s)

-h print this usage message and exit

5. 合并所有样本的gtf文件

$ stringtie --merge -p 8 -G genome.gtf -o stringtie_merged.gtf mergelist.txt

6. 新转录本的注释（lncRNA必备，普通转录组忽略）

gffcompare –r genomegtf –G –o merged stringtie_merged.gtf

备注：gffcompare 是独立软件，下载地址http://ccb.jhu.edu/software/stringtie/gff.shtml，结果如下;

= Predicted transcript has exactly the same introns as the reference transcript

c Predicted transcript is contained within the reference transcript

j Predicted transcript is a potential novel isoform that shares at least one splice junction with a reference transcript

e Predicted single-exon transcript overlaps a reference exon plus at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment

i Predicted transcript falls entirely within a reference intron

o Exon of predicted transcript overlaps a reference transcript

p Predicted transcript lies within 2 kb of a reference transcript (possible polymerase run-on fragment)

r Predicted transcript has >50% of its bases overlapping a soft-masked (repetitive) reference sequence

u Predicted transcript is intergenic in comparison with known reference transcripts

x Exon of predicted transcript overlaps reference but lies on the opposite strand

s Intron of predicted transcript overlaps a reference intron on the opposite strand

7. 转录本定量和下游ballgown软件原始文件构建：

$ stringtie –e –B -p 8 -G stringtie_merged.gtf -o ballgown/file1/file1.gtf file1.bam

$ stringtie –e –B -p 8 -G stringtie_merged.gtf -o ballgown/file2/file2.gtf file2.bam

8. Ballgown差异表达分析：

>library(ballgown)

>library(RSkittleBrewer)

>library(genefilter)

>library(dplyr)

>library(devtools)

>pheno_data = read.csv("geuvadis_phenodata.csv")#读取表型数据

>bg_chrX = ballgown(dataDir = "ballgown", samplePattern = "file", pData=pheno_data)#读取表达量

>bg_chrX_filt = subset(bg_chrX,"rowVars(texpr(bg_chrX)) >1",genomesubset=TRUE)#过滤低表达量基因

>results_transcripts = stattest(bg_chrX_filt,feature="transcript",covariate="sex",adjustvars =c("population"), getFC=TRUE, meas="FPKM")#差异表达分析，运用的是一般线性模型，比较组sex，影响因素：population

>results_genes = stattest(bg_chrX_filt, feature="gene",covariate="sex", adjustvars = c("population"), getFC=TRUE,meas="FPKM")#基因差异表达

>results_transcripts=data.frame(geneNames=ballgown::geneNames(bg_chrX_filt),geneIDs=ballgown::geneIDs(bg_chrX_filt), results_transcripts)#增加基因名字，id

>results_transcripts = arrange(results_transcripts,pval)#按pval sort

>results_genes = arrange(results_genes,pval)

>write.csv(results_transcripts, "chrX_transcript_results.csv",

row.names=FALSE)

>write.csv(results_genes, "chrX_gene_results.csv",

row.names=FALSE)

>subset(results_transcripts,results_transcripts$qval<0.05)

>subset(results_genes,results_genes$qval<0.05)

9. 结果可视化：

>tropical= c('darkorange', 'dodgerblue',

'hotpink', 'limegreen', 'yellow')

>palette(tropical)

>fpkm = texpr(bg_chrX,meas="FPKM")

>fpkm = log2(fpkm+1)

>boxplot(fpkm,col=as.numeric(pheno_data$sex),las=2,ylab='log2(FPKM+1)')

>ballgown::transcriptNames(bg_chrX)[12]

## 12

## "NM_012227"

>ballgown::geneNames(bg_chrX)[12]

## 12

## "GTPBP6"

>plot(fpkm[12,] ~ pheno_data$sex, border=c(1,2),

main=paste(ballgown::geneNames(bg_chrX)[12],' : ',

ballgown::transcriptNames(bg_chrX)[12]),pch=19, xlab="Sex",

ylab='log2(FPKM+1)')

>points(fpkm[12,] ~ jitter(as.numeric(pheno_data$sex)),

col=as.numeric(pheno_data$sex))

>plotTranscripts(ballgown::geneIDs(bg_chrX)[1729], bg_chrX, main=c('Gene XIST in sample ERR188234'), sample=c('ERR188234'))

>plotMeans('MSTRG.56', bg_chrX_filt,groupvar="sex",legend=FALSE)