pybedtools 提取序列

前言

pybedtools 是封装了BEDTools 所有可用的程序。下文学习下pybedtools 如何通过bed文件的坐标提取对应序列

正文

对pybedtools还不了解的参考下这篇文章在Python中使用BEDTools。

提取序列的方法在BEDTools 中的命令是bedtools getfasta, 在pybedtools中是BedTool.sequence方法。

第一步是创建BEDTool实例, 作为例子，这里用字符串来生成：

import pybedtools

a = pybedtools.BedTool("""
chr1 1 10 feature1
chr1 50 55 feature2""", from_string=True)

# 若从bed文件中生成则
# bed = pybedtools.BedTool(bed_file_path)

指定fasta文件地址

## fasta 是一个软件自带的测试fasta文件地址
fasta = pybedtools.example_filename('test.fa')
a = a.sequence(fi=fasta)
print(open(a.seqfn).read())

输出

>chr1:1-10
GATGAGTCT
>chr1:50-55
CCATC

保存文件，可以通过设置fo参数设置保存地址

a = a.sequence(fi=fasta, fo="a1.fa")

也可以使用save_seqs()方法

a.save_seqs("a2.fa")

还有个参数nameOnly,值得一说。之前提取的"a1.fa"是这样的：

>chr1:1-10
GATGAGTCT
>chr1:50-55
CCATC

nameOnly 参数可以让fasta的header变为bed 的name 一列

a.sequence(fi=fasta, fo="a3.fa", nameOnly=True)

a3序列：

>feature1
GATGAGTCT
>feature2
CCATC

而name参数，是name + 坐标，见下面参数文档

Original BEDTools help::

Tool:    bedtools getfasta (aka fastaFromBed)
Version: v2.30.0
Summary: Extract DNA sequences from a fasta file based on feature coordinates.

Usage:   bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>

Options: 
        -fi             Input FASTA file
        -fo             Output file (opt., default is STDOUT
        -bed            BED/GFF/VCF file of ranges to extract from -fi
        -name           Use the name field and coordinates for the FASTA header
        -name+          (deprecated) Use the name field and coordinates for the FASTA header
        -nameOnly       Use the name field for the FASTA header
        -split          Given BED12 fmt., extract and concatenate the sequences
                        from the BED "blocks" (e.g., exons)
        -tab            Write output in TAB delimited format.
        -bedOut         Report extract sequences in a tab-delimited BED format instead of in FASTA format.
                        - Default is FASTA format.
        -s              Force strandedness. If the feature occupies the antisense,
                        strand, the sequence will be reverse complemented.
                        - By default, strand information is ignored.
        -fullHeader     Use full fasta header.
                        - By default, only the word before the first space or tab 
                        is used.
        -rna    The FASTA is RNA not DNA. Reverse complementation handled accordingly.

参考

pybedtools.bedtool.BedTool.sequence