如何提取大序列文件的子集

有时候需要比较大的序列文件里提取出它的子集,通常是根据序列号来操作的,上代码:

#存入字典
def read_to_dict(in_file):
    sequence = {}
    ac = ''
    seq = ''
    for line in open(in_file):
        if line.startswith('>') and seq != '':
            sequence[ac] = seq
            seq = ''
        if line.startswith('>'):
            ac = line.strip().split()[1] #处理序列key命名,视情况而定
        else:
            seq = seq + line.strip()
    sequence[ac] = seq
    return sequence

#待提取序列号读取入列表
def read_to_list(in_file2):
    list2 = []
    with open(in_file2, 'r', encoding='utf-8') as list1:
        for line in list1:
            list2.append(line.strip())
    return list2

#提取
other_features = read_to_dict('other_genomic.fasta')
list = read_to_list('search_results.txt')
handle = open("selected.txt", "w")
for acc in list:
    if acc in other_features.keys():
        handle.write('>' + acc + '
' + other_features[acc] + '
')
    else:
        print(acc)
handel.close()

原文件:

>ABC102 ABC102 SPXID:S000121252, Chr I from 608-776, Genome Release 55-0-1, ABC, “Autonomously Replicating Sequence”
TATACACTTATGTCAATATTACAGAAAAATCCCCACAAAAATCACCTAAACATAAAAATATTCTACTTTT

待提取序列:

ABC301
ABC305
ABC306
ABC901

输出结果:

>ABC301
TATACACTTATGTCAATATTACAGAAAAATCCCCACAAAAATCACCTAAACATAAAAATATTCTACTTTT

BioPython里也有类似的操作,但是输出效果不好,不如上面的灵活。

>>> from Bio import SeqIO
>>> uniprot = SeqIO.index("other_genomic.fasta", "fasta")
>>> handle = open("selected.txt", "w")
>>> for acc in ["ABC301", "ABC305", "ABC306", "ABC901", ... "ABC998"]:
...     handle.write(uniprot.get_raw(acc))
>>> handle.close()


原文链接:https://blog.csdn.net/liuninghua521/article/details/106780991

原文地址:https://www.cnblogs.com/LQZ888/p/13153724.html