有时候需要比较大的序列文件里提取出它的子集,通常是根据序列号来操作的,上代码:
#存入字典 def read_to_dict(in_file): sequence = {} ac = '' seq = '' for line in open(in_file): if line.startswith('>') and seq != '': sequence[ac] = seq seq = '' if line.startswith('>'): ac = line.strip().split()[1] #处理序列key命名,视情况而定 else: seq = seq + line.strip() sequence[ac] = seq return sequence #待提取序列号读取入列表 def read_to_list(in_file2): list2 = [] with open(in_file2, 'r', encoding='utf-8') as list1: for line in list1: list2.append(line.strip()) return list2 #提取 other_features = read_to_dict('other_genomic.fasta') list = read_to_list('search_results.txt') handle = open("selected.txt", "w") for acc in list: if acc in other_features.keys(): handle.write('>' + acc + ' ' + other_features[acc] + ' ') else: print(acc) handel.close()
原文件:
>ABC102 ABC102 SPXID:S000121252, Chr I from 608-776, Genome Release 55-0-1, ABC, “Autonomously Replicating Sequence”
TATACACTTATGTCAATATTACAGAAAAATCCCCACAAAAATCACCTAAACATAAAAATATTCTACTTTT
…
待提取序列:
ABC301
ABC305
ABC306
ABC901
…
输出结果:
>ABC301
TATACACTTATGTCAATATTACAGAAAAATCCCCACAAAAATCACCTAAACATAAAAATATTCTACTTTT
…
BioPython里也有类似的操作,但是输出效果不好,不如上面的灵活。
>>> from Bio import SeqIO >>> uniprot = SeqIO.index("other_genomic.fasta", "fasta") >>> handle = open("selected.txt", "w") >>> for acc in ["ABC301", "ABC305", "ABC306", "ABC901", ... "ABC998"]: ... handle.write(uniprot.get_raw(acc)) >>> handle.close()
原文链接:https://blog.csdn.net/liuninghua521/article/details/106780991