Mismatch positions (MD tag) - advanced notes on the SAM/BAM format

This is the second post in the series. The previous one was: Edit Distance (NM tag) - advanced notes on the SAM/BAM format

The MD tag is a string that encodes where the mismatches are in an alignment. It appears to be useful when calling SNPs and indels.

Here, though, I only want to use it to count the number of mismatches.

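import re

# `line` is assumed to be one aligned read (e.g. a pysam AlignedSegment) that carries an MD tag
MD = line.get_tag('MD')
# digits followed by reference bases mark mismatches; deletions ("^AC") never match this pattern
pat = "[0-9]+[ATGC]+"
MD_list = re.findall(pat, MD)
# total_mismatch_MD must be initialised to 0 before the loop over reads
for i in MD_list:
        for j in i:
                if j in 'ATGC':
                        total_mismatch_MD += 1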

A few lines of code and it's done~
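To try the counting idea without a BAM file at hand, here is a minimal standalone sketch that works on the MD string alone. The helper name `count_md_mismatches` is hypothetical (it is not from any library), and unlike the snippet above it counts any uppercase reference letter, including N, as a mismatch; bases listed after `^` are deletions and are skipped.

```python
import re

# One token is either a run of digits (matches), a deletion ("^" plus
# reference bases), or a single reference base (a mismatch).
_MD_TOKEN = re.compile(r"(\d+)|(\^[A-Z]+)|([A-Z])")


def count_md_mismatches(md: str) -> int:
    """Count mismatched bases in an MD string (hypothetical helper)."""
    count = 0
    for _matches, _deletion, mismatch in _MD_TOKEN.findall(md):
        if mismatch:  # only the third group is a mismatch
            count += 1
    return count


print(count_md_mismatches("10A5^AC6"))  # 1
print(count_md_mismatches("3C0T12"))    # 2
```

Tokenising the whole string, rather than searching for digit-plus-letter runs, makes it explicit that the deleted bases are recognised and deliberately ignored.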

 

Hmm, is this post a bit too thin?

All right, let's dig a little deeper.

First, have a look at this article: SAM/BAM MD tag

The MD field aims to achieve SNP/indel calling without looking at the reference. For example, a string "10A5^AC6" means from the leftmost reference base in the alignment, there are 10 matches followed by an A on the reference which is different from the aligned read base; the next 5 reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are matches. The MD field ought to match the CIGAR string.
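The quoted example can be walked through in code. The following is a rough sketch, not a reference implementation: `parse_md`, `md_reference_length` and `cigar_reference_length` are hypothetical names, and the consistency check assumes (as I understand the SAM specification) that MD covers exactly the reference bases consumed by the CIGAR operations M, =, X and D, while insertions, clipping and N skips do not appear in MD.

```python
import re

_MD_TOKEN = re.compile(r"(\d+)|\^([A-Z]+)|([A-Z])")
_CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_md(md):
    """Split an MD string into (kind, value) operations."""
    ops = []
    for run, deleted, base in _MD_TOKEN.findall(md):
        if run:
            if int(run) > 0:  # "0" only separates adjacent events
                ops.append(("match", int(run)))
        elif deleted:
            ops.append(("deletion", deleted))
        else:
            ops.append(("mismatch", base))
    return ops


def md_reference_length(md):
    """Number of reference bases described by the MD string."""
    total = 0
    for kind, value in parse_md(md):
        if kind == "match":
            total += value
        elif kind == "mismatch":
            total += 1
        else:  # deletion: value holds the deleted reference bases
            total += len(value)
    return total


def cigar_reference_length(cigar):
    """Reference bases covered by M, =, X and D (assumed to be what MD describes)."""
    return sum(int(n) for n, op in _CIGAR_TOKEN.findall(cigar) if op in "M=XD")


print(parse_md("10A5^AC6"))
# [('match', 10), ('mismatch', 'A'), ('match', 5), ('deletion', 'AC'), ('match', 6)]
print(md_reference_length("10A5^AC6") == cigar_reference_length("16M2D6M"))  # True
```

For "10A5^AC6" the MD string describes 10 + 1 + 5 + 2 + 6 = 24 reference bases, which is what a CIGAR such as 16M2D6M covers (16 aligned bases, a 2 bp deletion, 6 aligned bases). That is one simple sense in which the MD field "ought to match" the CIGAR string.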

Original post: https://www.cnblogs.com/leezx/p/6074826.html