05 Computing GC Content

Problem

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
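The definition above, and the remark about reverse complements, can be checked with a short sketch. It is written for Python 3 (the solutions further down use Python 2), and the helper names `gc_content` and `reverse_complement` are illustrative choices, not part of the problem statement.

```python
def gc_content(dna):
    """Return the percentage of symbols in dna that are 'G' or 'C'."""
    if not dna:
        raise ValueError("empty DNA string")
    dna = dna.upper()
    return 100.0 * (dna.count("G") + dna.count("C")) / len(dna)


def reverse_complement(dna):
    """Reverse the string and swap A<->T and C<->G."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(complement[base] for base in reversed(dna.upper()))


print(gc_content("AGCTATAG"))                      # 37.5
print(reverse_complement("AGCTATAG"))              # CTATAGCT
print(gc_content(reverse_complement("AGCTATAG")))  # 37.5
```

Complementing swaps every G with a C and vice versa, and reversing only reorders symbols, so the combined count of G and C is unchanged.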

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the next line that begins with '>' marks the label of the following string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
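As a small illustration of the label format, the sketch below (Python 3) extracts the ID from a header line and checks it against the "Rosalind_xxxx" shape. The function name `parse_header` and the choice to raise `ValueError` on a mismatch are assumptions made for this example.

```python
import re

# Assumed pattern for the IDs described above: "Rosalind_" plus exactly four digits.
ROSALIND_ID = re.compile(r"Rosalind_\d{4}")


def parse_header(line):
    """Return the ID from a FASTA header line, or raise ValueError."""
    line = line.strip()
    if not line.startswith(">"):
        raise ValueError("not a FASTA header: %r" % line)
    label = line[1:]
    if not ROSALIND_ID.fullmatch(label):
        raise ValueError("unexpected label: %r" % label)
    return label


print(parse_header(">Rosalind_6404"))  # Rosalind_6404
```

None of the solutions below validate the label; they simply take everything after '>' as the name, which is enough for Rosalind's datasets.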

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.
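One way to read the 0.001 allowance is as a bound on the absolute difference between the submitted and the expected value. The sketch below (Python 3) assumes that reading; the function name `within_tolerance` and the inclusive comparison are choices made for this example, not something the problem statement specifies.

```python
def within_tolerance(answer, expected, tolerance=0.001):
    """Return True if answer differs from expected by at most tolerance."""
    return abs(answer - expected) <= tolerance


print(within_tolerance(60.9195, 60.919540))  # True
print(within_tolerance(60.92, 60.919540))    # True
print(within_tolerance(60.9, 60.919540))     # False
```

In practice this means that printing the GC-content with six decimal places, as the sample output does, leaves a wide margin.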

Sample Dataset

>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample Output

Rosalind_0808
60.919540
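The sample output can be reproduced with a few lines of Python 3. This sketch skips file parsing and assumes the sequence lines of each record have already been joined into one string; the names `records`, `gc_content`, and `best_id` are illustrative.

```python
# Sample records with their sequence lines already joined into one string each.
records = {
    "Rosalind_6404": "CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC"
                     "TCCCACTAATAATTCTGAGG",
    "Rosalind_5959": "CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT"
                     "ATATCCATTTGTCAGCAGACACGC",
    "Rosalind_0808": "CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC"
                     "TGGGAACCTGCGGGCAGTAGGTGGAAT",
}


def gc_content(dna):
    """Percentage of symbols in dna that are 'G' or 'C'."""
    return 100.0 * sum(base in "GC" for base in dna) / len(dna)


# Pick the ID whose sequence has the highest GC-content.
best_id = max(records, key=lambda name: gc_content(records[name]))

print(best_id)                                        # Rosalind_0808
print("{:.6f}".format(gc_content(records[best_id])))  # 60.919540
```

Rosalind_0808 has 53 G or C symbols out of 87, and 53 / 87 is about 60.919540%.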


Method 1:

# -*- coding: utf-8 -*-


# to open FASTA format sequence file:
s=open('Computing_GC_Content.txt','r').readlines()

# to create two lists, one for names, one for sequences
name_list=[]
seq_list=[]

data='' # to put the sequence from several lines together

for line in s:
    line=line.strip()
    if line.startswith('>'):
        # A header line starts a new record.
        name_list.append(line[1:])
        if data:
            seq_list.append(data)         # store the concatenated lines of the previous record
            data=''                       # reset data for the next record
    else:
        data=data+line.upper()            # normalize to upper case and append this sequence line
seq_list.append(data)                         # is there a way to include the last sequence in the for loop?
GC_list=[]
for seq in seq_list:
    i=0
    for k in seq:
        if k=="G" or k=='C':
            i+=1
    GC_cont=float(i)/len(seq)*100.0
    GC_list.append(GC_cont)


m=max(GC_list)
print(name_list[GC_list.index(m)])             # look up the name at the index of the max GC content
print("{:0.6f}".format(m))                     # keep 6 decimal places
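Method 1 appends the last sequence after the loop and asks in a comment whether that can be handled inside the loop. One possible answer, sketched here in Python 3, is to let `itertools.groupby` split the lines into alternating runs of header lines and sequence lines, so that every record (including the last) is emitted by the same code path. The function name `read_fasta` and the IDs in the usage example are hypothetical.

```python
from itertools import groupby


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    stripped = (line.strip() for line in lines)
    non_empty = (line for line in stripped if line)
    name = None
    # groupby yields alternating runs of header lines and sequence lines.
    for is_header, group in groupby(non_empty, key=lambda l: l.startswith(">")):
        if is_header:
            # If several headers are adjacent, only the last one has a sequence.
            name = list(group)[-1][1:]
        else:
            yield name, "".join(group).upper()


# Hypothetical two-record input, used only to show the behavior.
sample = """>Rosalind_0001
ACGT
acgt
>Rosalind_0002
GGCC
""".splitlines()

for name, seq in read_fasta(sample):
    print(name, seq)
# Rosalind_0001 ACGTACGT
# Rosalind_0002 GGCC
```

Because the function accepts any iterable of lines, an open file object can be passed in directly instead of a list.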

Method 2:

# -*- coding: utf-8 -*-

def parse_fasta(s):
    results = {}
    strings = s.strip().split('>')
    # str.split() cuts the string at the given separator; if the optional
    # maxsplit argument is given, at most maxsplit splits are made

    for s in strings:
        if len(s) == 0:
            # Skip empty pieces (such as the text before the first '>')
            # and move on to the next one.
            continue

        parts = s.split()
        label = parts[0]
        bases = ''.join(parts[1:])

        results[label] = bases

    return results


def gc_content(s):
    n = len(s)
    m = 0

    for c in s:
        if c == 'G' or c == 'C':
            m += 1

    return 100 * (float(m) / n)


if __name__ == "__main__":

    small_dataset = """
    >Rosalind_6404
    CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
    TCCCACTAATAATTCTGAGG
    >Rosalind_5959
    CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
    ATATCCATTTGTCAGCAGACACGC
    >Rosalind_0808
    CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
    TGGGAACCTGCGGGCAGTAGGTGGAAT
    """

    #large_dataset = open('datasets/rosalind_gc.txt').read()

    results = parse_fasta(small_dataset)
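    results = dict([(k, gc_content(v)) for k, v in results.items()])
    # items() behaves like iteritems() here and also exists in Python 3
    # Before this line results maps name -> sequence; after it, name -> GC percentage


    highest_k = None
    highest_v = 0

    for k, v in results.items():
        if v > highest_v:
            highest_k = k
            highest_v = v
            # keep the record with the highest GC content seen so far
    print(highest_k)
    print('%f' % highest_v)   # six decimal places, no percent sign, as in the sample output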

Method 3:

# -*- coding: utf-8 -*-

### 5. Computing GC Content ###
from operator import itemgetter
from collections import OrderedDict

seqTest = OrderedDict()
gcContent = OrderedDict()

with open('Computing_GC_Content.txt', 'rt') as f:
    for line in f:
        line = line.rstrip()
        if line.startswith('>'):
            seqName = line[1:]
            seqTest[seqName] = ''
            continue
        seqTest[seqName] += line.upper()

for ke, val in seqTest.items():
    totalLength = len(val)
    gcNumber = val.count('G') + val.count('C')
    gcContent[ke] = (float(gcNumber) / totalLength)*100

sortedGCContent = sorted(gcContent.items(), key=itemgetter(1))
largeName = sortedGCContent[-1][0]
largeGCContent = sortedGCContent[-1][1]

print ('The sequence with the highest GC content is %s (%f)' % (largeName, largeGCContent))