16 Finding a Protein Motif

Problem

To allow for the presence of its varying forms, a protein motif is represented by a shorthand as follows: [XY] means "either X or Y" and {X} means "any amino acid except X." For example, the N-glycosylation motif is written as N{P}[ST]{P}.
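The shorthand maps almost directly onto regular-expression character classes: [XY] is already a valid class, and {X} becomes the negated class [^X]. A minimal sketch of that translation in Python follows; the helper name `motif_to_regex` is hypothetical and not part of the problem statement.

```python
import re

def motif_to_regex(motif):
    """Translate motif shorthand into a regex pattern string.

    [XY] is kept as-is; {X} is rewritten as the negated class [^X].
    motif_to_regex is a hypothetical helper used only for illustration.
    """
    return re.sub(r"\{([A-Z]+)\}", r"[^\1]", motif)

pattern = motif_to_regex("N{P}[ST]{P}")
print(pattern)  # N[^P][ST][^P]
```

The resulting string can be passed to `re.compile` unchanged.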

You can see the complete description and features of a particular protein by its access ID "uniprot_id" in the UniProt database, by inserting the ID number into

http://www.uniprot.org/uniprot/uniprot_id

Alternatively, you can obtain a protein sequence in FASTA format by following

http://www.uniprot.org/uniprot/uniprot_id.fasta

For example, the data for protein B5ZC00 can be found at http://www.uniprot.org/uniprot/B5ZC00.
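A FASTA record consists of a header line starting with ">" followed by the sequence wrapped over several lines. The sketch below builds the URL from the pattern above and turns FASTA text into a single protein string. `fasta_url` and `parse_fasta` are hypothetical helper names, and the sample record is made up for illustration rather than taken from UniProt.

```python
def fasta_url(uniprot_id):
    """Build the FASTA URL for an access ID using the pattern given above."""
    return "http://www.uniprot.org/uniprot/" + uniprot_id + ".fasta"

def parse_fasta(text):
    """Return the sequence of a single-record FASTA string as one line."""
    lines = text.strip().splitlines()
    # Skip the header line (it starts with '>') and join the rest.
    return "".join(line.strip() for line in lines if not line.startswith(">"))

# Made-up record, only to show the shape of the data.
sample = ">sp|X00000|EXAMPLE made-up header\nMKNLTA\nGGNPSA\n"
print(fasta_url("B5ZC00"))
print(parse_fasta(sample))  # MKNLTAGGNPSA
```

Keeping the parsing separate from the download makes it possible to check the parser on a local string without any network access.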

Given: At most 15 UniProt Protein Database access IDs.

Return: For each protein possessing the N-glycosylation motif, output its given access ID followed by a list of locations in the protein string where the motif can be found.
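Two details in this requirement are easy to miss: locations are 1-based, and occurrences of the motif may overlap (the sample output lists both 115 and 116 for P07204_TRBM_HUMAN). One way to handle both, sketched here with a hypothetical helper `motif_locations`, is to wrap the pattern in a zero-width lookahead so that a match consumes no characters.

```python
import re

# The whole motif sits inside a lookahead, so matches never consume text
# and overlapping occurrences are all reported.
MOTIF = re.compile(r"(?=N[^P][ST][^P])")

def motif_locations(seq):
    """Return 1-based start positions of every motif occurrence in seq."""
    return [m.start() + 1 for m in MOTIF.finditer(seq)]

# "NNTSA" is a made-up string with two overlapping occurrences.
print(motif_locations("NNTSA"))  # [1, 2]
print(motif_locations("NPTSA"))  # []
```

A plain `N[^P][ST][^P]` pattern would report only position 1 for "NNTSA", because the first match consumes the N that starts the second occurrence.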

Sample Dataset

A2Z669
B5ZC00
P07204_TRBM_HUMAN
P20840_SAG1_YEAST

Sample Output

B5ZC00
85 118 142 306 395
P07204_TRBM_HUMAN
47 115 116 382 409
P20840_SAG1_YEAST
79 109 135 248 306 348 364 402 485 501 614

# coding=utf-8
# Python 3 version: urllib2 is replaced by urllib.request, print is a function.
import re
import urllib.request

# Access IDs from the sample dataset ("list" would shadow the built-in).
ids = ['A2Z669', 'B5ZC00', 'P07204_TRBM_HUMAN', 'P20840_SAG1_YEAST']

# N{P}[ST]{P}: only the N is consumed; the rest sits in a lookahead,
# so overlapping occurrences of the motif are all found.
regex = re.compile(r'N(?=[^P][ST][^P])')

for one in ids:
    name = one.strip('\n')
    url = 'http://www.uniprot.org/uniprot/' + name + '.fasta'
    response = urllib.request.urlopen(url)
    the_page = response.read().decode('utf-8')

    # The FASTA header ends at the first newline; the sequence follows it.
    start = the_page.find('\n')
    seq = the_page[start + 1:].replace('\n', '')

    # Pad with one leading character so that match offsets are 1-based.
    seq = ' ' + seq
    out = [m.start() for m in regex.finditer(seq)]

    # Proteins without the motif are skipped.
    if out:
        print(name)
        print(' '.join(str(i) for i in out))

  

Original article: https://www.cnblogs.com/think-and-do/p/7283840.html