Open reading frame (ORF) prediction for DNA sequences

Common ORF prediction tools

Basic concepts

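An open reading frame (ORF) is a stretch of sequence that, in a given reading frame, contains no stop codon. Such a stretch is a part of an organism's genome that could potentially code for a protein. Within a gene, the ORF lies between the start codon and the stop codon. Because a DNA or RNA sequence can be read in several different frames, many different ORFs may exist in the same sequence. Some computer programs can pick out the sequences most likely to be protein-coding.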

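Key points:

1. A stretch of sequence that contains no stop codon;

2. A part of the sequence that could potentially code for a protein;

3. A sequence can be read in several different frames, so many different ORFs may exist at the same time;

4. Some tools use BLAST alignments to increase confidence in the prediction.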
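
To make points 1 to 3 concrete, here is a minimal Python sketch that scans all six reading frames of a DNA sequence (three per strand). It rests on several assumptions that are mine, not the tool's: an ORF is taken to run from an "ATG" to the next in-frame stop codon, the stop codons are those of the standard genetic code, and the names find_orfs, reverse_complement and min_len are made up for illustration. It is not how ORFfinder itself is implemented.

```python
# Sketch only: ATG-to-stop ORFs under the standard genetic code (assumption).
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=75):
    """Return (strand, frame, start, end, orf) tuples.

    start/end are 0-based, end-exclusive, and refer to the strand that was
    scanned (for "-" that is the reverse complement, not the input).
    The ORF length includes the stop codon.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG seen since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_len:
                        results.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None
    return results


# A toy sequence; min_len is lowered because the sequence is very short.
for orf in find_orfs("CCATGAAATTTGGGTAACC", min_len=9):
    print(orf)
```

Since the scan keeps only the first ATG after each stop codon, it reports one (the longest) ORF per stop and ignores alternative downstream starts.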

Example

Take the sequence 5'-UCUAAAGGUCCA-3'. Read in this direction, it has three possible reading frames:

UCU AAA GGU CCA
CUA AAG GUC
UAA AGG UCC

Because UAA is a stop codon, the third frame has no potential to be translated into a protein, so only the first two are open reading frames.
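
The same example can be reproduced with a few lines of Python. This is only a sketch: the function names reading_frames and is_open are hypothetical, and the stop codons assumed are UAA, UAG and UGA from the standard genetic code.

```python
# Stop codons of the standard genetic code, in RNA letters (assumption).
STOP_CODONS = {"UAA", "UAG", "UGA"}


def reading_frames(rna):
    """Split a sequence into complete codons for each of the three offsets."""
    frames = []
    for offset in range(3):
        # Trailing bases that do not fill a whole codon are dropped.
        codons = [rna[i:i + 3] for i in range(offset, len(rna) - 2, 3)]
        frames.append(codons)
    return frames


def is_open(codons):
    """A frame counts as open here if none of its codons is a stop codon."""
    return not any(codon in STOP_CODONS for codon in codons)


frames = reading_frames("UCUAAAGGUCCA")
for number, codons in enumerate(frames, start=1):
    status = "open" if is_open(codons) else "contains a stop codon"
    print(number, " ".join(codons), status)
# 1 UCU AAA GGU CCA open
# 2 CUA AAG GUC open
# 3 UAA AGG UCC contains a stop codon
```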

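Personally, I would of course recommend the tool developed by the folks at NCBI (ORFfinder); it carries more credibility when you publish.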

Here is the help text for the Linux version of the tool:

USAGE
  ORFfinder [-h] [-help] [-xmlhelp] [-in Input_File] [-id Accession_GI]
    [-b begin] [-e end] [-c circular] [-g Genetic_code] [-s Start_codon]
    [-ml minimal_length] [-n nested_ORFs] [-strand Strand] [-out Output_File]
    [-outfmt output_format] [-logfile File_Name] [-conffile File_Name]
    [-version] [-version-full] [-dryrun]

DESCRIPTION
   Searching open reading frames in a sequence

OPTIONAL ARGUMENTS
 -h
   Print USAGE and DESCRIPTION;  ignore all other parameters
 -help
   Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters
 -xmlhelp
   Print USAGE, DESCRIPTION and ARGUMENTS in XML format; ignore all other
   parameters
 -logfile <File_Out>
   File to which the program log should be redirected
 -conffile <File_In>
   Program's configuration (registry) data file
 -version
   Print version number;  ignore other arguments
 -version-full
   Print extended version data;  ignore other arguments
 -dryrun
   Dry run the application: do nothing, only test all preconditions

 *** Input query options (one of them has to be provided):
 -in <File_In>
   name of file with the nucleotide sequence in FASTA format
   (more than one sequence is allowed)
   Default = `'
 -id <String>
   Accession or gi number of the nucleotide sequence
   (ignored, if the file name is provided)
   Default = `'

 *** Query sequence details:
 -b <Integer>
   Start address of sequence fragment to be processed
   Default = `1'
 -e <Integer>
   Stop address of sequence fragment to be processed (0 - to the end of the
   sequence)
   Default = `0'
 -c <Boolean>
   Is the sequence circular? (t/f) *** Under development
   Default = `false'

 *** Search parameters:
 -g <Integer>
   Genetic code to use (1-31)
   see https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi for details
   Default = `1'
 -s <Integer>
   ORF start codon to use:
       0 = "ATG" only
       1 = "ATG" and alternative initiation codons
       2 = any sense codon
   Default = `1'
 -ml <Integer>
   Minimal length of the ORF (nt)
   Value less than 30 is automatically changed by 30.
   Default = `75'
 -n <Boolean>
   Ignore nested ORFs (completely placed within another)
   Default = `false'
 -strand <String>
   Output ORFs on specified strand only (both|plus|minus)
   Default = `both'

 *** Output options:
 -out <File_Out>
   Output file name
 -outfmt <Integer>
   Output options:
       0 = list of ORFs in FASTA format
       1 = CDS in FASTA format
       2 = Text ASN.1
       3 = Feature table
   Default = `0'

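Example run, accepting any sense codon as a start (-s 2), requiring ORFs of at least 100 nt (-ml 100), and writing a feature table (-outfmt 3):

ORFfinder -in in.fasta -s 2 -ml 100 -out test.out -outfmt 3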
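
If you want to call the tool from a script, the command can be assembled in Python. The sketch below only uses flags listed in the help text above; the helper names build_orffinder_cmd and run_orffinder are made up for illustration, and it assumes the ORFfinder binary is on your PATH when you actually run it.

```python
import shutil
import subprocess


def build_orffinder_cmd(in_fasta, out_file, start_codon=1, min_len=75, outfmt=0):
    """Build the argument list; defaults mirror those in the help text."""
    return [
        "ORFfinder",
        "-in", in_fasta,
        "-s", str(start_codon),
        "-ml", str(min_len),
        "-out", out_file,
        "-outfmt", str(outfmt),
    ]


def run_orffinder(cmd):
    """Run the command, failing early if the binary cannot be found."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    subprocess.run(cmd, check=True)


# Same settings as the command line above; this only prints the command.
cmd = build_orffinder_cmd("in.fasta", "test.out", start_codon=2, min_len=100, outfmt=3)
print(" ".join(cmd))
```

Calling run_orffinder(cmd) would then execute it. Passing the arguments as a list instead of a single shell string avoids quoting problems with unusual file names.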