The N-Gram Data Structure

The ARPA n-gram format looks like this:

\data\
ngram 1=64000  
ngram 2=522530  
ngram 3=173445  
  
\1-grams:
-5.24036        'cause  -0.2084827  
-4.675221       'em     -0.221857  
-4.989297       'n      -0.05809768  
-5.365303       'til    -0.1855581  
-2.111539       </s>    0.0  
-99     <s>     -0.7736475  
-1.128404       <unk>   -0.8049794  
-2.271447       a       -0.6163939  
-5.174762       a's     -0.03869072  
-3.384722       a.      -0.1877073  
-5.789208       a.'s    0.0  
-6.000091       aachen  0.0  
-4.707208       aaron   -0.2046838  
-5.580914       aaron's -0.06230035  
-5.789208       aarons  -0.07077657  
-5.881973       aaronson        -0.2173971  

For a detailed description, see: the ARPA n-gram language model format. In each entry line, the columns are the log10 probability, the word sequence, and an optional log10 back-off weight.
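To make the header concrete, here is a minimal sketch (assumed code, not the toolkit's own reader) that scans the \data\ section and collects the entry count for each order; the function name and stream handling are illustrative choices:

#include <cstdio>
#include <istream>
#include <string>
#include <vector>

// Minimal sketch (assumed code, not the toolkit's): read the "\data\"
// header of an ARPA file and collect the entry count for each order.
std::vector<int> readArpaHeader( std::istream &in )
{
    std::vector<int> counts ;   // counts[n-1] = number of n-grams of order n
    std::string line ;

    // Skip everything up to the "\data\" marker.
    while ( std::getline( in , line ) && line != "\\data\\" )
        ;

    // Lines of the form "ngram N=COUNT" follow, until a blank line.
    while ( std::getline( in , line ) && !line.empty() )
    {
        int order , count ;
        if ( std::sscanf( line.c_str() , "ngram %d=%d" , &order , &count ) == 2 )
        {
            if ( (int)counts.size() < order )
                counts.resize( order , 0 ) ;
            counts[order-1] = count ;
        }
    }
    return counts ;   // for the sample above: {64000, 522530, 173445}
}

These per-order counts are exactly what the n_ngrams array described later stores.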

The whole ARPA-LM is made up of many n-gram entries. The two data structures involved, the single entry and the overall model, are described in turn below.

1. The n-gram entry data structure

The data structure for a single n-gram entry is as follows:

typedef float real ;     // assumption: the toolkit typedefs "real" to its floating-point type

typedef struct
{
    real    log_prob ;   // log10 conditional probability of this n-gram
    real    log_bo ;     // log10 back-off weight of this n-gram
    int     *words ;     // indices of the words making up this n-gram
} ARPALMEntry ;

words: the indices of the words making up this n-gram; a 1-gram holds a single index, a 2-gram holds the indices of both words, and so on.
log_bo: the back-off weight of the n-gram, stored as a log10 value.
log_prob: the conditional probability of the n-gram (its last word given the preceding words), stored as a log10 value.
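As an illustration, here is a hedged sketch of how the sample 1-gram line for "a" could be loaded into one ARPALMEntry; the word index 7 is only assumed (counting the words of the sample in file order), and a real loader would ask the Vocabulary for the word's ID instead:

// Sketch: fill one ARPALMEntry from the sample 1-gram line
//   "-2.271447       a       -0.6163939"
ARPALMEntry makeSampleEntry()
{
    ARPALMEntry e ;
    e.log_prob = -2.271447f ;    // log10 P(a)
    e.log_bo   = -0.6163939f ;   // log10 back-off weight of "a"
    e.words    = new int[1] ;    // a 1-gram stores a single word index
    e.words[0] = 7 ;             // assumed vocabulary index of "a"
    return e ;
}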

2. The ARPA-LM data structure

The data structure for the whole n-gram language model, composed of many such entries, is as follows:

class ARPALM
{
    public:
        Vocabulary     *vocab ;

        int            order ;
        ARPALMEntry    **entries ;    // all entries of the model, one array per order
        int            *n_ngrams ;    // number of entries for each order (1-grams, 2-grams, ...)

        char           *unk_wrd ;     // vocabulary word that may be absent from the LM
        int            unk_id ;       // its word ID, assigned the last index in the vocabulary

        int            n_unk_words ;
        int            *unk_words ;
    private:
        bool           *words_in_lm ; // per-word flag: does the word appear in the LM?
} ;

vocab: a pointer to the vocabulary used to build the language model. For its definition, see: the in-memory vocabulary storage layout.
entries: all n-gram entries of the model, a two-dimensional array of ARPALMEntry. entries[0] holds the 1-grams, entries[1] the 2-grams, and so on.
n_ngrams: an integer array giving, in order, the number of 1-gram, 2-gram, 3-gram, ... entries.
unk_wrd: the vocabulary word that is allowed to be absent from the language model.
unk_id: the ID of that word, assigned the last word index in the vocabulary.
n_unk_words: counted after the model has been read; the number of words that are in the vocabulary but were not used to build the language model. If unk_wrd was not specified, such words are not allowed: every vocabulary word must then appear in the language model.
unk_words: stores the IDs of the words counted by n_unk_words above.
words_in_lm: a boolean array flagging whether each vocabulary word appears in the language model.
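To show how log_prob and log_bo work together at query time, here is a minimal sketch of the standard ARPA back-off rule for a bigram; find1Gram and find2Gram are hypothetical helpers standing in for whatever search the class really performs over entries:

// Hypothetical helpers: search entries[0] / entries[1] for a matching
// words[] sequence; their implementation is not shown here.
ARPALMEntry *find1Gram( ARPALM *lm , int w ) ;
ARPALMEntry *find2Gram( ARPALM *lm , int w1 , int w2 ) ;

// Standard ARPA back-off rule for a bigram (w1, w2):
//   log10 P(w2|w1) = log_prob(w1 w2)            if the bigram exists,
//                  = log_bo(w1) + log_prob(w2)  otherwise.
real bigramLogProb( ARPALM *lm , int w1 , int w2 )
{
    ARPALMEntry *bi = find2Gram( lm , w1 , w2 ) ;
    if ( bi != NULL )
        return bi->log_prob ;

    // Back off: w1's penalty plus w2's unigram probability. Every word
    // of the LM has a 1-gram entry; out-of-LM words map to unk_id first.
    ARPALMEntry *uni1 = find1Gram( lm , w1 ) ;
    ARPALMEntry *uni2 = find1Gram( lm , w2 ) ;
    real bo = ( uni1 != NULL ) ? uni1->log_bo : 0.0f ;
    return bo + uni2->log_prob ;
}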

Original article: https://www.cnblogs.com/jonky/p/10154115.html