A Deep Dive into the Jieba Segmentation Source Code (Trie Storage, Forward Matching, HMM, Viterbi)

I have recently been working on word segmentation and custom category labeling for short texts. While studying the Jieba source code, I found that this open-source segmentation library combines several algorithms: a Trie tree for storing the dictionary, an HMM with Viterbi decoding for sequence segmentation and tagging of the text, and a forward maximum matching algorithm.
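
To make the Trie storage idea concrete, here is a minimal sketch of a dictionary trie with prefix lookup. This is only an illustration, not jieba's actual implementation; the class names, method names, and example words and frequencies are my own.

    class TrieNode:
        # Children keyed by character; freq is set only when a dictionary word ends here.
        def __init__(self):
            self.children = {}
            self.freq = None

    class DictTrie:
        def __init__(self):
            self.root = TrieNode()

        def insert(self, word, freq):
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.freq = freq

        def prefix_matches(self, sentence, start):
            # Return the end indices of every dictionary word that begins at
            # position `start` -- the primitive that forward matching and
            # DAG construction both rely on.
            node, ends = self.root, []
            for i in range(start, len(sentence)):
                node = node.children.get(sentence[i])
                if node is None:
                    break
                if node.freq is not None:
                    ends.append(i)
            return ends

    trie = DictTrie()
    trie.insert(u"北京", 10)        # made-up frequencies, for illustration only
    trie.insert(u"北京大学", 5)
    print(trie.prefix_matches(u"北京大学生", 0))   # -> [1, 3]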

Once you understand how jieba segmentation works, you can build a more complete segmenter of your own through targeted optimizations.

Enough preamble; let's get into the source code analysis.

1. Opening the jieba source code

The __init__.py file contains the cut function we use most often: it uses a regular expression, chosen according to the user-specified mode, to split the input text into blocks, and then segments each block according to that mode (the default accurate mode calls the get_DAG function).

    def cut(self, sentence, cut_all=False, HMM=True):
        '''
        The main function that segments an entire sentence that contains
        Chinese characters into separated words.

        Parameter:
            - sentence: The str(unicode) to be segmented.
            - cut_all: Model type. True for full pattern, False for accurate pattern.
            - HMM: Whether to use the Hidden Markov Model.
        '''
        sentence = strdecode(sentence)

        if cut_all:
            re_han = re_han_cut_all
            re_skip = re_skip_cut_all
        else:
            re_han = re_han_default
            re_skip = re_skip_default
        if cut_all:
            cut_block = self.__cut_all
        elif HMM:
            cut_block = self.__cut_DAG  # cut based on the directed acyclic graph (DAG)
        else:
            cut_block = self.__cut_DAG_NO_HMM
        blocks = re_han.split(sentence)  # split the input text into blocks
        for blk in blocks:
            if not blk:
                continue
            if re_han.match(blk):
                for word in cut_block(blk):
                    yield word
            else:
                tmp = re_skip.split(blk)
                for x in tmp:
                    if re_skip.match(x):
                        yield x
                    elif not cut_all:
                        for xx in x:
                            yield xx
                    else:
                        yield x
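
For context, the three branches above are reached through the public jieba.cut API roughly as follows; the sample sentence is arbitrary, and the exact segmentation output depends on the dictionary shipped with your jieba version.

    import jieba

    text = u"我来到北京清华大学"
    # Accurate mode with HMM (the default): cut_block is __cut_DAG
    print("/".join(jieba.cut(text, cut_all=False, HMM=True)))
    # Accurate mode without HMM: cut_block is __cut_DAG_NO_HMM
    print("/".join(jieba.cut(text, cut_all=False, HMM=False)))
    # Full mode: cut_block is __cut_all
    print("/".join(jieba.cut(text, cut_all=True)))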

Each line of jieba's dictionary file dict (including user-defined dictionaries) has the format: word, word frequency, part-of-speech tag. The dictionary is loaded through the corpus initialization function initialize(dictionary), which reads the file and builds the in-memory dictionary structure used for segmentation.
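
As an illustration of that file format, the sketch below shows one way such a file could be parsed into a word-frequency table with prefix placeholders (jieba's prefix dictionary works along broadly similar lines). The function name and the details are my own simplification, not the actual initialize code.

    def load_dict(path):
        # Each line looks like "<word> <frequency> <POS tag>"; frequency and
        # tag may be omitted in user dictionaries.
        freq, total = {}, 0
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.strip().split()
                if not parts:
                    continue
                word = parts[0]
                count = int(parts[1]) if len(parts) > 1 else 1
                freq[word] = count
                total += count
                # Record every proper prefix with frequency 0 so that later
                # lookups can stop early on prefixes absent from the dictionary.
                for i in range(1, len(word)):
                    freq.setdefault(word[:i], 0)
        return freq, total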

Original article: https://www.cnblogs.com/AngelaSunny/p/5276336.html