I've recently been working on word segmentation and custom category tagging for short texts. While digging into the jieba source code, I found that this open-source segmentation library combines several algorithms: a Trie tree for dictionary storage, HMM and Viterbi for sequence segmentation and tagging, and forward maximum matching.
Once you understand how jieba segmentation works, you can build a more complete segmenter on top of it through targeted optimization.
Without further ado, let's walk through the source code.
1. Opening the jieba source code
The __init__.py file contains the cut function we all use: it first splits the input text into blocks with a regex chosen by the user-specified mode, then segments each block with the cutter for that mode (the default accurate mode goes through the get_DAG function).
def cut(self, sentence, cut_all=False, HMM=True):
    '''
    The main function that segments an entire sentence that contains
    Chinese characters into separated words.

    Parameter:
        - sentence: The str(unicode) to be segmented.
        - cut_all: Model type. True for full pattern, False for accurate pattern.
        - HMM: Whether to use the Hidden Markov Model.
    '''
    sentence = strdecode(sentence)

    if cut_all:
        re_han = re_han_cut_all
        re_skip = re_skip_cut_all
    else:
        re_han = re_han_default
        re_skip = re_skip_default
    if cut_all:
        cut_block = self.__cut_all
    elif HMM:
        cut_block = self.__cut_DAG  # segment via the directed acyclic graph (DAG)
    else:
        cut_block = self.__cut_DAG_NO_HMM
    blocks = re_han.split(sentence)  # split the text into blocks
    for blk in blocks:
        if not blk:
            continue
        if re_han.match(blk):
            for word in cut_block(blk):
                yield word
        else:
            tmp = re_skip.split(blk)
            for x in tmp:
                if re_skip.match(x):
                    yield x
                elif not cut_all:
                    for xx in x:
                        yield xx
                else:
                    yield x
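To make the DAG-based path concrete, here is a minimal sketch of what a get_DAG-style function computes: for each start position in the sentence, the list of end positions where a dictionary word begins there. This is not jieba's actual code; the tiny FREQ dictionary below is made up for illustration, and the freq-0-marks-a-prefix convention mirrors how jieba's prefix dictionary distinguishes word prefixes from real words.

```python
# Hypothetical prefix dictionary: real words have freq > 0,
# entries with freq 0 are prefixes of longer words only.
FREQ = {"去": 1, "北": 1, "北京": 5, "北京大": 0, "北京大学": 10,
        "京": 1, "大": 1, "大学": 4, "学": 1}

def get_dag(sentence, freq):
    """For each start index k, collect every end index i such that
    sentence[k:i+1] is a word in the dictionary."""
    dag = {}
    n = len(sentence)
    for k in range(n):
        ends = []
        i = k
        frag = sentence[k]
        while i < n and frag in freq:
            if freq[frag] > 0:        # skip prefix-only entries
                ends.append(i)
            i += 1
            frag = sentence[k:i + 1]  # extend the candidate word by one char
        if not ends:
            ends.append(k)            # fall back to the single character
        dag[k] = ends
    return dag

print(get_dag("去北京大学", FREQ))
# → {0: [0], 1: [1, 2, 4], 2: [2], 3: [3, 4], 4: [4]}
```

Position 1 illustrates the idea: starting at 北, the words 北, 北京, and 北京大学 all exist, so the DAG records three outgoing edges, and __cut_DAG later picks the most probable path through these edges.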
Each line of a user-defined dictionary file (dict) in jieba has the format: word, frequency, part-of-speech tag. The dictionary is loaded through the corpus-initialization function initialize(dictionary).
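A minimal sketch of parsing that line format, under the assumption that the POS tag is optional and a missing frequency defaults to 1 (the function name and sample lines are hypothetical; jieba's real loader additionally records every prefix of each word with frequency 0, which is omitted here):

```python
def parse_dict_lines(lines):
    """Parse 'word freq [pos]' lines into {word: freq} and a total count."""
    freq, total = {}, 0
    for line in lines:
        parts = line.strip().split()
        if not parts:
            continue  # skip blank lines
        word = parts[0]
        f = int(parts[1]) if len(parts) > 1 else 1  # default freq: an assumption
        freq[word] = f
        total += f
    return freq, total

# Hypothetical sample lines in the dict format described above.
sample = ["北京大学 2053 nt", "大学 20025 n", "去 123402 v"]
freq, total = parse_dict_lines(sample)
print(freq["大学"], total)  # → 20025 145480
```

The total frequency matters because jieba normalizes each word's frequency by the total when scoring paths through the DAG.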