The Fastest Chinese Tokenizer for Lucene 3 - zzzhc's Blog
Jun 19th, 2011
BaoBao Tokenizer (包包分词器) - a fast, dictionary-based Chinese tokenizer
source code
features
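- Simple: about 1,000 LOC
- Fast: 7M+ chars/second
- Handles Chinese, English, and numbers
- Automatically recognizes out-of-vocabulary (unregistered) words
- Supports OffsetAttribute
- Supports TypeAttribute
- Supports PositionIncrementAttribute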
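The description above says the tokenizer is dictionary-based but does not say which matching strategy it uses. As a rough illustration only, here is a minimal sketch of forward maximum matching, one common dictionary-based approach. It is not BaoBao's actual code: the class `MaxMatchSketch`, its `Token` type, and the three-word dictionary are all made up for this example. Each token carries start and end offsets, similar in spirit to what OffsetAttribute exposes. The string literals use Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of forward maximum matching over a word set.
// NOT BaoBao's implementation; all names here are illustrative.
final class MaxMatchSketch {
    /** A token plus its character offsets (start inclusive, end exclusive). */
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    private final Set<String> words;
    private final int maxLen; // length of the longest dictionary word

    MaxMatchSketch(Set<String> words) {
        this.words = words;
        int longest = 1;
        for (String w : words) {
            longest = Math.max(longest, w.length());
        }
        this.maxLen = longest;
    }

    /** At each position take the longest dictionary word; otherwise emit one char. */
    List<Token> tokenize(String text) {
        List<Token> out = new ArrayList<Token>();
        int pos = 0;
        while (pos < text.length()) {
            int end = pos + 1; // single-character fallback for unknown text
            for (int len = Math.min(maxLen, text.length() - pos); len > 1; len--) {
                if (words.contains(text.substring(pos, pos + len))) {
                    end = pos + len;
                    break;
                }
            }
            out.add(new Token(text.substring(pos, end), pos, end));
            pos = end;
        }
        return out;
    }

    public static void main(String[] args) {
        // Dictionary entries: "zhongwen", "fenci", "fenciqi" (written as escapes).
        Set<String> dict = new HashSet<String>(Arrays.asList(
                "\u4e2d\u6587", "\u5206\u8bcd", "\u5206\u8bcd\u5668"));
        // Input is the five characters "zhongwen fenciqi" with no spaces.
        System.out.println(new MaxMatchSketch(dict).tokenize("\u4e2d\u6587\u5206\u8bcd\u5668"));
    }
}
```

With the dictionary {中文, 分词, 分词器}, the input 中文分词器 is split into 中文 at offsets [0,2) and 分词器 at [2,5): the longer entry 分词器 wins over 分词. A real tokenizer would also need rules for English, numbers, and unknown words, which this sketch leaves out.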
usage
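    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;

    Dict dict = new Dict();
    dict.addAllSpecialTypes();
    // Load the word list from a UTF-8 encoded file, then close the reader.
    BufferedReader dictReader = new BufferedReader(new InputStreamReader(
            new FileInputStream("dict.txt"), "UTF-8"));
    dict.load(dictReader);
    dictReader.close();
    dict.optimize();
    // The analyzer is built on top of the loaded dictionary.
    DictAnalyzer dictAnalyzer = new DictAnalyzer(dict);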
benchmark
ant benchmark
    supported features:
                      CharTerm  Offset  PositionIncrement  Term  Type
    IKAnalyzer        Y         Y       N                  Y     N
    MMSegAnalyzer     Y         Y       N                  Y     Y
    PaodingAnalyzer   Y         Y       N                  Y     Y
    StandardAnalyzer  Y         Y       Y                  Y     Y
    BaoBaoAnalyzer    Y         Y       Y                  Y     Y

    test 1, sample length=26265
    name              chars    time   tokens   speed(chars/second)
    PaodingAnalyzer   26265    0.610  12542    43036.87
    MMSegAnalyzer     26265    0.314  14007    83566.52
    IKAnalyzer        26265    0.262  16016    100177.91
    StandardAnalyzer  26265    0.141  22366    185727.87
    BaoBaoAnalyzer    26265    0.038  18185    695682.16

    test 2, sample length=262650
    name              chars    time   tokens   speed(chars/second)
    PaodingAnalyzer   262650   0.187  125420   1402139.61
    IKAnalyzer        262650   0.163  160160   1613693.16
    MMSegAnalyzer     262650   0.158  140070   1664009.53
    BaoBaoAnalyzer    262650   0.041  181850   6362134.44
    StandardAnalyzer  262650   0.020  223660   12905789.80

    test 3, sample length=2626500
    name              chars    time   tokens   speed(chars/second)
    IKAnalyzer        2626500  2.251  1601600  1166564.72
    PaodingAnalyzer   2626500  1.462  1254200  1796381.55
    MMSegAnalyzer     2626500  1.043  1400700  2519010.94
    BaoBaoAnalyzer    2626500  0.352  1818500  7458959.20
    StandardAnalyzer  2626500  0.202  2236600  13015280.16
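The speed column above is characters divided by elapsed time, with the time column read as seconds. The sketch below shows that arithmetic and a simple speed ratio. It is a hypothetical helper for reading the numbers, not part of the benchmark code, and `ThroughputSketch` and its methods are assumed names. Because the printed times are rounded to three decimals, recomputing a speed from the table gives a value close to, but not exactly equal to, the reported one.

```java
// Hypothetical helper for reading the benchmark output; not from the BaoBao source.
final class ThroughputSketch {
    /** Characters processed per second, given elapsed wall-clock nanoseconds. */
    static double charsPerSecond(long chars, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            throw new IllegalArgumentException("elapsedNanos must be positive");
        }
        return chars * 1e9 / elapsedNanos;
    }

    /** Ratio of throughput a to throughput b (how many times faster a is). */
    static double speedup(double a, double b) {
        return a / b;
    }

    public static void main(String[] args) {
        // Test 3, BaoBaoAnalyzer: 2626500 chars in a reported 0.352 s.
        double baobao = charsPerSecond(2626500L, 352000000L);
        System.out.printf("%.2f chars/second%n", baobao);
        // Ratio against the MMSegAnalyzer speed reported in the same test.
        System.out.printf("%.2fx%n", speedup(7458959.20, 2519010.94));
    }
}
```

Recomputed from the rounded time, the BaoBaoAnalyzer figure for test 3 comes out at about 7.46M chars/second, in line with the reported 7458959.20, and the reported speeds put it at roughly 3x MMSegAnalyzer on that sample.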