Trying Out the Stanford Word Segmenter

First, download the Stanford Word Segmenter distribution.

After downloading, the folder looks like this:

[Screenshot: contents of the downloaded folder]

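Next, open Eclipse, create a new project, copy the source file SegDemo.java into it, and add all the jar files to the build path (right-click the project, Properties, Java Build Path, Add External JARs).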

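Import the data folder into the project and change the path in the source code to match, as shown:

[Screenshot: data folder in the project and the modified path in the source]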
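
A wrong path is the most likely failure at this step, so a quick check before loading the model can save some time. The following is only a minimal sketch and not part of the Stanford distribution: the class name DataDirCheck and the helper missingFiles are hypothetical, and the two file names are simply the ones that SegDemo.java refers to (ctb.gz and dict-chris6.ser.gz).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the Stanford segmenter.
public class DataDirCheck {

  // File names referenced by SegDemo.java; adjust if your distribution differs.
  private static final String[] REQUIRED = {"ctb.gz", "dict-chris6.ser.gz"};

  /** Returns the paths of required files that are missing from basedir. */
  public static List<String> missingFiles(String basedir) {
    List<String> missing = new ArrayList<>();
    for (String name : REQUIRED) {
      Path p = Paths.get(basedir, name);
      if (!Files.isRegularFile(p)) {
        missing.add(p.toString());
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    // Same lookup as SegDemo: system property "SegDemo", defaulting to "data".
    String basedir = System.getProperty("SegDemo", "data");
    List<String> missing = missingFiles(basedir);
    if (missing.isEmpty()) {
      System.out.println("All model files found in " + basedir);
    } else {
      System.out.println("Missing files: " + missing);
    }
  }
}
```

If the data folder lives somewhere else, passing -DSegDemo=/path/to/data as a VM argument should work for both this check and SegDemo itself, since both read the same system property.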

Then edit SegDemo.java and run it as a test:

package test;
import java.io.*;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;


/** This is a very simple demo of calling the Chinese Word Segmenter
 *  programmatically.  It assumes an input file in UTF8.
 *  <p/>
 *  <code>
 *  Usage: java -mx1g -cp seg.jar SegDemo fileName
 *  </code>
 *  This will run correctly in the distribution home directory.  To
 *  run in general, the properties for where to find dictionaries or
 *  normalizations have to be set.
 *
 *  @author Christopher Manning
 */

public class SegDemo {

  private static final String basedir = System.getProperty("SegDemo", "data");

  public static void main(String[] args) throws Exception {
    System.setOut(new PrintStream(System.out, true, "utf-8"));

    Properties props = new Properties();
    props.setProperty("sighanCorporaDict", basedir);
    // props.setProperty("NormalizationTable", "data/norm.simp.utf8");
    // props.setProperty("normTableEncoding", "UTF-8");
    // below is needed because CTBSegDocumentIteratorFactory accesses it
    props.setProperty("serDictionary", basedir + "/dict-chris6.ser.gz");
    if (args.length > 0) {
      props.setProperty("testFile", args[0]);
    }
    props.setProperty("inputEncoding", "UTF-8");
    props.setProperty("sighanPostProcessing", "true");

    CRFClassifier<CoreLabel> segmenter = new CRFClassifier<>(props);
    segmenter.loadClassifierNoExceptions(basedir + "/ctb.gz", props);
    for (String filename : args) {
      segmenter.classifyAndWriteAnswers(filename);
    }

    String sample = "我住在美国。";
    List<String> segmented = segmenter.segmentString(sample);
    System.out.println(segmented);
  }

}

Output: [我, 住在, 美国, 。]
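
The bracketed form above is just how Java prints a List. If you would rather have the conventional space-separated segmentation, the tokens can be joined afterwards. This is a small sketch under assumptions: TokenJoin and joinTokens are hypothetical names, and the hard-coded list stands in for the value returned by segmenter.segmentString(sample), so that the snippet runs without the Stanford jars.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of the Stanford segmenter.
public class TokenJoin {

  /** Joins the tokens with single spaces. */
  public static String joinTokens(List<String> tokens) {
    return String.join(" ", tokens);
  }

  public static void main(String[] args) {
    // Stand-in for the list returned by segmenter.segmentString(sample).
    List<String> segmented = Arrays.asList("我", "住在", "美国", "。");
    System.out.println(joinTokens(segmented));
  }
}
```

In SegDemo itself, the same effect should be achievable by replacing the last println with System.out.println(String.join(" ", segmented)).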

From here, feel free to experiment on your own.

Original post: https://www.cnblogs.com/kuqs/p/5435574.html