解析器

IK分词器

各平台配置IK分词器

elasticsearch-analysis-ik releases:https://github.com/medcl/elasticsearch-analysis-ik/releases

  • Windows:在elasticsearch安装目录中的plugins目录内新建ik目录,将从GitHub下载的压缩包解压到ik目录内即可:
  • Mac:在elasticsearch安装目录中的plugins目录内新建ik目录,将从GitHub下载的压缩包解压到ik目录内即可:
  • Centos:cd到elasticsearch安装目录的plugings目录,下载并解压:
[root@cs ik]# wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.5.4/elasticsearch-analysis-ik-6.5.4.zip
[root@cs ik]# unzip  elasticsearch-analysis-ik-6.5.4.zip
[root@cs ik]# ll
total 5832
-rw-r--r--. 1 root root  263965 May  6  2018 commons-codec-1.9.jar
-rw-r--r--. 1 root root   61829 May  6  2018 commons-logging-1.2.jar
drwxr-xr-x. 2 root root    4096 Aug 26  2018 config
-rw-r--r--. 1 root root   54693 Dec 23  2018 elasticsearch-analysis-ik-6.5.4.jar
-rw-r--r--. 1 root root 4504539 Dec 23  2018 elasticsearch-analysis-ik-6.5.4.zip
-rw-r--r--. 1 root root  736658 May  6  2018 httpclient-4.5.2.jar
-rw-r--r--. 1 root root  326724 May  6  2018 httpcore-4.4.4.jar
-rw-r--r--. 1 root root    1805 Dec 23  2018 plugin-descriptor.properties
-rw-r--r--. 1 root root     125 Dec 23  2018 plugin-security.policy

测试

  • 首先将elascticsearch和kibana服务重启。
  • 然后浏览器地址栏输入http://localhost:5601(kibana地址),在Dev Tools中的Console界面的左侧输入命令,再点击绿色的执行按钮执行。
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "学不学的会靠天收"
}

ik目录简介

我们简要的介绍一下ik分词配置文件:

  • IKAnalyzer.cfg.xml,用来配置自定义的词库
  • main.dic,ik原生内置的中文词库,大约有27万多条,只要是这些单词,都会被分在一起。
  • surname.dic,中国的姓氏。
  • suffix.dic,特殊(后缀)名词,例如乡、江、所、省等等。
  • preposition.dic,中文介词,例如不、也、了、仍等等。
  • stopword.dic,英文停用词库,例如a、an、and、the等。
  • quantifier.dic,单位名词,如厘米、件、倍、像素等。

其他解析器

Simple Analyzer – 按照非字母切分(符号等被过滤),小写处理

GET _analyze
{
  "analyzer": "simple",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

执行结果

{
  "tokens" : [
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "foxes",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "dogs",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "in",
      "start_offset" : 48,
      "end_offset" : 50,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "the",
      "start_offset" : 51,
      "end_offset" : 54,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "summer",
      "start_offset" : 55,
      "end_offset" : 61,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "evening",
      "start_offset" : 62,
      "end_offset" : 69,
      "type" : "word",
      "position" : 11
    }
  ]
}

Stop Analyzer – 小写处理,停用词过滤(the,a,is)

GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

执行结果

{
  "tokens" : [
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "foxes",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "dogs",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "summer",
      "start_offset" : 55,
      "end_offset" : 61,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "evening",
      "start_offset" : 62,
      "end_offset" : 69,
      "type" : "word",
      "position" : 11
    }
  ]
}

剩下的解析器依次为

Whitespace Analyzer – 按照空格切分,不转小写

Keyword Analyzer – 不分词,直接将输入当作输出

Patter Analyzer – 正则表达式,默认 W+ (非字符分隔)

Language – 提供了30多种常见语言的分词器

#查看不同的analyzer的效果
#standard
GET _analyze
{
  "analyzer": "standard",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#simpe
GET _analyze
{
  "analyzer": "simple",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}


GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}


#stop
GET _analyze
{
  "analyzer": "whitespace",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#keyword
GET _analyze
{
  "analyzer": "keyword",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

GET _analyze
{
  "analyzer": "pattern",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}


#english
GET _analyze
{
  "analyzer": "english",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}


POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "他说的确实在理”"
}


POST _analyze
{
  "analyzer": "standard",
  "text": "他说的确实在理”"
}


POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "这个苹果不大好吃"
}

自律人的才是可怕的人
原文地址:https://www.cnblogs.com/lovelifest/p/14325480.html