Adding the ik Chinese analyzer to Elasticsearch (v2.4.6)

1. References

ik GitHub documentation

Switching the Maven repository to the Aliyun mirror in China
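As a rough sketch of what the mirror change can look like, the snippet below writes a minimal Maven settings file that routes requests for the central repository through Aliyun's public repository. The mirror id `aliyun`, the `SETTINGS_FILE` variable, and the choice to mirror only `central` are assumptions of this example rather than details from the referenced post. An existing settings file is backed up first.

```shell
# Target file; defaults to the per-user Maven settings location
SETTINGS_FILE="${SETTINGS_FILE:-$HOME/.m2/settings.xml}"
mkdir -p "$(dirname "$SETTINGS_FILE")"

# Keep a backup if a settings file already exists
if [ -f "$SETTINGS_FILE" ]; then
  cp "$SETTINGS_FILE" "$SETTINGS_FILE.bak"
fi

# Write a minimal settings file containing only the mirror entry
cat > "$SETTINGS_FILE" <<'EOF'
<settings>
  <mirrors>
    <mirror>
      <id>aliyun</id>
      <mirrorOf>central</mirrorOf>
      <name>Aliyun public mirror</name>
      <url>https://maven.aliyun.com/repository/public</url>
    </mirror>
  </mirrors>
</settings>
EOF
```

If the existing settings file already holds other configuration, it is probably safer to add the mirror entry by hand than to overwrite the file.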

2. Building and installing analysis-ik

2.1 Download the source

git clone --depth 1 --branch v1.10.6 https://github.com/medcl/elasticsearch-analysis-ik.git

ES 2.4.6 corresponds to ik v1.10.6, so only that tag is cloned.

2.2 Build

(1) Download and install Maven

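# Download the Maven binary archive
wget https://mirror.olnevhost.net/pub/apache/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz

# Create the install directory and extract the archive into it
mkdir -p /usr/local/maven

tar -zxvf apache-maven-3.6.3-bin.tar.gz --directory /usr/local/maven

# Environment variables (add these lines to /etc/profile)
export JAVA_HOME=/home/java/jdk1.8.0_131
export MAVEN_HOME=/usr/local/maven/apache-maven-3.6.3

export PATH=$PATH:$JAVA_HOME/bin:$MAVEN_HOME/bin

# Reload the profile so the variables take effect in the current shell
source /etc/profile

# Check the version
mvn --version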

(2) Build ik

# Build
cd elasticsearch-analysis-ik/

mvn package

# Install the built package into the plugins directory

cd target/releases/

mkdir -p /home/elastic/elasticsearch-2.4.6/plugins/ik/

cp elasticsearch-analysis-ik-1.10.6.zip /home/elastic/elasticsearch-2.4.6/plugins/ik/

cd /home/elastic/elasticsearch-2.4.6/plugins/ik/

unzip elasticsearch-analysis-ik-1.10.6.zip

2.3 Restart the ES service
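Once the node is back up, one way to confirm that the plugin was picked up is to look at the `_cat/plugins` listing. The sketch below assumes the node listens on 127.0.0.1:9200 (the address used in the tests in the next section) and that the plugin reports itself under the name `analysis-ik`. `has_ik` and `check_ik` are helper names made up for this example.

```shell
ES_URL="${ES_URL:-http://127.0.0.1:9200}"

# Succeeds if the plugin listing read from stdin mentions analysis-ik
has_ik() {
  grep -qi 'analysis-ik'
}

# Fetch the plugin listing from the node and report whether ik appears in it
check_ik() {
  if curl -s "$ES_URL/_cat/plugins?v" | has_ik; then
    echo "ik plugin loaded"
  else
    echo "ik plugin not found"
  fi
}
```

Running `check_ik` after the restart prints one of the two messages. If the plugin is not found, the Elasticsearch startup log is the next place to look.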

3. Testing ik tokenization

3.1 Built-in Chinese tokenization (default analyzer)

# Request

GET http://127.0.0.1:9200/_analyze
{
	"text": "正是江南好风景"
}

# Response
{
  "tokens": [
    {
      "token": "正",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "是",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "江",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "南",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "好",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "风",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<IDEOGRAPHIC>",
      "position": 5
    },
    {
      "token": "景",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<IDEOGRAPHIC>",
      "position": 6
    }
  ]
}
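The request above is written in REST-client notation. From a shell, the same call could be made with curl, roughly as sketched below. `build_body` and `analyze` are helper names invented here, and the body is assembled with plain printf, so it only suits text that contains no quotes or backslashes.

```shell
ES_URL="${ES_URL:-http://127.0.0.1:9200}"

# Build the JSON body for _analyze
# $1 = analyzer name (empty string for the default analyzer), $2 = text
build_body() {
  if [ -n "$1" ]; then
    printf '{"analyzer":"%s","text":"%s"}' "$1" "$2"
  else
    printf '{"text":"%s"}' "$2"
  fi
}

# Send the body to the _analyze endpoint and pretty-print the response
analyze() {
  curl -s -XPOST "$ES_URL/_analyze?pretty" -d "$(build_body "$1" "$2")"
}
```

For example, `analyze "" "正是江南好风景"` sends the default-analyzer request, while `analyze ik_max_word "正是江南好风景"` and `analyze ik_smart "正是江南好风景"` correspond to the requests in 3.2 and 3.3.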

3.2 ik's ik_max_word analyzer

# Request

GET http://127.0.0.1:9200/_analyze
{
	"analyzer": "ik_max_word",
	"text": "正是江南好风景"
}


# Response
{
  "tokens": [
    {
      "token": "正是",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "江南",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "江",
      "start_offset": 2,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "南",
      "start_offset": 3,
      "end_offset": 4,
      "type": "CN_CHAR",
      "position": 3
    },
    {
      "token": "好",
      "start_offset": 4,
      "end_offset": 5,
      "type": "CN_CHAR",
      "position": 4
    },
    {
      "token": "风景",
      "start_offset": 5,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "景",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 6
    }
  ]
}

3.3 ik's ik_smart analyzer

# Request

GET http://127.0.0.1:9200/_analyze
{
	"analyzer": "ik_smart",
	"text": "正是江南好风景"
}

# Response
{
  "tokens": [
    {
      "token": "正是",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "江南",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "好",
      "start_offset": 4,
      "end_offset": 5,
      "type": "CN_CHAR",
      "position": 2
    },
    {
      "token": "风景",
      "start_offset": 5,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 3
    }
  ]
}

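3.4 Comparison

(1) The default analyzer splits Chinese text into single characters, which does not suit most use cases.

(2) ik_max_word produces the finest-grained segmentation, while ik_smart does the opposite and produces the coarsest-grained segmentation.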
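One way this difference might be put to use is to index a field with ik_max_word (more tokens, better recall) and analyze queries with ik_smart (fewer, coarser tokens). The sketch below assumes ES 2.x mapping syntax with the `string` field type. The index name `ik_demo`, the type `article`, the field `content`, and the helper names are all hypothetical.

```shell
ES_URL="${ES_URL:-http://127.0.0.1:9200}"

# Mapping body: fine-grained tokens at index time, coarse-grained at query time
mapping_body() {
  cat <<'EOF'
{
  "mappings": {
    "article": {
      "properties": {
        "content": {
          "type": "string",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        }
      }
    }
  }
}
EOF
}

# Create the (hypothetical) index with that mapping
create_index() {
  curl -s -XPUT "$ES_URL/ik_demo?pretty" -d "$(mapping_body)"
}
```

Calling `create_index` against a running node creates the index. Whether this pairing suits a given dataset is something to check against real queries.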