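Getting Started with Elasticsearch: Installing the IK Analyzer from Scratch

Background

I needed to run aggregations in ES for statistical analysis, but the field being aggregated holds Chinese text, and the default ES analyzer handles Chinese poorly: it splits each Chinese word into individual characters and aggregates on those, which is clearly not what I wanted. Here is an example: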

POST http://192.168.80.133:9200/my_index_name/my_type_name/_search
{
	"size": 0,
	"query": {
		"range": {
			"time": {
				"gte": 1513778040000,
				"lte": 1513848720000
			}
		}
	},
	"aggs": {
		"keywords": {
			"terms": {"field": "keywords"},
			"aggs": {
				"emotions": {
					"terms": {"field": "emotion"}
				}
			}
		}
	}
}

Output:

{
	"took": 22,
	"timed_out": false,
	"_shards": {
		"total": 5,
		"successful": 5,
		"failed": 0
	},
	"hits": {
		"total": 32,
		"max_score": 0.0,
		"hits": []
	},
	"aggregations": {
		"keywords": {
			"doc_count_error_upper_bound": 0,
			"sum_other_doc_count": 0,
			"buckets": [
				{
					"key": "力",  # the complete word was split into individual characters
					"doc_count": 2,
					"emotions": {
						"doc_count_error_upper_bound": 0,
						"sum_other_doc_count": 0,
						"buckets": [
							{
								"key": -1,
								"doc_count": 1
							},
							{
								"key": 0,
								"doc_count": 1
							}
						]
					}
				},
				{
					"key": "动",
					"doc_count": 2,
					"emotions": {
						"doc_count_error_upper_bound": 0,
						"sum_other_doc_count": 0,
						"buckets": [
							{
								"key": -1,
								"doc_count": 1
							},
							{
								"key": 0,
								"doc_count": 1
							}
						]
					}
				}
			]
		}
	}
}
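
The effect in the output above can be reproduced in miniature without a cluster. The following is only a sketch: `char_tokens` is a hypothetical stand-in that mimics per-character splitting of Chinese text (a simplification of what the default analyzer does), and `terms_buckets` is an illustrative helper that counts documents per term in the way a terms aggregation reports `doc_count`.

```python
from collections import Counter


def char_tokens(text):
    """Hypothetical stand-in for per-character tokenization of Chinese text."""
    return list(text)


def terms_buckets(docs, tokenize):
    """Count, for each term, how many documents contain it (like doc_count)."""
    counts = Counter()
    for keywords in docs:
        terms = set()
        for keyword in keywords:
            terms.update(tokenize(keyword))
        # Each document contributes at most once per term.
        counts.update(terms)
    return counts


# Two illustrative documents, each with a list of keywords.
docs = [["动力", "外观"], ["动力", "油耗"]]

per_char = terms_buckets(docs, char_tokens)
whole_word = terms_buckets(docs, lambda keyword: [keyword])

print(per_char["动"], per_char["力"])  # buckets keyed by single characters
print(whole_word["动力"])              # bucket keyed by the whole word
```

With per-character tokens, the bucket keys are single characters such as "动" and "力", and the word "动力" never appears as a key; with whole-word tokens, "动力" is a bucket of its own.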

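Since the default ES analyzer handles Chinese so poorly, is there an analyzer that does support Chinese? And if so, how is it used?
For the first question, a quick Google search gave the answer: a Chinese-capable analyzer already exists, and it is open source: IK Analysis for Elasticsearch, see https://github.com/medcl/elasticsearch-analysis-ik.
Rather than reinvent the wheel, I decided to try it out first and see how well it works. So how is the IK analyzer used? It is an ES plugin: install it and configure ES accordingly.

Installing the IK analyzer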

My ES version is 2.4.1, so the IK version to download is 1.10.1 (note: the IK version must match the ES version, otherwise the plugin will not work).
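
As a rough sketch of the version check, the snippet below reads the ES version from the response of `GET /` (which carries it under `version.number`) and builds the IK download URL following the release URL pattern used in this article. `KNOWN_PAIRS`, `pick_ik_version`, and `ik_download_url` are hypothetical names, the table lists only the single pairing stated here, and `sample_root` is a hand-trimmed stand-in for a real response; other pairings should be taken from the IK project page.

```python
# Only the ES -> IK pairing mentioned in this article; extend it from the IK README.
KNOWN_PAIRS = {"2.4.1": "1.10.1"}

RELEASE_BASE = "https://github.com/medcl/elasticsearch-analysis-ik/releases/download"


def es_version_from_root(root_response):
    """Extract the version string from the JSON returned by GET / on the ES node."""
    return root_response["version"]["number"]


def pick_ik_version(root_response):
    """Return the matching IK version, or fail loudly if the pairing is unknown."""
    es_version = es_version_from_root(root_response)
    if es_version not in KNOWN_PAIRS:
        raise LookupError(f"no known IK version for ES {es_version}")
    return KNOWN_PAIRS[es_version]


def ik_download_url(ik_version):
    """Build the release URL, assuming the naming pattern used in this article."""
    return f"{RELEASE_BASE}/v{ik_version}/elasticsearch-analysis-ik-{ik_version}.zip"


# Hand-trimmed sample of a GET / response (assumption, not captured output).
sample_root = {"version": {"number": "2.4.1"}}
print(ik_download_url(pick_ik_version(sample_root)))
```

Failing on an unknown version, rather than guessing the nearest one, reflects the note above that a mismatched plugin will not work.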

1. Download and build IK

wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v1.10.1/elasticsearch-analysis-ik-1.10.1.zip
unzip elasticsearch-analysis-ik-1.10.1.zip
cd elasticsearch-analysis-ik-1.10.1
mvn clean package

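The packaged file elasticsearch-analysis-ik-1.10.1.zip is generated in the elasticsearch-analysis-ik-1.10.1/target/releases directory.

2. Install the IK plugin in ES

Copy the packaged IK plugin, elasticsearch-analysis-ik-1.10.1.zip, into the ES/plugins directory and unzip it there.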

unzip elasticsearch-analysis-ik-1.10.1.zip
rm -rf elasticsearch-analysis-ik-1.10.1.zip # be sure to delete the zip after unzipping, otherwise ES reports an error on startup

Restart ES.
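
One way to confirm that the plugin was loaded after the restart is to ask the `_analyze` API to tokenize a short phrase with `ik_smart`. The sketch below only builds the request and parses a response; `build_analyze_request` and `tokens_from_response` are hypothetical helper names, the host is the one used elsewhere in this article, and `sample_response` is a hand-written stand-in for what a working installation might return, not captured output.

```python
import json
import urllib.request


def build_analyze_request(base_url, analyzer, text):
    """Build a POST request for the _analyze API with a JSON body."""
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body.encode("utf-8"),  # send Chinese text as UTF-8 bytes
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_from_response(response):
    """Pull the token strings out of an _analyze response."""
    return [entry["token"] for entry in response["tokens"]]


request = build_analyze_request("http://192.168.80.133:9200", "ik_smart", "动力")
# To actually send it: json.load(urllib.request.urlopen(request))

# Hand-written sample of the response shape (assumption).
sample_response = {
    "tokens": [
        {"token": "动力", "start_offset": 0, "end_offset": 2, "position": 0}
    ]
}
print(tokens_from_response(sample_response))
```

If ES answers that it cannot find the `ik_smart` analyzer, the plugin was most likely not picked up at startup.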

Using the IK analyzer

Once the IK analyzer is installed, it can be used in ES.

Step 1: Create the index

PUT http://192.168.80.133:9200/my_index_name

Step 2: Add a mapping for the document field that will be used
The documents I store in ES have the following format:

{
    "nagtive_kw": [],
    "is_all": false,
    "emotion": 0,
    "focuce": false,
    "keywords": ["动力","外观","油耗"],  // the aggregation runs on the keywords field
    "source": "汽车之家",
    "time": -1,
    "machine_emotion": 0,
    "title": "no title",
    "spider": "qczj_index",
    "content": {},
    "url": "http://xxx",
    "brand": "宝马",
    "series": "宝马1系",
    "model": "2017款"
}

The aggregation runs on the keywords field, so add a mapping for it:

POST http://192.168.80.133:9200/my_index_name/my_type_name/_mapping
{
	"properties": {
		"keywords": { # configure the keywords field to use the IK analyzer
			"type": "string",
			"store": "no",
			"analyzer": "ik_smart",
			"search_analyzer": "ik_smart",
			"boost": 8
		}
	}
}

Note: there was a small hiccup while setting the mapping. Following the IK project page, I set the type of "keywords" to "text" and got an error:

POST http://192.168.80.133:9200/my_index_name/my_type_name/_mapping
{
	"properties": {
		"keywords": {
			"type": "text", # the text type is not supported in 2.4.1
			"store": "no",
			"analyzer": "ik_smart",
			"search_analyzer": "ik_smart",
			"boost": 8
		}
	}
}

Error:

{
	"error": {
		"root_cause": [
			{
				"type": "mapper_parsing_exception",
				"reason": "No handler for type [text] declared on field [keywords]"
			}
		],
		"type": "mapper_parsing_exception",
		"reason": "No handler for type [text] declared on field [keywords]"
	},
	"status": 400
}

This is because my ES version, 2.4.1, is fairly old: the text type was only added in ES 5.0, so it is not supported here. In ES 2.4.1, the string type must be used instead.
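
The version rule can be captured in a small helper that picks the field type from the ES major version. This is a sketch only: `keywords_mapping` is a hypothetical function, and it leaves out the `store` and `boost` settings from the mapping shown earlier to keep the example short.

```python
def keywords_mapping(es_version, analyzer="ik_smart"):
    """Build a mapping body for the keywords field.

    Uses "string" before ES 5.0 and "text" from 5.0 onwards.
    """
    major = int(es_version.split(".")[0])
    field_type = "text" if major >= 5 else "string"
    return {
        "properties": {
            "keywords": {
                "type": field_type,
                "analyzer": analyzer,
                "search_analyzer": analyzer,
            }
        }
    }


print(keywords_mapping("2.4.1")["properties"]["keywords"]["type"])
print(keywords_mapping("5.0.0")["properties"]["keywords"]["type"])
```

Generating the body from the cluster version avoids copying a mapping written for a newer release onto an older cluster, which is what triggered the error above.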

Step 3: Add a document

POST http://192.168.80.133:9200/my_index_name/my_type_name/
{
    "nagtive_kw": ["动力","外观","油耗"],
    "is_all": false,
    "emotion": 0,
    "focuce": false,
    "keywords": ["动力","外观","油耗"],  // the aggregation runs on the keywords field
    "source": "汽车之家",
    "time": -1,
    "machine_emotion": 0,
    "title": "从动次打次吃大餐",
    "spider": "qczj_index",
    "content": {},
    "url": "http://xxx",
    "brand": "宝马",
    "series": "宝马1系",
    "model": "2017款"
}
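
Hand-written request bodies are easy to get subtly wrong (a missing comma, or an inline comment that JSON does not allow), so one option is to build the document as a native structure and let a JSON library serialize it. The sketch below uses a trimmed subset of the fields above; the variable names are illustrative.

```python
import json

# A trimmed subset of the document fields shown above.
doc = {
    "keywords": ["动力", "外观", "油耗"],
    "emotion": 0,
    "source": "汽车之家",
    "brand": "宝马",
}

# ensure_ascii=False keeps the Chinese text readable in the payload;
# encoding to UTF-8 gives the bytes to send as the request body.
body = json.dumps(doc, ensure_ascii=False).encode("utf-8")

# Round-trip check: the payload parses back to the same document.
print(json.loads(body.decode("utf-8")) == doc)
```

The resulting `body` can be sent as the payload of the POST request shown in this step.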

Step 4: Run the aggregation

POST http://192.168.80.133:9200/my_index_name/my_type_name/_search
{
	"size": 0,
	"query": {
		"range": {
			"time": {
				"gte": 1513778040000,
				"lte": 1513848720000
			}
		}
	},
	"aggs": {
		"keywords": {
			"terms": {"field": "keywords"},
			"aggs": {
				"emotions": {
					"terms": {"field": "emotion"}
				}
			}
		}
	}
}

Output:

{
	"took": 22,
	"timed_out": false,
	"_shards": {
		"total": 5,
		"successful": 5,
		"failed": 0
	},
	"hits": {
		"total": 32,
		"max_score": 0.0,
		"hits": []
	},
	"aggregations": {
		"keywords": {
			"doc_count_error_upper_bound": 0,
			"sum_other_doc_count": 0,
			"buckets": [
				{
					"key": "动力",     # the complete word is no longer split into individual characters
					"doc_count": 2,
					"emotions": {
						"doc_count_error_upper_bound": 0,
						"sum_other_doc_count": 0,
						"buckets": [
							{
								"key": -1,
								"doc_count": 1
							},
							{
								"key": 0,
								"doc_count": 1
							}
						]
					}
				}
			]
		}
	}
}
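
For downstream statistics, the nested buckets are often easier to work with as flat rows. The following sketch flattens a response of the shape shown above into (keyword, emotion, doc_count) tuples; `flatten_buckets` is a hypothetical helper, and `sample` is a trimmed copy of the output above.

```python
def flatten_buckets(response):
    """Turn nested keywords/emotions buckets into (keyword, emotion, count) rows."""
    rows = []
    for keyword_bucket in response["aggregations"]["keywords"]["buckets"]:
        for emotion_bucket in keyword_bucket["emotions"]["buckets"]:
            rows.append(
                (keyword_bucket["key"], emotion_bucket["key"], emotion_bucket["doc_count"])
            )
    return rows


# Trimmed copy of the aggregation output shown above.
sample = {
    "aggregations": {
        "keywords": {
            "buckets": [
                {
                    "key": "动力",
                    "doc_count": 2,
                    "emotions": {
                        "buckets": [
                            {"key": -1, "doc_count": 1},
                            {"key": 0, "doc_count": 1},
                        ]
                    },
                }
            ]
        }
    }
}

print(flatten_buckets(sample))  # [('动力', -1, 1), ('动力', 0, 1)]
```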

[References]
http://www.cnblogs.com/xing901022/p/5910139.html How to install Chinese analyzers in Elasticsearch (IK + pinyin)
https://elasticsearch.cn/question/47 A question about aggregations (aggs)
https://github.com/medcl/elasticsearch-analysis-ik/issues/276 "No handler for type [text] declared on field [content]" when creating a mapping #276
http://blog.csdn.net/guo_jia_liang/article/details/52980716 Learning Elasticsearch 2.4 (3): Elasticsearch 2.4 plugin installation in detail

Original article: https://www.cnblogs.com/nuccch/p/8207261.html