24. Implementing index-time search suggestions with ngram tokenization

I. How ngram and index-time search suggestions work

1. What is an ngram

Take the word quick. Its ngrams at each of the five possible lengths are:

ngram length=1: q u i c k

ngram length=2: qu ui ic ck

ngram length=3: qui uic ick

ngram length=4: quic uick

ngram length=5: quick
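The table above can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper name ngrams is made up for this example and is not part of Elasticsearch.

```python
def ngrams(word: str, n: int) -> list[str]:
    """Return every contiguous substring of length n, from left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


# Print the ngrams of "quick" for each length from 1 to 5.
for n in range(1, 6):
    print(f"ngram length={n}:", " ".join(ngrams("quick", n)))
```

A length larger than the word itself simply yields an empty list.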

An edge ngram is an ngram anchored at the first letter of the word, in other words a prefix of the word. The word quick is split as follows:

q
qu
qui
quic
quick

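With edge ngrams, every word is split further into its prefixes, and those prefix terms are what make prefix search suggestions possible. At search time there is no need to take a prefix and scan the whole inverted index; the prefix is simply looked up as a term in the inverted index, and once it matches no further scanning is done. This works the same way as a match full-text query.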
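To make the idea concrete, here is a minimal Python sketch of an inverted index that stores edge ngrams. It is a toy model under stated assumptions: words are split on whitespace and lowercased (a rough stand-in for the standard tokenizer plus lowercase filter), and the names edge_ngrams, build_index, suggest and the sample documents are invented for this example.

```python
from collections import defaultdict


def edge_ngrams(word, min_gram=1, max_gram=20):
    """Return the prefixes of word whose length lies in [min_gram, max_gram]."""
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]


def build_index(docs):
    """Index time: store every edge ngram of every word as its own term."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            for gram in edge_ngrams(word):
                index[gram].add(doc_id)
    return index


def suggest(index, prefix):
    """Search time: one exact term lookup, no scan over the other terms."""
    return sorted(index.get(prefix.lower(), set()))


# Hypothetical sample documents.
docs = {1: "quick brown fox", 2: "quiet night", 3: "brown bear"}
index = build_index(docs)

print(suggest(index, "qui"))  # [1, 2]
print(suggest(index, "br"))   # [1, 3]
```

The cost is paid at index time: the index holds many more terms, but each suggestion is a single dictionary lookup.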

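2. What is index-time

Index-time search suggestion means that the inverted index used for suggestions is built when the documents are indexed, so nothing has to be built from the prefix at search time.

min ngram = 1 sets the minimum number of letters in a suggestion term; for hello, the shortest term is h.

max ngram = 3 sets the maximum number of letters in a suggestion term; for hello, splitting stops at hel, so hell is never produced.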
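The effect of the two bounds can be sketched in Python. The helper edge_ngrams below is hypothetical; it only imitates the behaviour described above and is not the Elasticsearch implementation.

```python
def edge_ngrams(word: str, min_gram: int, max_gram: int) -> list[str]:
    """Return the prefixes of word with length between min_gram and max_gram."""
    longest = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, longest + 1)]


print(edge_ngrams("hello", 1, 3))   # ['h', 'he', 'hel']
print(edge_ngrams("hello", 1, 20))  # ['h', 'he', 'hel', 'hell', 'hello']
```

With max ngram = 3, the term hell is never written to the index, so a user who has already typed hell would find no matching term. This is presumably why the experiment below uses a much larger max_gram.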

II. Experiment

1. Create the index

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}

2. Inspect the generated tokens

GET /my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "quick brown"
}

3. Add the mapping for the field to be searched

PUT /my_index/_mapping/my_type
{
  "properties": {
    "title": {
      "type": "string",
      "analyzer": "autocomplete",
      "search_analyzer": "standard"
    }
  }
}

4. Run the suggestion search

GET /my_index/my_type/_search
{
  "query": {
    "match_phrase": {
      "title": "hello w"
    }
  }
}

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "title": "hello w"
    }
  }
}

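With match, documents that contain only hello are returned as well, because it is a full-text query; they just get a lower score.

match_phrase is the recommended choice: it requires every term to be present, with positions exactly adjacent (1 apart), which is what we want here.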
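The difference between the two queries can be imitated with a small Python model. It is a simplified sketch, not how Elasticsearch is implemented: it assumes whitespace splitting and lowercasing, assumes every edge ngram keeps the position of the word it came from, returns only true or false instead of a relevance score, and all function names and sample titles are made up for this example.

```python
def edge_ngrams(word, min_gram=1, max_gram=20):
    """Return the prefixes of word whose length lies in [min_gram, max_gram]."""
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]


def index_terms(text):
    """Index time: map each edge ngram to the positions of the words it came from."""
    positions = {}
    for pos, word in enumerate(text.lower().split()):
        for gram in edge_ngrams(word):
            positions.setdefault(gram, set()).add(pos)
    return positions


def match(doc_terms, query):
    """match (default OR): true if at least one query term is indexed."""
    return any(term in doc_terms for term in query.lower().split())


def match_phrase(doc_terms, query):
    """match_phrase: every term present, at consecutive positions, in order."""
    terms = query.lower().split()
    if not terms or any(term not in doc_terms for term in terms):
        return False
    return any(
        all(start + i in doc_terms[term] for i, term in enumerate(terms))
        for start in doc_terms[terms[0]]
    )


# Hypothetical titles; the query is left as typed, like the standard search_analyzer.
for title in ["hello world", "hello", "world hello"]:
    terms = index_terms(title)
    print(title, "|", match(terms, "hello w"), "|", match_phrase(terms, "hello w"))
```

In this model all three titles satisfy match, but only hello world satisfies match_phrase, because w must sit at the position right after hello.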