Elasticsearch In Depth, Part 4

Indexing a field twice to fix string sorting

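Sorting directly on an analyzed string field usually gives wrong results: the field is broken into multiple terms at index time, so the sort compares individual terms rather than the original string.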

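The usual solution is to index the string field twice: once analyzed, for search, and once not analyzed, for sorting. In the mapping below, title is the analyzed text field and title.raw is a non-analyzed keyword sub-field. Setting fielddata to true additionally builds an in-memory forward index over the analyzed title terms; it is only needed if you sort or aggregate on title itself, not on title.raw.

PUT /website
{
    "mappings":{
        "article":{
            "properties":{
                "title":{
                    "type":"text",
                    "fields":{
                        "raw":{
                            "type":"keyword"
                        }
                    },
                    "fielddata":true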
                },
                "content":{
                    "type":"text"
                },
                "post_date":{
                    "type":"date"
                },
                "author_id":{
                    "type":"long"
                }
            }
        }
    }
}

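Sort on title.raw rather than on title: sorting on title compares the analyzed terms and can give unexpected results, while title.raw holds the original, non-analyzed string.

GET /website/article/_search
{
    "query":{
        "match_all":{}
    },
    "sort":[
        {
            "title.raw":{
                "order":"desc"
            }
        }
    ]
}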
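
To make the difference concrete, here is a minimal Python sketch, not Elasticsearch code. It assumes a toy analyzer that lowercases and splits on whitespace, and it models an ascending sort on an analyzed field as comparing the smallest term of each document; both are simplifications made for illustration.

```python
def analyze(text):
    """Very rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


titles = ["zebra apple", "banana cake"]

# Sorting on the analyzed field compares individual terms; in this model an
# ascending sort uses the smallest term of each document.
by_tokens = sorted(titles, key=lambda t: min(analyze(t)))

# Sorting on the non-analyzed sub-field compares the whole original string.
by_raw = sorted(titles)

print(by_tokens)  # ['zebra apple', 'banana cake']
print(by_raw)     # ['banana cake', 'zebra apple']
```

"zebra apple" sorts first by terms because of the term apple, even though the full title starts with z. That mismatch is what the raw sub-field avoids.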

Relevance scoring: the TF/IDF algorithm explained

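1. Algorithm overview

A relevance score measures how well the text in an indexed document matches the search text.

Elasticsearch's classic scoring model is term frequency/inverse document frequency, TF/IDF for short. (Since 5.0 the default similarity is BM25, which is built on the same three factors described here.)

Term frequency: how many times each term of the search text appears in the field. The more often it appears, the more relevant the document.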

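Search request: hello world

doc1: hello you, and world is very good
doc2: hello, how are you

doc1 is more relevant: it contains both hello and world, while doc2 contains only hello.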

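Inverse document frequency: how often each term of the search text appears across all documents in the index. The more common a term is in the index, the less a match on it contributes to relevance.

Search request: hello world

doc1: hello, today is very good
doc2: hi world, how are you

Suppose the index holds 10,000 documents; hello occurs 1,000 times across them, while world occurs only 100 times.

doc2 is more relevant, because it matches the rarer term world.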

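Field-length norm: the length of the field. The longer the field, the weaker the relevance of a match in it.

Search request: hello world

doc1: { "title": "hello article", "content": "babaaba (10,000 words)" }
doc2: { "title": "my article", "content": "blablabala (10,000 words), hi world" }

Assume hello and world occur equally often across the index.

doc1 is more relevant: its match is in title, which is a much shorter field than content.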
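
The three factors can be combined in a small Python sketch. It follows the general shape of Lucene's classic TF/IDF similarity (square-root term frequency, logarithmic IDF, inverse square-root length norm) but leaves out query normalization, coordination and boosts, so treat it as an illustration rather than the exact score Elasticsearch reports. The function names are made up for this example, and the counts from the text are treated as document frequencies.

```python
import math


def tf(freq):
    # Term frequency: more occurrences in the field -> higher score.
    return math.sqrt(freq)


def idf(num_docs, doc_freq):
    # Inverse document frequency: rarer terms -> higher score.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms):
    # Field-length norm: longer fields -> lower score.
    return 1.0 / math.sqrt(num_terms)


def score(query_terms, field_terms, num_docs, doc_freqs):
    norm = field_norm(len(field_terms))
    total = 0.0
    for term in query_terms:
        freq = field_terms.count(term)
        if freq:
            total += tf(freq) * idf(num_docs, doc_freqs[term]) ** 2 * norm
    return total


query = ["hello", "world"]

# IDF example: hello is common (1,000 docs), world is rare (100 docs).
freqs = {"hello": 1000, "world": 100}
idf_doc1 = score(query, "hello today is very good".split(), 10000, freqs)
idf_doc2 = score(query, "hi world how are you".split(), 10000, freqs)
print(idf_doc2 > idf_doc1)  # True: the rarer term wins

# Field-length example: same term rarity, very different field lengths.
same = {"hello": 100, "world": 100}
short_title = score(query, ["hello", "article"], 10000, same)
long_content = score(query, ["filler"] * 10000 + ["hi", "world"], 10000, same)
print(short_title > long_content)  # True: the shorter field wins
```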

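To see how a score was computed, ask Elasticsearch to explain it, either for a single document or for every hit of a search:

GET /people/man/111/_explain
{
    "query":{
        "match":{
            "name":"ajax"
        }
    }
}

GET /people/man/_search?explain
{
    "query":{
        "match":{
            "name":"ajax"
        }
    }
}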

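Internals: a first look at doc values

Searching relies on the inverted index. Sorting relies on a forward index, which lets Elasticsearch look up the value of a field for each document and then sort on it. This forward index is what Elasticsearch calls doc values.

At index time Elasticsearch builds both structures: the inverted index for search, and the forward index (doc values) for sorting, aggregations, filtering and similar operations.

Doc values are stored on disk. If there is enough memory, the OS caches them in memory automatically and performance stays high; if there is not, the OS evicts them from its cache and they are read from disk again when needed.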

doc1: hello world you and me
doc2: hi, world, how are you

word    doc1    doc2
hello   *
world   *       *
you     *       *
and     *
me      *
hi              *
how             *
are             *

Search for "hello you", which is analyzed into the terms hello and you:

hello matches doc1
you matches doc1, doc2

Sort by age:

doc1: { "name": "jack", "age": 27 }
doc2: { "name": "tom", "age": 30 }

document    name    age
doc1        jack    27
doc2        tom     30
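
The two structures can be imitated with plain Python dictionaries. This is only a conceptual sketch: the tokenizer is a naive stand-in, and real doc values are a compressed on-disk columnar format rather than a dict.

```python
from collections import defaultdict

docs = {
    "doc1": "hello world you and me",
    "doc2": "hi, world, how are you",
}


def tokenize(text):
    # Naive analyzer: drop commas, lowercase, split on whitespace.
    return text.replace(",", " ").lower().split()


# Inverted index: term -> set of documents containing it (used for search).
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in tokenize(text):
        inverted[term].add(doc_id)


def search(query):
    hits = set()
    for term in tokenize(query):
        hits |= inverted.get(term, set())
    return sorted(hits)


# Forward index (doc values): document -> field value (used for sorting).
age_doc_values = {"doc1": 27, "doc2": 30}


def sort_by_age(doc_ids, descending=False):
    return sorted(doc_ids, key=lambda d: age_doc_values[d], reverse=descending)


print(sorted(inverted["you"]))                              # ['doc1', 'doc2']
print(search("hello you"))                                  # ['doc1', 'doc2']
print(sort_by_age(search("hello you"), descending=True))    # ['doc2', 'doc1']
```

The search step goes from term to documents, the sort step goes from document to value, which is why one index cannot serve both efficiently.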

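Distributed search internals: the query phase

1. Query phase

(1) The search request is sent to a coordinating node, which builds a priority queue whose length is from + size from the paging parameters, 10 by default.
(2) The coordinating node forwards the request to every shard (one copy, primary or replica, of each); each shard runs the search locally and builds its own local priority queue.
(3) Each shard returns its priority queue to the coordinating node, which merges them into a global priority queue.

2. How replica shards increase search throughput

A single request has to hit one copy, primary or replica, of every shard. If each shard has several replicas, concurrent search requests can be spread across the different copies at the same time.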

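1. Fetch phase workflow

(1) Once the coordinating node has built the global priority queue, it sends mget (multi-get) requests to the shards that hold the selected documents.
(2) Each shard returns its documents to the coordinating node.
(3) The coordinating node merges the documents and returns the result to the client.

2. A search without from and size returns the first 10 hits by default, sorted by _score.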
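
The query and fetch phases can be sketched in Python with heapq standing in for the priority queues. The three in-memory shards, their scores and the document store are invented for the example; a real cluster does this over the network.

```python
import heapq

# Each "shard" maps doc id -> relevance score for the current query.
shards = [
    {"a": 0.9, "b": 0.4, "c": 0.7},
    {"d": 0.8, "e": 0.95},
    {"f": 0.1, "g": 0.6},
]
# Stored documents, looked up only in the fetch phase.
store = {doc_id: {"id": doc_id} for shard in shards for doc_id in shard}


def query_phase(from_, size):
    k = from_ + size  # every queue must hold from + size entries
    # Each shard builds its local priority queue of (score, doc id) pairs.
    local_queues = [
        heapq.nlargest(k, ((score, doc_id) for doc_id, score in shard.items()))
        for shard in shards
    ]
    # The coordinating node merges them into a global priority queue.
    merged = heapq.nlargest(k, (entry for queue in local_queues for entry in queue))
    return merged[from_:from_ + size]


def fetch_phase(entries):
    # Only the documents that made the final page are actually fetched.
    return [store[doc_id] for _score, doc_id in entries]


top = query_phase(0, 3)
print([doc_id for _score, doc_id in top])  # ['e', 'a', 'd']
print(fetch_phase(query_phase(2, 2)))      # [{'id': 'd'}, {'id': 'c'}]
```

Because every shard has to return from + size entries, deep paging makes both the local and the global queues grow, which is one reason it gets expensive.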

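Hands-on: scrolling through large result sets with scroll

Fetching, say, 100,000 documents in a single request performs very badly. The usual approach is a scroll query: retrieve the data batch by batch until everything has been read and processed.

A scroll search returns one batch, the next request returns the next batch, and so on until all the data has been retrieved.
On the first request, scroll saves a snapshot of the index as it was at that moment. Every later batch is served from that snapshot, so changes made to the data in the meantime are not visible.
Sorting by _doc gives the best performance.
Every scroll request must also carry a scroll parameter that specifies a time window for which the search context is kept alive; the next request only has to arrive within that window.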

GET /people/man/_search?scroll=1m
{
"query": { "match_all": {}},
"sort" : ["_doc"],
"size": 3
}

GET /_search/scroll
{
"scroll": "1m",
"scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAACh0FmNIYTh2a2k1UV82RzlYajVaWlFrWncAAAAAAAAocBZOZ21zeV9DTVR2cXdmSnltTF9tRkxRAAAAAAAAKHEWTmdtc3lfQ01UdnF3Zkp5bUxfbUZMUQAAAAAAAChyFk5nbXN5X0NNVHZxd2ZKeW1MX21GTFEAAAAAAAAocxZjSGE4dmtpNVFfNkc5WGo1WlpRa1p3"
}

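The response contains a scroll_id (returned as _scroll_id), and the next scroll request must carry it.

Scroll looks a lot like paging, but the use case differs. Paging fetches results page by page for display to a user; scroll retrieves data batch by batch for a system to process.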
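
The batch-by-batch loop and the snapshot behaviour can be sketched in Python against a fake backend. FakeScrollBackend and scroll_all are hypothetical names written for this example; they only mimic the shape of the two requests above and are not a real Elasticsearch client.

```python
class FakeScrollBackend:
    """In-memory stand-in for the initial search and the follow-up scroll calls."""

    def __init__(self, docs):
        self._live = docs      # the "index"; it may change while we scroll
        self._contexts = {}    # scroll_id -> [snapshot, position, size]

    def search(self, size):
        snapshot = list(self._live)  # point-in-time copy taken on first request
        scroll_id = "scroll-%d" % len(self._contexts)
        self._contexts[scroll_id] = [snapshot, 0, size]
        return self.scroll(scroll_id)

    def scroll(self, scroll_id):
        snapshot, pos, size = self._contexts[scroll_id]
        hits = snapshot[pos:pos + size]
        self._contexts[scroll_id][1] = pos + len(hits)
        return {"_scroll_id": scroll_id, "hits": hits}


def scroll_all(backend, size):
    """Yield every hit, one batch at a time, until an empty batch comes back."""
    page = backend.search(size)
    while page["hits"]:
        yield from page["hits"]
        page = backend.scroll(page["_scroll_id"])


index = list(range(7))
results = scroll_all(FakeScrollBackend(index), size=3)
first = next(results)   # triggers the initial search and the snapshot
index.append(99)        # a write that happens in the middle of the scroll
seen = [first] + list(results)
print(seen)  # [0, 1, 2, 3, 4, 5, 6]
```

The document written mid-scroll never shows up, and the loop ends on the first empty batch, which is also how a client knows the scroll is exhausted.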

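Customizing your dynamic mapping policy

1. Setting the dynamic policy

true: when an unknown field appears, map it dynamically
false: when an unknown field appears, ignore it
strict: when an unknown field appears, raise an error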

PUT /my_index
{
    "mappings":{
        "my_type":{
            "dynamic":"strict",
            "properties":{
                "title":{
                    "type":"text"
                },
                "address":{
                    "type":"object",
                    "dynamic":"true"
                }
            }
        }
    }
}

PUT /my_index/my_type/1
{
    "title":"my article",
    "content":"this is my article",
    "address":{
        "province":"guangdong",
        "city":"guangzhou"
    }
}

{
"error": {
"root_cause": [
{
"type": "strict_dynamic_mapping_exception",
"reason": "mapping set to strict, dynamic introduction of [content] within [my_type] is not allowed"
}
],
"type": "strict_dynamic_mapping_exception",
"reason": "mapping set to strict, dynamic introduction of [content] within [my_type] is not allowed"
},
"status": 400
}

The error means that no new fields can be added at the top level of this type.

PUT /my_index/my_type/1
{
    "title":"my article",
    "address":{
        "province":"guangdong",
        "city":"guangzhou"
    }
}

GET /my_index/_mapping/my_type

{
    "my_index":{
        "mappings":{
            "my_type":{
                "dynamic":"strict",
                "properties":{
                    "address":{
                        "dynamic":"true",
                        "properties":{
                            "city":{
                                "type":"text",
                                "fields":{
                                    "keyword":{
                                        "type":"keyword",
                                        "ignore_above":256
                                    }
                                }
                            },
                            "province":{
                                "type":"text",
                                "fields":{
                                    "keyword":{
                                        "type":"keyword",
                                        "ignore_above":256
                                    }
                                }
                            }
                        }
                    },
                    "title":{
                        "type":"text"
                    }
                }
            }
        }
    }
}

Inside the address field, however, "dynamic":"true" still allows new fields to be added.
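
The three policies, and the way an inner object can override its parent, can be modelled in a few lines of Python. This is a simplified sketch: apply_dynamic and StrictDynamicMappingError are names invented here, and every new leaf field is mapped as text regardless of its value.

```python
class StrictDynamicMappingError(Exception):
    pass


def apply_dynamic(mapping, doc, inherited="true"):
    """Index doc against mapping, updating the mapping according to its dynamic policy."""
    policy = mapping.get("dynamic", inherited)  # inner objects inherit by default
    props = mapping.setdefault("properties", {})
    indexed = {}
    for field, value in doc.items():
        if field not in props:
            if policy == "strict":
                raise StrictDynamicMappingError(field)
            if policy == "false":
                continue  # unknown field is ignored (not indexed)
            props[field] = {"properties": {}} if isinstance(value, dict) else {"type": "text"}
        if isinstance(value, dict):
            indexed[field] = apply_dynamic(props[field], value, policy)
        else:
            indexed[field] = value
    return indexed


mapping = {
    "dynamic": "strict",
    "properties": {
        "title": {"type": "text"},
        "address": {"type": "object", "dynamic": "true"},
    },
}

try:
    apply_dynamic(mapping, {"title": "my article", "content": "this is my article"})
except StrictDynamicMappingError as err:
    print("rejected field:", err)  # rejected field: content

apply_dynamic(mapping, {"title": "my article",
                        "address": {"province": "guangdong", "city": "guangzhou"}})
print(sorted(mapping["properties"]["address"]["properties"]))  # ['city', 'province']
```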

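2. Customizing dynamic mapping

(1) date_detection

By default Elasticsearch recognizes dates in certain formats, such as yyyy-MM-dd. If the first value that arrives for a field is 2017-01-01, the field is dynamically mapped as date, and a later value such as "hello world" then causes an error. You can turn date_detection off for a type and, where needed, map specific fields as date explicitly.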

PUT /my_index/_mapping/my_type
{
"date_detection": false
}
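
The failure mode that date_detection guards against can be reproduced with a small Python model. It assumes a single recognized format (yyyy-MM-dd) and a hypothetical index_values helper; the real detection supports more formats.

```python
from datetime import datetime


def looks_like_date(value):
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def index_values(values, date_detection=True):
    """Pick the field type from the first value, then check later values against it."""
    field_type = None
    for value in values:
        if field_type is None:
            is_date = date_detection and looks_like_date(value)
            field_type = "date" if is_date else "text"
        elif field_type == "date" and not looks_like_date(value):
            raise ValueError("cannot parse %r as date" % value)
    return field_type


values = ["2017-01-01", "hello world"]

try:
    index_values(values)
except ValueError as err:
    print("indexing failed:", err)

print(index_values(values, date_detection=False))  # text
```

With detection on, the first value locks the field to date and the second value is rejected; with detection off, both values are accepted as text.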

(2) Defining your own dynamic mapping template (type level)

PUT /my_index
{
    "mappings":{
        "my_type":{
            "dynamic_templates":[
                {
                    "en":{
                        "match":"*_en",
                        "match_mapping_type":"string",
                        "mapping":{
                            "type":"text",
                            "analyzer":"english"
                        }
                    }
                }
            ]
        }
    }
}

PUT /my_index/my_type/1
{
"title": "this is my first article"
}

PUT /my_index/my_type/2
{
"title_en": "this is my first article"
}

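title matches no dynamic template, so it gets the default standard analyzer, which does not remove stop words: is goes into the inverted index, and a search for is finds the document.
title_en matches the dynamic template and gets the english analyzer, which removes stop words: is is dropped, and a search for is finds nothing.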
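
The template matching and its effect on search can be sketched in Python. The stop-word set is a tiny illustrative subset, the analyzers are toy versions (the real english analyzer also stems), and pick_analyzer is a hypothetical helper, not an Elasticsearch API.

```python
from fnmatch import fnmatchcase

# Tiny illustrative subset of English stop words.
ENGLISH_STOP_WORDS = {"a", "an", "is", "the", "this"}

# One template, mirroring the "*_en" rule from the mapping above.
templates = [{"match": "*_en", "analyzer": "english"}]


def pick_analyzer(field_name):
    # First template whose pattern matches the field name wins.
    for template in templates:
        if fnmatchcase(field_name, template["match"]):
            return template["analyzer"]
    return "standard"


def analyze(text, analyzer):
    tokens = text.lower().split()
    if analyzer == "english":
        tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    return tokens


text = "this is my first article"
title_terms = analyze(text, pick_analyzer("title"))
title_en_terms = analyze(text, pick_analyzer("title_en"))

print(title_terms)             # ['this', 'is', 'my', 'first', 'article']
print(title_en_terms)          # ['my', 'first', 'article']
print("is" in title_terms)     # True: searchable in title
print("is" in title_en_terms)  # False: filtered out of title_en
```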