Elasticsearch In-Depth Search: Structured Search and Using the Java API

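I. Creating an index in ES

1. Create the index:

The earlier post on installing and using ES plugins covered creating an index with custom analyzers and creating a type as two separate steps. In fact, the type can be created, and its analyzer assigned, in the same request that creates the index.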

PUT /my_index
{
  "settings": {
        "analysis": {
            "analyzer": {
                "ik_smart_pinyin": {
                    "type": "custom",
                    "tokenizer": "ik_smart",
                    "filter": ["my_pinyin", "word_delimiter"]
                },
                "ik_max_word_pinyin": {
                    "type": "custom",
                    "tokenizer": "ik_max_word",
                    "filter": ["my_pinyin", "word_delimiter"]
                }
            },
            "filter": {
                "my_pinyin": {
                    "type" : "pinyin",
                    "keep_separate_first_letter" : true,
                    "keep_full_pinyin" : true,
                    "keep_original" : true,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "remove_duplicated_term" : true 
                }
            }
        }
  },
  "mappings": {
    "my_type":{
      "properties": {
        "id":{
          "type": "integer"
        },
        "name":{
          "type": "text",
          "analyzer": "ik_max_word_pinyin"
        },
        "age":{
          "type":"integer"
        }
      }
    }
  }
}

2. Add data

POST /my_index/my_type/_bulk
{ "index": { "_id":1}}
{ "id":1,"name": "张三","age":20}
{ "index": { "_id": 2}}
{ "id":2,"name": "张四","age":22}
{ "index": { "_id": 3}}
{ "id":3,"name": "张三李四王五","age":20}

3. View the field types

GET /my_index/my_type/_mapping

Result:
{
  "my_index": {
    "mappings": {
      "my_type": {
        "properties": {
          "age": {
            "type": "integer"
          },
          "id": {
            "type": "integer"
          },
          "name": {
            "type": "text",
            "analyzer": "ik_max_word_pinyin"
          }
        }
      }
    }
  }
}

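II. Using it from Java (ES must already be configured in the project; there are plenty of examples online to follow)

1. Create the ES entity class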

package com.example.es_query_list.entity.es;

import lombok.Getter;
import lombok.Setter;
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;

@Setter
@Getter
@Document(indexName = "my_index",type = "my_type")
public class User {
    @Id
    private Integer id;
    private String name;
    private Integer age;
}

2. Create the DAO layer

package com.example.es_query_list.repository.es;

import com.example.es_query_list.entity.es.User;
import org.springframework.data.elasticsearch.repository.ElasticsearchRepository;

public interface EsUserRepository extends ElasticsearchRepository<User,Integer> {
}

III. With the groundwork done, start querying

1. Exact-value queries

Querying a non-text field

GET /my_index/my_type/_search
{
  "query": {
    "term": {
      "age": {
        "value": "20"
      }
    }
  }
}


Result:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "name": "张三",
          "age": 20
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "name": "李四",
          "age": 20
        }
      }
    ]
  }
}

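2. Querying a text field

Running the same kind of term query against the name field with the value 张三李四王五 returns: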

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

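At this point you may be wondering why the result is empty: why can an exact match not find the exact value that was entered? As mentioned earlier, the field was given an analyzer when the type was created, and a term query can only find terms that the analyzer actually produced. Let's prove it.

First, look at how the value we searched for is broken into tokens: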

GET my_index/_analyze
{
  "text":"张三李四王五",
  "analyzer": "ik_max_word"
}

Result:
{
  "tokens": [
    {
      "token": "张三李四",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "张三",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "三",
      "start_offset": 1,
      "end_offset": 2,
      "type": "TYPE_CNUM",
      "position": 2
    },
    {
      "token": "李四",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "四",
      "start_offset": 3,
      "end_offset": 4,
      "type": "TYPE_CNUM",
      "position": 4
    },
    {
      "token": "王",
      "start_offset": 4,
      "end_offset": 5,
      "type": "CN_CHAR",
      "position": 5
    },
    {
      "token": "五",
      "start_offset": 5,
      "end_offset": 6,
      "type": "TYPE_CNUM",
      "position": 6
    }
  ]
}

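The result shows that 张三李四王五 was never indexed as the single term 张三李四王五, which is why the query returns nothing.

Solution: update the field in the type with a custom sub-field of type keyword. A keyword field in ES is not run through an analyzer, so it holds the value exactly as it was entered.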

POST /my_index/_mapping/my_type
{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "ik_max_word_pinyin",
      "fields": {
        "keyword": {  // custom sub-field name
          "type": "keyword"
        }
      }
    }
  }
}

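Once the mapping has been updated, delete the existing data and index it again; the same term query, run against name.keyword, then returns the document.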
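To make the difference between the two fields concrete, here is a minimal plain-Java sketch. It does not talk to Elasticsearch: the class and method names are hypothetical, the token set is copied from the _analyze output above, and treating a term query as a plain set lookup is a simplifying assumption that ignores the pinyin filter and scoring.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the two ways the name value is indexed; not Elasticsearch code.
class TermLookupSketch {
    // Tokens taken from the _analyze output for 张三李四王五.
    static final Set<String> NAME_TOKENS = new HashSet<>(Arrays.asList(
            "张三李四", "张三", "三", "李四", "四", "王", "五"));

    // A keyword sub-field stores the whole value as one term.
    static final Set<String> NAME_KEYWORD_TERMS = Collections.singleton("张三李四王五");

    // A term query looks its input up as-is, without analyzing it.
    static boolean termQuery(Set<String> indexedTerms, String value) {
        return indexedTerms.contains(value);
    }

    public static void main(String[] args) {
        String value = "张三李四王五";
        System.out.println("name: " + termQuery(NAME_TOKENS, value));
        System.out.println("name.keyword: " + termQuery(NAME_KEYWORD_TERMS, value));
    }
}
```

In this model the lookup misses on the analyzed tokens and hits on the keyword term, which mirrors the empty result and the fix described above.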

// Assumes `template` is an ElasticsearchTemplate injected into this class.
public List<User> termQuery() {
    // Exact match on the numeric age field.
    QueryBuilder queryBuilder = QueryBuilders.termQuery("age", 20);
    // Exact match on the un-analyzed keyword sub-field instead:
    // QueryBuilder queryBuilder = QueryBuilders.termQuery("name.keyword", "张三李四王五");
    SearchQuery searchQuery = new NativeSearchQueryBuilder()
            .withIndices("my_index")
            .withTypes("my_type")
            .withQuery(queryBuilder)
            .build();

    return template.queryForList(searchQuery, User.class);
}

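IV. Combining filters

Bool filter

Note: the official guide is out of date on this point. From 5.x on, filtered has been replaced by bool: "The filtered query is replaced by the bool query."

A bool filter is made up of three parts: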

{
   "bool" : {
      "must" :     [],
      "should" :   [],
      "must_not" : []
   }
}

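must: all of the clauses must match; equivalent to AND.

must_not: none of the clauses may match; equivalent to NOT.

should: at least one of the clauses must match; equivalent to OR. Note that when the same bool also contains a must or filter clause, the should clauses become optional by default and only affect scoring, unless minimum_should_match is set.

It is that simple! When several filters are needed, just place them in the appropriate parts of a bool filter.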

GET /my_index/my_type/_search
{
  "query" : {
            "bool" : {
              "should" : [
                 { "term" : {"age" : 20}}, 
                 { "term" : {"age" : 30}} 
              ],
              "must" : {
                 "term" : {"name.keyword" : "张三"} 
              }
           }
      }
}

 public List<User> boolQuery() {
        BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
        boolQueryBuilder.should(QueryBuilders.termQuery("age",20));
        boolQueryBuilder.should(QueryBuilders.termQuery("age",30));
        boolQueryBuilder.must(QueryBuilders.termQuery("name.keyword","张三"));
        SearchQuery searchQuery = new NativeSearchQueryBuilder()
                .withIndices("my_index")
                .withTypes("my_type")
                .withQuery(boolQueryBuilder)
                .build();
        List<User> list = template.queryForList(searchQuery,User.class);
        return list;
    }
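The must / should / must_not semantics can also be mimicked in memory with plain java.util.function.Predicate objects. The sketch below is only an illustration: the Doc class, the bool helper and exampleHits are hypothetical names, and it follows the simplified reading in which a non-empty should list always requires at least one match (real Elasticsearch relaxes this when a must clause is present).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical in-memory stand-in for a bool filter; this is not the Elasticsearch API.
class BoolFilterSketch {
    static class Doc {
        final String name;
        final int age;
        Doc(String name, int age) { this.name = name; this.age = age; }
    }

    // must: every clause matches (AND); mustNot: no clause matches (NOT);
    // should: at least one clause matches (OR), checked only when the list is non-empty.
    static Predicate<Doc> bool(List<Predicate<Doc>> must,
                               List<Predicate<Doc>> should,
                               List<Predicate<Doc>> mustNot) {
        return doc -> must.stream().allMatch(clause -> clause.test(doc))
                && mustNot.stream().noneMatch(clause -> clause.test(doc))
                && (should.isEmpty() || should.stream().anyMatch(clause -> clause.test(doc)));
    }

    // The three documents indexed earlier in this post.
    static List<Doc> sampleDocs() {
        return Arrays.asList(new Doc("张三", 20), new Doc("张四", 22), new Doc("张三李四王五", 20));
    }

    // Mirrors the request above: should age 20 or 30, must name.keyword = 张三.
    static List<String> exampleHits() {
        Predicate<Doc> ageIs20 = doc -> doc.age == 20;
        Predicate<Doc> ageIs30 = doc -> doc.age == 30;
        Predicate<Doc> nameIsZhangSan = doc -> doc.name.equals("张三");
        List<Predicate<Doc>> noClauses = Collections.emptyList();

        Predicate<Doc> filter = bool(
                Collections.singletonList(nameIsZhangSan),
                Arrays.asList(ageIs20, ageIs30),
                noClauses);

        List<String> hits = new ArrayList<>();
        for (Doc doc : sampleDocs()) {
            if (filter.test(doc)) hits.add(doc.name);
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(exampleHits());
    }
}
```

Because bool returns an ordinary Predicate, one bool can be passed as a clause of another, which is the same composability the nested example relies on.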

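Nested bool filters

Although bool is a compound filter that accepts multiple sub-filters, it is still just a filter itself. This means one bool filter can be placed inside another, which makes it possible to express arbitrarily complex Boolean logic.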

GET /my_index/my_type/_search
{
  "query" : {
            "bool" : {
              "should" : [
                 { "term" : {"age" : 20}}, 
                 { "bool" : {
                   "must": [
                     {"term": {
                       "name.keyword": {
                         "value": "李四"
                       }
                     }}
                   ]
                 }} 
              ]
           }
      }
}

Result:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "id": 1,
          "name": "张三",
          "age": 20
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "id": 3,
          "name": "张三李四王五",
          "age": 20
        }
      }
    ]
  }
}

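The term and bool filters are siblings: both sit inside the outer should, so a returned document must match at least one of them.

Clauses placed side by side inside a must are siblings too, and a document has to match all of them at once. In this example the inner must holds a single term clause on name.keyword, and no document has the name 李四, so both hits come from the age condition.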

V. Finding multiple exact values

GET my_index/my_type/_search
{
  "query": {
    "terms": {
      "age": [
        20,
        22
      ]
    }
  }
}

Result:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "id": 2,
          "name": "张四",
          "age": 22
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "id": 1,
          "name": "张三",
          "age": 20
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "id": 3,
          "name": "张三李四王五",
          "age": 20
        }
      }
    ]
  }
}

It is important to understand that term and terms are contains operations, not equals checks.

TermsQueryBuilder termsQueryBuilder = QueryBuilders.termsQuery("age",list);
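The contains-versus-equals distinction matters most for multi-valued fields. The following plain-Java sketch uses a hypothetical tags field holding two values; the class and method names are made up for illustration, and the list lookup is only a rough model of how term and terms behave.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of "contains" versus "equals" on a multi-valued field.
class ContainsVsEqualsSketch {
    // term: the field's values include the wanted term.
    static boolean term(List<String> fieldValues, String wanted) {
        return fieldValues.contains(wanted);
    }

    // terms: the field's values include at least one of the wanted terms.
    static boolean terms(List<String> fieldValues, List<String> wanted) {
        for (String candidate : wanted) {
            if (fieldValues.contains(candidate)) return true;
        }
        return false;
    }

    // What a strict "equals" would mean: the field holds exactly this one value.
    static boolean equalsExactly(List<String> fieldValues, String wanted) {
        return fieldValues.size() == 1 && fieldValues.get(0).equals(wanted);
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("search", "open_source");
        System.out.println("term: " + term(tags, "search"));
        System.out.println("terms: " + terms(tags, Arrays.asList("search", "nosql")));
        System.out.println("equals: " + equalsExactly(tags, "search"));
    }
}
```

A document tagged with both search and open_source is matched by a term lookup for search, even though its tags field is not equal to search alone.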

VI. Range queries

1. Numeric range queries

GET my_index/my_type/_search
{
  "query": {
    "range": {
      "age": {
        "gte": 10,
        "lte": 20
      }
    }
  }
}

Result:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "id": 1,
          "name": "张三",
          "age": 20
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "id": 3,
          "name": "张三李四王五",
          "age": 20
        }
      }
    ]
  }
}

Note: gt (greater than), gte (greater than or equal to), lt (less than), lte (less than or equal to)

RangeQueryBuilder rangeQueryBuilder = QueryBuilders.rangeQuery("age").gte(10).lte(20);

2. Date range queries

Update the type to add a date field

POST /my_index/_mapping/my_type
{
  "properties": {
    "date": {
      "type": "date",
      "format": "yyyy-MM-dd"
    }
  }
}

Add data:

POST /my_index/my_type/_bulk
{ "index": { "_id":4}}
{ "id":4,"name": "赵六","age":20,"date":"2018-10-1"}
{ "index": { "_id": 5}}
{ "id":5,"name": "对七","age":22,"date":"2018-11-20"}
{ "index": { "_id": 6}}
{ "id":6,"name": "王八","age":20,"date":"2018-7-28"}

Query:

GET my_index/my_type/_search
{
  "query": {
    "range": {
      "date": {
        "gte": "2018-10-20",
        "lte": "2018-11-29"
      }
    }
  }
}

Result:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "5",
        "_score": 1,
        "_source": {
          "id": 5,
          "name": "对七",
          "age": 22,
          "date": "2018-11-20"
        }
      }
    ]
  }
}

RangeQueryBuilder rangeQueryBuilder = QueryBuilders.rangeQuery("date").gte("2018-10-20").lte("2018-11-29");
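The inclusive gte / lte bounds can be checked by hand with java.time. This is a plain-Java sketch, not Elasticsearch code: the class and method names are hypothetical, and the lenient pattern yyyy-M-d is an assumption of the sketch, chosen so that values such as 2018-10-1 parse; it is not the yyyy-MM-dd format declared in the mapping.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory version of the date range query above.
class DateRangeSketch {
    // Assumption: single-letter M and d accept both one- and two-digit values.
    private static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("yyyy-M-d");

    static LocalDate parse(String text) {
        return LocalDate.parse(text, FORMAT);
    }

    // The three documents added above, keyed by name.
    static Map<String, String> sampleDocs() {
        Map<String, String> dateByName = new LinkedHashMap<>();
        dateByName.put("赵六", "2018-10-1");
        dateByName.put("对七", "2018-11-20");
        dateByName.put("王八", "2018-7-28");
        return dateByName;
    }

    // gte and lte are inclusive: keep dates d with gte <= d <= lte.
    static List<String> namesInRange(Map<String, String> dateByName, String gte, String lte) {
        LocalDate from = parse(gte);
        LocalDate to = parse(lte);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : dateByName.entrySet()) {
            LocalDate date = parse(entry.getValue());
            if (!date.isBefore(from) && !date.isAfter(to)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(namesInRange(sampleDocs(), "2018-10-20", "2018-11-29"));
    }
}
```

Only 对七 (2018-11-20) falls inside the range, which agrees with the single hit in the result above.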

VII. Handling null values

1. Add data

POST /my_index/posts/_bulk
{ "index": { "_id": "1"              }}
{ "tags" : ["search"]                }  
{ "index": { "_id": "2"              }}
{ "tags" : ["search", "open_source"] }  
{ "index": { "_id": "3"              }}
{ "other_field" : "some data"        }  
{ "index": { "_id": "4"              }}
{ "tags" : null                      }  
{ "index": { "_id": "5"              }}
{ "tags" : ["search", null]          }

2. Find documents in which a given field exists

GET /my_index/posts/_search
{
    "query" : {
        "constant_score" : {    // scores are not computed; every hit gets a score of 1
            "filter" : {
                "exists" : { "field" : "tags" }
            }
        }
    }
}

Result:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "posts",
        "_id": "5",
        "_score": 1,
        "_source": {
          "tags": [
            "search",
            null
          ]
        }
      },
      {
        "_index": "my_index",
        "_type": "posts",
        "_id": "2",
        "_score": 1,
        "_source": {
          "tags": [
            "search",
            "open_source"
          ]
        }
      },
      {
        "_index": "my_index",
        "_type": "posts",
        "_id": "1",
        "_score": 1,
        "_source": {
          "tags": [
            "search"
          ]
        }
      }
    ]
  }
}

BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
boolQueryBuilder.filter(QueryBuilders.constantScoreQuery(QueryBuilders.existsQuery("tags")));

3. Find documents in which a given field is missing

Note: the missing query was removed in ES 5; combine must_not with exists instead.

GET /my_index/posts/_search
{
    "query" : {
        "bool": {
          "must_not": [
            {"constant_score": {
              "filter": {
                "exists": {
                "field": "tags"
              }}
            }}
          ]
        }
    }
}


Query result:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "posts",
        "_id": "4",
        "_score": 1,
        "_source": {
          "tags": null
        }
      },
      {
        "_index": "my_index",
        "_type": "posts",
        "_id": "3",
        "_score": 1,
        "_source": {
          "other_field": "some data"
        }
      }
    ]
  }
}

Note on null values: a field with no content is treated as null, that is, as if the field were missing.

boolQueryBuilder.mustNot(QueryBuilders.boolQuery().filter(QueryBuilders.constantScoreQuery(QueryBuilders.existsQuery("tags"))));
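The exists / missing split over the five sample posts can be reproduced with a small plain-Java sketch. It is not Elasticsearch code: the class and helper names are hypothetical, and the rule it encodes (a field exists when it holds at least one non-null value) is the behaviour suggested by the two results above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of exists and must_not exists on the posts data.
class ExistsSketch {
    // A field "exists" when it holds at least one non-null value.
    static boolean exists(Map<String, Object> source, String field) {
        Object value = source.get(field);
        if (value == null) return false;
        if (value instanceof List) {
            for (Object item : (List<?>) value) {
                if (item != null) return true;
            }
            return false;
        }
        return true;
    }

    static Map<String, Object> doc(String field, Object value) {
        Map<String, Object> source = new HashMap<>();
        source.put(field, value);
        return source;
    }

    // The five posts from the bulk request above, keyed by _id.
    static Map<String, Map<String, Object>> samplePosts() {
        Map<String, Map<String, Object>> posts = new LinkedHashMap<>();
        posts.put("1", doc("tags", Arrays.asList("search")));
        posts.put("2", doc("tags", Arrays.asList("search", "open_source")));
        posts.put("3", doc("other_field", "some data"));
        posts.put("4", doc("tags", null));
        posts.put("5", doc("tags", Arrays.asList("search", null)));
        return posts;
    }

    // wantExists = true models exists; false models must_not + exists.
    static List<String> idsWhereTags(boolean wantExists) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Map<String, Object>> entry : samplePosts().entrySet()) {
            if (exists(entry.getValue(), "tags") == wantExists) ids.add(entry.getKey());
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println("exists:  " + idsWhereTags(true));
        System.out.println("missing: " + idsWhereTags(false));
    }
}
```

Posts 1, 2 and 5 come back for exists and posts 3 and 4 for the negated form, the same split as in the two query results.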

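VIII. About caching

1. The core

At its core, a bitset records which documents match a filter. Elasticsearch aggressively caches these bitsets for later use. Once cached, a bitset can be reused wherever the same filter appears, without evaluating the whole filter again.

These cached bitsets are "smart": they are updated incrementally. As new documents are indexed, only those documents need to be added to the existing bitsets, rather than recomputing the entire cache over and over. Like the rest of the system, filters are real time, so there is no need to worry about cache expiry.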

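2. Independent filter caching

The bitset belonging to a query component is independent of the rest of the search request it appears in. This means that, once cached, a query can be reused across multiple search requests; the bitset does not depend on its surrounding query context. This lets caching speed up the most frequently used portions of queries, without wasting overhead on the less frequent, more volatile portions.

Similarly, if a single request uses the same non-scoring query more than once, its cached bitset can be reused by every instance within that search.

Consider the query in the following example, which looks for emails that satisfy either of these conditions:

Conditions (example): (1) in the inbox and not yet read; (2) not in the inbox but marked as important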

GET /inbox/emails/_search
{
  "query": {
      "constant_score": {
          "filter": {
              "bool": {
                 "should": [
                    { "bool": {    // 1
                          "must": [
                             { "term": { "folder": "inbox" }}, 
                             { "term": { "read": false }}
                          ]
                    }},
                    { "bool": {    // 2
                          "must_not": {
                             "term": { "folder": "inbox" } 
                          },
                          "must": {
                             "term": { "important": true }
                          }
                    }}
                 ]
              }
            }
        }
    }
}

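1 and 2 share one filter, so they use the same bitset.

Although one of the inbox clauses sits in a must and the other in a must_not, the two clauses themselves are identical. This means that once the first one has been executed, its bitset is computed and cached for the other to use. When the query runs again, the inbox filter is already cached, so both clauses use the cached bitset.

This fits well with the composability of the query DSL. A filter can easily be moved anywhere in the query or reused in several places within the same query. That is not only convenient for developers but also directly benefits performance.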
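The reuse of one bitset by both inbox clauses can be sketched with java.util.BitSet. This is a toy model, not how Elasticsearch is implemented: the four emails, the cache keyed by a string, and the computations counter are all assumptions made for the illustration.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical filter cache: one bitset per filter, shared by every clause that uses it.
class BitsetCacheSketch {
    // Four made-up emails; the array index plays the role of the document id.
    static final String[] FOLDER = {"inbox", "inbox", "archive", "archive"};
    static final boolean[] READ = {false, true, false, true};
    static final boolean[] IMPORTANT = {false, false, true, false};

    final Map<String, BitSet> cache = new HashMap<>();
    int computations = 0; // how many times a folder bitset was actually computed

    // Returns the cached bitset for folder == value, computing it only on a cache miss.
    BitSet folderIs(String value) {
        return cache.computeIfAbsent("folder:" + value, key -> {
            computations++;
            BitSet bits = new BitSet(FOLDER.length);
            for (int doc = 0; doc < FOLDER.length; doc++) {
                if (FOLDER[doc].equals(value)) bits.set(doc);
            }
            return bits;
        });
    }

    BitSet search() {
        // Clause 1: in the inbox AND not read (copy so the cached bitset stays intact).
        BitSet unreadInInbox = (BitSet) folderIs("inbox").clone();
        for (int doc = 0; doc < READ.length; doc++) {
            if (READ[doc]) unreadInInbox.clear(doc);
        }
        // Clause 2: important AND NOT in the inbox, reusing the same cached bitset.
        BitSet importantElsewhere = new BitSet(IMPORTANT.length);
        for (int doc = 0; doc < IMPORTANT.length; doc++) {
            if (IMPORTANT[doc]) importantElsewhere.set(doc);
        }
        importantElsewhere.andNot(folderIs("inbox"));
        // should: OR the two clauses together.
        unreadInInbox.or(importantElsewhere);
        return unreadInInbox;
    }

    public static void main(String[] args) {
        BitsetCacheSketch sketch = new BitsetCacheSketch();
        System.out.println(sketch.search() + " computed " + sketch.computations + " time(s)");
    }
}
```

In this model the inbox bitset is built once and then serves both the must and the must_not clause; running search() again on the same instance does not recompute it.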

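3. Automatic caching behavior

In earlier versions of Elasticsearch, the default behavior was to cache everything that could be cached. This often meant the system cached bitsets too aggressively, and performance suffered from the resulting cache churn. Moreover, many filters are very fast to evaluate yet substantially slower to cache (and to reuse from cache). Caching such filters makes little sense, because it is cheaper simply to run the filter again.

Inspecting the inverted index is very fast, and most query components are used only rarely. Take a term filter on a "user_id" field: with millions of users, any particular user ID occurs rarely. Caching the bitset for this filter does not pay off, because the cached result will probably be evicted before it is used again.

This kind of cache churn can seriously affect performance. Worse, it makes it hard for developers to tell which caches are performing well and which are useless.

To address this, Elasticsearch automatically caches queries based on usage frequency. If a non-scoring query has been used a few times (how many depends on the query type) within the last 256 queries, it becomes a candidate for caching. Even then, not every segment is guaranteed to cache its bitset: only segments holding more than 10,000 documents (or more than 3% of the total document count) do. Small segments are quick to search and are soon merged away, so caching them brings little benefit.

Once cached, a non-scoring bitset stays in the cache until it is evicted. Eviction follows an LRU policy: when the cache is full, the least recently used filter is evicted.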
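LRU eviction itself is easy to demonstrate with a java.util.LinkedHashMap in access order. The class below is a generic sketch with a hypothetical name and a fixed entry-count capacity; it says nothing about how Elasticsearch sizes or implements its own query cache.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache: evicts the least recently used entry once capacity is exceeded.
class LruFilterCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruFilterCacheSketch(int capacity) {
        super(16, 0.75f, true); // true = iterate in access order, oldest access first
        this.capacity = capacity;
    }

    // Called by LinkedHashMap after each insertion; returning true drops the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruFilterCacheSketch<String, String> cache = new LruFilterCacheSketch<>(2);
        cache.put("folder:inbox", "bitset A");
        cache.put("read:false", "bitset B");
        cache.get("folder:inbox");              // touch A so that B becomes the eldest
        cache.put("important:true", "bitset C"); // cache is full, so B is evicted
        System.out.println(cache.keySet());
    }
}
```

After the last put, the cache holds the inbox and important entries; the read:false entry, being the least recently used, is the one that was dropped.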

If a person has no dreams, how are they any different from a salted fish?