Elasticsearch Notes

A quick comparison of ES and relational databases

RDBMS	Elasticsearch
Table	Index(Type)
Row	Document
Column	Field
Schema	Mapping
SQL	DSL

## Index information
GET kibana_sample_data_ecommerce

## Total document count
GET kibana_sample_data_ecommerce/_count

## _cat indices API
## Wildcard match on index names
GET /_cat/indices/kibana_*
## Sort by document count
GET /_cat/indices?v&s=docs.count:desc
## Basic information about a single index
GET /_cat/indices/kibana_sample_data_ecommerce?v

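The default cluster name is elasticsearch

Shards are either Primary Shards or Replica Shards

The number of primary shards is set when the index is created and cannot be changed afterwards, except by reindexing

The number of replica shards can be adjusted dynamically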
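
The fixed primary shard count follows from how documents are routed: a document's shard is derived from a hash of its routing value (by default the `_id`) modulo the number of primary shards. The sketch below is a simplified illustration of that idea, not Elasticsearch's actual implementation; `pick_shard` is a hypothetical helper, and `zlib.crc32` stands in for the Murmur3 hash that Elasticsearch uses.

```python
import zlib


def pick_shard(routing: str, num_primary_shards: int) -> int:
    """Map a routing value (by default the document _id) to a primary shard."""
    # Stand-in hash: crc32 is used only because it is in the standard library
    # and deterministic. Elasticsearch itself uses Murmur3.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


doc_ids = ["1", "2", "3", "4", "5", "6"]
with_three = {doc_id: pick_shard(doc_id, 3) for doc_id in doc_ids}
with_four = {doc_id: pick_shard(doc_id, 4) for doc_id in doc_ids}

# Documents whose shard would change if the primary shard count changed.
moved = [d for d in doc_ids if with_three[d] != with_four[d]]
print(with_three)
print(with_four)
print("documents that would land on a different shard:", moved)
```

If the modulus changed, existing documents could no longer be found on the shard their id maps to, which is why a change requires reindexing.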

## Cluster health
GET _cluster/health

GET _cat/nodes?v
GET _cat/shards?v

index                        shard prirep state   docs   store ip         node
.apm-agent-configuration     0     p      STARTED    0    208b 172.18.0.2 12b52a46e43f
.kibana_1                    0     p      STARTED   94 967.7kb 172.18.0.2 12b52a46e43f
kibana_sample_data_ecommerce 0     p      STARTED 4675   4.5mb 172.18.0.2 12b52a46e43f
.apm-custom-link             0     p      STARTED    0    208b 172.18.0.2 12b52a46e43f
.kibana_task_manager_1       0     p      STARTED    5  55.2kb 172.18.0.2 12b52a46e43f

Simple CRUD

## Auto-generated id
POST my_index/_doc/
{
  "user":"xiaoting",
  "comment":"you know for search"
}

## User-specified id; repeated PUTs to the same id increment the version
PUT my_index/_doc/2
{
  "user":"xiaoting",
  "comment":"you know for search"
}

## Read
GET my_index/_doc/2

## Search
GET my_index/_search
{
  "query":{
    "match_all":{}
  }
}

## Add a field to an existing document; with PUT you must send every field, otherwise the omitted fields are lost
POST my_index/_update/2
{
  "doc":{
    "post_date":"2020-05-21"
  }
}

## Delete
DELETE my_index/_doc/2

## Multi-get
GET _mget
{
  "docs": [
    {
      "_index": "my_index",
      "_id": 1
    },
    {
      "_index": "my_index",
      "_id": 2
    }
  ]
}

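Inverted index

Forward index: like a book's table of contents (document to content)

Inverted index: like the index pages at the back of a book (term to documents)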
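
To make the analogy concrete, here is a minimal sketch of building an inverted index in Python. It assumes naive whitespace tokenization and lowercasing; `build_inverted_index` and the two sample documents are made up for illustration, and a real Lucene index also stores positions, frequencies, and more.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map every term to the sorted list of document ids that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical documents: the forward view is doc id -> text.
docs = {1: "you know for search", 2: "search for you"}
inverted = build_inverted_index(docs)

print(inverted["search"])  # ids of the documents containing "search"
print(inverted["know"])
```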

Analysis (analyzers)

An analyzer is made of three parts, applied in this order:

Character Filters, Tokenizer, Token Filters
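
The three stages can be sketched as plain functions chained together. This is only an illustration of the pipeline order, under simplifying assumptions: the character filter strips HTML tags with a regex, the tokenizer splits on whitespace, and `STOP_WORDS` is a tiny made-up list rather than the real English stop word set.

```python
import re

STOP_WORDS = {"a", "is", "the", "in", "are"}  # tiny hypothetical list


def char_filter(text: str) -> str:
    """Stage 1: rewrite the raw text (here: strip HTML tags)."""
    return re.sub(r"<[^>]+>", "", text)


def tokenizer(text: str) -> list[str]:
    """Stage 2: split the text into tokens (here: on whitespace)."""
    return text.split()


def token_filters(tokens: list[str]) -> list[str]:
    """Stage 3: transform tokens (here: lowercase, then drop stop words)."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOP_WORDS]


def analyze(text: str) -> list[str]:
    return token_filters(tokenizer(char_filter(text)))


print(analyze("<b>Liuchenglong is a student</b>"))
```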

## Analyze text with an explicitly named analyzer
GET /_analyze
{
  "analyzer": "standard",
  "text": "liuchenglong is a student"
}

## Analyze text as a given field of an index would, to preview how that field's analyzer tokenizes it
GET my_index/_analyze
{
  "field": "user",
  "text": "xiaoting"
}

## Analyze text with an ad hoc analyzer (tokenizer plus filters)
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase"
  ],
  "text": "liuchenglong is a student"
}

Standard Analyzer is the default analyzer

GET /_analyze
{
  "analyzer": "standard",
  "text": "Liuchenglong in the house"
}

GET /_analyze
{
  "analyzer": "simple",
  "text": "Liuchenglong in the house"
}

GET /_analyze
{
  "analyzer": "whitespace",
  "text": "Liuchenglong in the house"
}

GET /_analyze
{
  "analyzer": "stop",
  "text": "Liuchenglong in the house"
}

GET /_analyze
{
  "analyzer": "keyword",
  "text": "Liuchenglong in the house"
}

GET /_analyze
{
  "analyzer": "pattern",
  "text": "Liuchenglong in the house"
}

GET /_analyze
{
  "analyzer": "english",
  "text": "Liuchenglong in the house"
}

## ik, a Chinese analyzer plugin (must be downloaded and installed separately)
GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "江苏省无锡市滨湖区溪北新村"
}

GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "江苏省无锡市滨湖区溪北新村"
}

Search API

1. URL Search: pass the query string in the q parameter

2. Request Body Search: use GET or POST and write the query in the request body with the ES DSL

/_search
/index1/_search
/index1,index2/_search
/index*/_search

URL Search

## q is the query text, df is the default field to search
GET my_index/_search?q=chenglong&df=user
GET my_index/_search?q=user:chenglong

## Adding profile: true shows how the query was executed
GET my_index/_search?q=chenglong&df=user
{
  "profile": "true"
}

## PhraseQuery
GET my_index/_search?q=comment:"you know"
## BooleanQuery
GET my_index/_search?q=comment:you know
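## Several terms on one field: wrap them in ()
GET my_index/_search?q=comment:(you know)
## "comment:you comment:and comment:know" (lowercase and is treated as an ordinary term; the operator must be uppercase AND)
GET my_index/_search?q=comment:(you and know)
## "comment:you comment:not comment:know" (lowercase not is also an ordinary term; the operator must be uppercase NOT)
GET my_index/_search?q=comment:(you not know)
## "comment:you +comment:know"   %2B is the URL-encoded + sign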
GET my_index/_search?q=comment:(you %2Bknow)
## Range query
GET my_index/_search?q=year:>2020
## Wildcard query
GET my_index/_search?q=user:ch*
## Fuzzy match: this matches chenglong
GET my_index/_search?q=user:chengleng~1
## Proximity match: this finds you know for search
GET my_index/_search?q=comment:"you for"~2
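
The `~1` in the fuzzy query above is a maximum edit distance. As a rough illustration of why `chengleng~1` can match `chenglong`, the sketch below computes plain Levenshtein distance. This is an assumption-laden simplification: `edit_distance` is a hypothetical helper, and Elasticsearch's fuzzy matching by default also counts a transposition of two adjacent characters as a single edit, which this version does not.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep when equal)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("chengleng", "chenglong"))  # one substitution: e -> o
```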

Request Body Search

## Pagination
GET my_index/_search
{
  "query": {
    "match_all": {}
  },
  "from": 0,
  "size": 20
}

## Sort by a given field
GET my_index/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {"_score": {"order": "desc"}}
  ]
}

## Return only the specified fields
GET my_index/_search
{
  "query": {
    "match_all": {}
  },
  "_source": ["user"]
}

## match query (the query text is analyzed into term queries)
GET my_index/_search
{
  "query": {
    "match": {
      "user":"Chenglong"
    }
  }
}

## Specify the operator used to combine the terms
GET my_index/_search
{
  "query": {
    "match": {
      "user":{
        "query": "Chenglong",
        "operator": "and"
      }
    }
  }
}

## match_phrase accepts a slop, the number of positions terms may be moved; the query below finds you know for search
GET my_index/_search
{
  "query": {
    "match_phrase": {
      "comment":{
        "query": "you for",
        "slop": 1
      }
    }
  }
}
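
To see why `slop: 1` is enough here, the sketch below estimates the smallest slop a phrase needs against a tokenized document. It is a simplified model, not Lucene's matcher: `min_slop` is a hypothetical helper that shifts each matched position by the term's offset in the phrase and measures the spread, and it ignores edge cases such as repeated terms in the phrase.

```python
from itertools import product


def min_slop(phrase_terms: list[str], doc_terms: list[str]):
    """Smallest slop under which the phrase matches, or None if a term is missing."""
    positions = [
        [pos for pos, tok in enumerate(doc_terms) if tok == term]
        for term in phrase_terms
    ]
    if any(not p for p in positions):
        return None
    best = None
    for combo in product(*positions):
        # In an exact phrase match every shifted position is identical.
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        width = max(shifted) - min(shifted)
        best = width if best is None else min(best, width)
    return best


doc = "you know for search".split()
print(min_slop(["you", "for"], doc))   # one word sits between the two terms
print(min_slop(["you", "know"], doc))  # adjacent and in order
```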

Script fields

GET my_index/_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "userName": {
      "script": {
        "lang": "painless",
        "source": "doc['user.keyword'].value + 's'"
      }
    }
  }
}

Mapping

Roughly comparable to a schema definition in a database.

Simple types

Text / Keyword

Date

Integer / Floating point

Boolean

IPv4 & IPv6

Complex types: objects and nested objects

Object type / Nested type

Special types

geo_point & geo_shape / percolator

Dynamic Mapping

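When a document is written to an index that does not exist, the index is created automatically and field types are inferred

## View the mapping
GET my_index/_mapping

The type of an existing field cannot be changed; the index has to be rebuilt with the Reindex API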

## dynamic can be set when the index is created; allowed values are true, false and strict, and the default is true
PUT movies
{
  "mappings": {
    "dynamic": "strict"
  }
}

Custom Mapping

## Create an index in which mobile is not indexed
PUT movies
{
  "mappings": {
    "properties": {
      "firstName": {
        "type": "text"
      },
      "lastName": {
        "type": "text"
      },
      "mobile": {
        "type": "text",
        "index": false
      }
    }
  }
}

## Insert data
PUT movies/_doc/1
{
  "firstName": "Liu",
  "lastName": "Chenglong",
  "mobile": "1234567890"
}

## Searching on that field fails with:
## failed to create query: Cannot search on field [mobile] since it is not indexed.
POST /movies/_search
{
  "query": {
    "match": {
      "mobile": "123"
    }
  }
}

## null_value
PUT movies
{
  "mappings": {
    "properties": {
      "firstName": {
        "type": "text"
      },
      "lastName": {
        "type": "text"
      },
      "mobile": {
        "type": "keyword",
        "null_value": "NULL"
      }
    }
  }
}

PUT movies/_doc/1
{
  "firstName": "Liu",
  "lastName": "Chenglong",
  "mobile": null
}

PUT movies/_doc/2
{
  "firstName": "Liu",
  "lastName": "Chenglong2"
}

## This finds documents whose mobile is explicitly null, but not documents that have no mobile field at all
POST /movies/_search
{
  "query": {
    "match": {
      "mobile": "NULL"
    }
  }
}

## copy_to
PUT movies
{
  "mappings": {
    "properties": {
      "firstName": {
        "type": "text",
        "copy_to": "fullName"
      },
      "lastName": {
        "type": "text",
        "copy_to": "fullName"
      }
    }
  }
}

PUT movies/_doc/1
{
  "firstName": "Liu",
  "lastName": "Chenglong"
}

## fullName can be searched directly even though the documents never contained such a field
## fullName does not appear in _source
POST movies/_search
{
  "query": {
    "match": {
      "fullName": "chenglong"
    }
  }
}

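There is no dedicated array type: any field can hold multiple values of the same type, so a field mapped as text accepts an array of strings directly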

PUT movies/_doc/1
{
  "firstName": "Liu",
  "lastName": "Chenglong"
}

PUT movies/_doc/3
{
  "firstName": "Liu",
  "lastName": ["Chenglong"]
}

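Multi-fields

Support exact matching on names

Add a keyword sub-field

Use different analyzers

Exact Value (no analysis needed)

Includes dates, numbers, and specific strings (Apple Store)

Full Text

The text type in ES

Character Filters

Process the text before it reaches the Tokenizer, for example by adding, removing, or replacing characters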

## Strips HTML tags from the text; useful for data collected by a web crawler
GET _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip"
  ],
  "text": "<b>hello world</b>"
}

## Replace characters
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "- => _"
      ]
    }
  ],
  "text": "hello-world"
}

## Tokenize by path hierarchy
GET _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "user/local/nginx/conf"
}

## Split on whitespace, then filter out stop words
## Only You and house. remain (the whitespace tokenizer keeps the trailing period)
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"], 
  "text": "You are in the house."
}

## Adding a lowercase filter converts the tokens to lowercase
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "stop",
    "lowercase"
  ],
  "text": "You are in the house."
}

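Aggregation

Bucket: a set of documents that satisfy a condition

Metric: mathematical computations over documents

Pipeline: aggregates the results of other aggregations

Matrix: operates on multiple fields and returns a result matrix

Bucket is similar to GROUP BY in SQL

Metric is similar to the aggregate functions in SQL

## Count by gender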
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "flight_dest": {
      "terms": {
        "field": "customer_gender"
      }
    }
  }
}

## Result
"buckets" : [
  {
    "key" : "FEMALE",
    "doc_count" : 2433
  },
  {
    "key" : "MALE",
    "doc_count" : 2242
  }
]

## Nest an aggregation inside each bucket (here: average price per day of week)
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "flight_dest": {
      "terms": {
        "field": "day_of_week"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "products.base_price"
          }
        }
      }
    }
  }
}
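
The SQL analogy can be spelled out in a few lines of Python: a terms bucket groups rows by a field value, and a nested avg metric is computed inside each group. This is a conceptual sketch only; `terms_with_avg` and the three sample rows are invented for illustration and say nothing about how Elasticsearch executes aggregations internally.

```python
from collections import defaultdict

# Hypothetical rows standing in for indexed documents.
orders = [
    {"day_of_week": "Monday", "base_price": 10.0},
    {"day_of_week": "Monday", "base_price": 30.0},
    {"day_of_week": "Tuesday", "base_price": 50.0},
]


def terms_with_avg(rows, bucket_field, metric_field):
    """Bucket rows by bucket_field, then compute count and average per bucket."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[bucket_field]].append(row[metric_field])
    return {
        key: {"doc_count": len(values), "avg": sum(values) / len(values)}
        for key, values in buckets.items()
    }


result = terms_with_avg(orders, "day_of_week", "base_price")
print(result)
```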

Queries

A term is the smallest unit that carries meaning

## Add a few documents
POST /product/_doc/1
{
  "productId":"XHDK-12-#f",
  "desc":"iPhone"
}
POST /product/_doc/2
{
  "productId":"BHDK-22-#f",
  "desc":"iPad"
}
POST /product/_doc/3
{
  "productId":"CHDK-32-#f",
  "desc":"MBP"
}

## term does not analyze the search input, but the indexed text is analyzed (iPhone => iphone),
## so searching for "iPhone" returns nothing; searching for "iphone" does match
POST /product/_search
{
  "query": {
    "term": {
      "desc": {
        "value": "iPhone"
      }
    }
  }
}

## Querying the keyword sub-field with the original value also matches
POST /product/_search
{
  "query": {
    "term": {
      "desc.keyword": {
        "value": "iPhone"
      }
    }
  }
}

## Analyze
POST /_analyze
{
  "analyzer": "standard",
  "text": ["iPhone"]
}

{
  "tokens" : [
    {
      "token" : "iphone",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
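
The difference between term and match can be mimicked in a few lines. The sketch assumes an analyzer that only lowercases and splits on whitespace, which is a rough stand-in for the standard analyzer; all function names are hypothetical.

```python
def standard_like_analyze(text: str) -> list[str]:
    """Rough stand-in for the standard analyzer: lowercase and split."""
    return text.lower().split()


# What ends up in the index for the text field when "iPhone" is indexed.
indexed_terms = set(standard_like_analyze("iPhone"))


def term_query(value: str) -> bool:
    """term: the input is looked up as-is, with no analysis."""
    return value in indexed_terms


def match_query(text: str) -> bool:
    """match: the input is analyzed first, then each term is looked up."""
    return any(t in indexed_terms for t in standard_like_analyze(text))


print(term_query("iPhone"))   # the index only holds the lowercased term
print(term_query("iphone"))
print(match_query("iPhone"))
```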

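## Turning the query into a filter (via constant_score) skips score calculation and avoids unnecessary overhead
## Filters can use the cache effectively, which speeds up repeated queries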
POST /product/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "desc.keyword": "iPhone"
        }
      },
      "boost": 1.2
    }
  }
}

Match Query / Match Phrase Query / Query String Query

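Text is analyzed both at index time and at search time; the query text is first analyzed into a list of terms, which are then looked up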

POST movies/_search
{
  "query": {
    "match": {
      "name": "chenglong"
    }
  }
}

Structured search

Dates, booleans, and numbers are all structured data

They can be queried with Term and Prefix queries

## Add some data
POST /product/_bulk
{ "index":{"_id":1}}
{"price":10,"avaliable":true,"date":"2020-05-22","productId":"XXX-1","tag":"one"}
{ "index":{"_id":2}}
{"price":20,"avaliable":false,"date":"2019-05-22","productId":"XXX-2","tag":["one","two"]}
{ "index":{"_id":3}}
{"price":30,"avaliable":false,"productId":"XXX-3"}

## term query on a boolean
POST /product/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "avaliable": true
        }
      }
    }
  }
}

## range query on a number
POST /product/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "price": {
            "gte": 10,
            "lte": 20
          }
        }
      }
    }
  }
}

## range query on a date; date math units:
## y year
## M month
## w week
## d day
## H/h hour
## m minute
## s second
POST /product/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "date": {
            "gte": "now-1y"
          }
        }
      }
    }
  }
}

## Use exists to find documents in which a field is present
POST /product/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "exists": {
          "field": "date"
        }
      }
    }
  }
}

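## term on a multi-valued field means "contains", not an exact match of the whole value
## This returns both documents: the one tagged one and the one tagged one, two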
POST /product/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "tag.keyword": "one"
        }
      }
    }
  }
}

## To return only the document whose tags are exactly one:
## add a tag_count field and combine it with a bool query
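
A possible shape for that request body is sketched below as a Python dict, so it can be printed and pasted into the console. It assumes every document was indexed with a numeric `tag_count` field holding the number of tags (the notes name the field but do not show it being populated), and `exact_single_tag_query` is a hypothetical helper.

```python
import json


def exact_single_tag_query(tag: str) -> dict:
    """Body for: contains this tag AND has exactly one tag in total."""
    return {
        "query": {
            "bool": {
                # filter clauses must match but do not contribute to the score
                "filter": [
                    {"term": {"tag.keyword": tag}},
                    {"term": {"tag_count": 1}},  # assumed field, see lead-in
                ]
            }
        }
    }


body = exact_single_tag_query("one")
print(json.dumps(body, indent=2))
```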

Relevance scoring

TF-IDF

BM25

Add "explain": true to a query to see in the results how each score was computed
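
For intuition, here is a sketch of the textbook BM25 score for a single term. It is written under assumptions: `bm25_term_score` is a hypothetical helper using the classic formula with the commonly cited defaults k1 = 1.2 and b = 0.75, and the exact constants and factors in a given Lucene or Elasticsearch version may differ, so the explain output is the authority for real numbers.

```python
import math


def bm25_term_score(freq, doc_len, avg_doc_len, num_docs, docs_with_term,
                    k1=1.2, b=0.75):
    """Classic BM25 contribution of one term to one document's score."""
    # Rare terms get a higher inverse document frequency.
    idf = math.log(1 + (num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Longer-than-average documents are penalized through b.
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Term frequency saturates instead of growing linearly.
    return idf * freq * (k1 + 1) / (freq + length_norm)


once = bm25_term_score(freq=1, doc_len=10, avg_doc_len=10, num_docs=100, docs_with_term=5)
twice = bm25_term_score(freq=2, doc_len=10, avg_doc_len=10, num_docs=100, docs_with_term=5)
long_doc = bm25_term_score(freq=1, doc_len=40, avg_doc_len=10, num_docs=100, docs_with_term=5)
common = bm25_term_score(freq=1, doc_len=10, avg_doc_len=10, num_docs=100, docs_with_term=80)
print(once, twice, long_doc, common)
```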

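bool Query

must: must match, contributes to the score

should: optional match, contributes to the score

must_not: must not match

filter: must match, does not contribute to the score

bool queries can be nested

Changing the nesting structure changes the scoring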

## boost adjusts how much a clause contributes to the score
## Changing the boost on tag and price changes the order of the results
POST /product/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "tag": {
              "query": "one",
              "boost": 1
            }
          }
        },
        {
          "match": {
            "price": {
              "query": "30",
              "boost": 1
            }
          }
        }
      ]
    }
  }
}

## A boosting query keeps documents matching positive and lowers the score of those that also match negative
POST /product/_search
{
  "query": {
    "boosting": {
      "positive": {
        "match": {
          "tag": "one"
        }
      },
      "negative": {
         "match": {
          "tag": "two"
        }
      },
      "negative_boost": 0.2
    }
  }
}

Single query string across multiple fields

POST /product/_bulk
{ "index":{"_id":1}}
{"title":"Quick brown rabbits","body":"Brown rabbits are commonly seen"}
{ "index":{"_id":2}}
{"title":"Keeping pets healthy","body":"My quick brown fox eats rabbits on a regular basis"}

POST /product/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": "Brown fox"
          }
        },
        {
          "match": {
            "body": "Brown fox"
          }
        }
      ]
    }
  }
}

POST /product/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {
          "match": {
            "title": "Quick fox"
          }
        },
        {
          "match": {
            "body": "Quick fox"
          }
        }
      ]
    }
  }
}

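## When documents come back with identical scores, add a tie_breaker coefficient to differentiate them
## tie_breaker is a float between 0 and 1
## 0 uses only the best-matching clause
## 1 treats all clauses as equally important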
POST /product/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {
          "match": {
            "title": "Quick pets"
          }
        },
        {
          "match": {
            "body": "Quick pets"
          }
        }
      ],
      "tie_breaker": 0.7
    }
  }
}
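
The effect of tie_breaker can be written as a one-line formula: the best clause score plus tie_breaker times the sum of the other clause scores. The sketch below applies it to made-up per-field scores; `dis_max_score` is a hypothetical helper, not an Elasticsearch API.

```python
def dis_max_score(clause_scores: list[float], tie_breaker: float = 0.0) -> float:
    """Best clause score, plus tie_breaker times the remaining clause scores."""
    best = max(clause_scores)
    return best + tie_breaker * (sum(clause_scores) - best)


# Hypothetical scores of the title clause and the body clause for one document.
scores = [2.0, 1.0]
print(dis_max_score(scores))                   # best field only
print(dis_max_score(scores, tie_breaker=0.7))
print(dis_max_score(scores, tie_breaker=1.0))  # same as summing all clauses
```

With tie_breaker at 1 the result equals the sum of all clause scores, which is how a bool should query over the same clauses would combine them.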

multi_match query

// LCLTODO: I don't fully understand this yet

POST /product/_search
{
  "query": {
    "multi_match": {
      "query": "brown",
      "fields": ["title","body"]
    }
  }
}

Chinese analyzers

hanlp

icu

pinyin

Search Template

Decouples the query definition from the code that calls it

## Create a search template
POST _scripts/queryProduct
{
  "script": {
    "lang": "mustache",
    "source": {
      "query": {
        "multi_match": {
          "query": "{{q}}",
          "fields": [
            "title"
          ]
        }
      }
    }
  }
}

GET _scripts/queryProduct

## Search using the template
POST product/_search/template
{
  "id":"queryProduct",
  "params": {
    "q":"pets"
  }
}

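Function Score Query

After the query runs, every matching document can be re-scored by one or more functions, and the results are sorted by the new score

Built-in score functions:

Weight: assigns each document a simple, non-normalized weight
Field Value Factor: uses the value of a field to modify _score
Random Score
Decay functions: use a field's value as the reference; the closer it is to a target value, the higher the score
Script Score: a custom script has full control over the scoring logic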
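
As a rough illustration of Field Value Factor, the sketch below multiplies a query score by a field value after applying an optional factor and modifier. It assumes the default behaviour of combining the function result with the query score by multiplication; `field_value_factor_score` and the `MODIFIERS` table are hypothetical and cover only a few of the available modifiers.

```python
import math

# A small subset of modifiers; log1p takes the common logarithm of 1 + value.
MODIFIERS = {
    "none": lambda v: v,
    "log1p": lambda v: math.log10(1 + v),
    "sqrt": math.sqrt,
}


def field_value_factor_score(query_score, field_value, factor=1.0, modifier="none"):
    """New score = query score * modifier(factor * field value)."""
    return query_score * MODIFIERS[modifier](factor * field_value)


# With the defaults, a pie priced at 8 multiplies its query score by 8.
print(field_value_factor_score(0.5, 8))
print(field_value_factor_score(0.5, 9, modifier="log1p"))
print(field_value_factor_score(0.5, 16, modifier="sqrt"))
```

Modifiers such as log1p or sqrt dampen the effect so that very large field values do not dominate the ranking.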

PUT shop/_doc/1
{
  "title": "Apple pie",
  "price": 8
}

PUT shop/_doc/2
{
  "title": "Orange pie",
  "price": 3
}

PUT shop/_doc/3
{
  "title": "Watermelon pie",
  "price": 6
}

POST /shop/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "e",
          "fields": "title"
        }
      },
      "field_value_factor": {
        "field": "price"
      }
    }
  }
}