Getting Started with Elasticsearch

1. Terminology

In Elasticsearch, the act of storing a document is called indexing. Compared with a traditional relational database, the analogy is:

Relational DB -> Databases -> Tables -> Rows -> Columns

Elasticsearch -> Indices -> Types -> Documents -> Fields

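In other words, Elasticsearch holds multiple indices (databases), each index can contain multiple types (tables), each type contains multiple documents (rows), and each document has multiple fields (columns). Note that mapping types were deprecated in Elasticsearch 7.0 and removed in 8.0, so the type level of this analogy, and the /index/type/id URLs used in this article, apply only to older versions.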

2. Indexing and Search Operations

Writing data

Let's look at an example. We write one document to Elasticsearch:

curl -XPOST https://es_endpoint/corporation/employee/1 -d '
{
  "first_name": "John",
  "last_name": "Smith",
  "age": 25,
  "about": "I love to go rock climbing",
  "interests": [ "sports", "music" ]
}' -H 'Content-Type: application/json'

Here es_endpoint is the Elasticsearch endpoint, corporation is the index, employee is the type, and 1 is the document id.
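
The URL layout just described can be sketched in Python as follows. This is only an illustration: document_url is a hypothetical helper name, not part of any Elasticsearch client, and https://es_endpoint is the same placeholder used in this article.

```python
from urllib.parse import quote

def document_url(endpoint: str, index: str, doc_type: str, doc_id: str) -> str:
    """Compose the URL of a single document: endpoint/index/type/id."""
    # Percent-encode each path segment so ids with spaces or slashes stay valid.
    parts = [quote(str(part), safe="") for part in (index, doc_type, doc_id)]
    return endpoint.rstrip("/") + "/" + "/".join(parts)

print(document_url("https://es_endpoint", "corporation", "employee", "1"))
```

The printed URL is the one used by the curl command above.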

Once the data is in Elasticsearch, we can retrieve it with the GET method, for example:

curl -XGET https://es_endpoint/corporation/employee/1

{
  "_index": "corporation",
  "_type": "employee",
  "_id": "1",
  "_version": 3,
  "_seq_no": 2,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "first_name": "Douglas",
    "last_name": "Fir",
    "age": 35,
    "about": "I like to build cabinets",
    "interests": [ "forestry" ]
  }
}

Elasticsearch is operated through HTTP methods: GET retrieves a document, POST or PUT writes (or updates) a document, DELETE deletes a document, and HEAD checks whether a document exists.
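
As a rough sketch of this mapping, the Python snippet below builds (but does not send) one request per operation using only the standard library. The action names in METHODS and the helper build_request are hypothetical, chosen for illustration; the URL is the placeholder from this article.

```python
import json
import urllib.request
from typing import Optional

# Placeholder endpoint, index, and type taken from this article.
BASE_URL = "https://es_endpoint/corporation/employee"

# Hypothetical action names mapped to the HTTP methods described above.
METHODS = {"get": "GET", "index": "PUT", "delete": "DELETE", "exists": "HEAD"}

def build_request(action: str, doc_id: str, body: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) the HTTP request for one document operation."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(
        f"{BASE_URL}/{doc_id}", data=data, headers=headers, method=METHODS[action]
    )

for action in ("get", "index", "delete", "exists"):
    body = {"first_name": "John"} if action == "index" else None
    req = build_request(action, "1", body)
    print(req.get_method(), req.full_url)
```

Passing such a request to urllib.request.urlopen would actually send it; that step is left out here because es_endpoint is a placeholder.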

Retrieving data

GET with an id returns a single document. To search for documents instead, replace the id with _search:

curl -XGET https://es_endpoint/corporation/employee/_search

Searching data

This request matches every document of type employee (by default, only the first 10 hits are returned). To search with a condition, use:

curl -XGET 'https://es_endpoint/corporation/employee/_search?q=first_name:Jane'

The result is:

{
  "took": 5,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "corporation",
        "_type": "employee",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "first_name": "Jane",
          "last_name": "Smith",
          "age": 32,
          "about": "I like to collect rock albums",
          "interests": [ "music" ]
        }
      }
    ]
  }
}
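
When the q parameter is built programmatically, the field:value pair needs to be URL-encoded. A minimal Python sketch, assuming the placeholder endpoint from this article (uri_search_url is a hypothetical helper name):

```python
from urllib.parse import urlencode

def uri_search_url(base_url: str, field: str, value: str) -> str:
    """Build a /_search URL that uses the q=field:value query-string form."""
    # urlencode percent-encodes the colon and turns spaces into '+'.
    return f"{base_url}/_search?" + urlencode({"q": f"{field}:{value}"})

print(uri_search_url("https://es_endpoint/corporation/employee", "first_name", "Jane"))
```

The colon is emitted as %3A, which the server decodes back to first_name:Jane.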

Searching with the Query DSL

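The query above only suits simple cases. Elasticsearch provides a richer and more flexible query language, the Query DSL (domain-specific language). Queries are sent as a JSON request body; for example, the simple query above can be rewritten as: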

curl -XGET https://es_endpoint/corporation/employee/_search -d '
{
  "query": {
    "match": {
      "first_name": "Jane"
    }
  }
}' -H 'Content-Type: application/json'

The result is:

{
  "took": 3,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "corporation",
        "_type": "employee",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "first_name": "Jane",
          "last_name": "Smith",
          "age": 32,
          "about": "I like to collect rock albums",
          "interests": [ "music" ]
        }
      }
    ]
  }
}
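
Because a DSL request body is plain JSON, it can be assembled as an ordinary dictionary and serialized. The sketch below shows this in Python; match_query is a hypothetical helper, not a function from an Elasticsearch client library.

```python
import json

def match_query(field: str, text: str) -> dict:
    """Return the request body for a match query on a single field."""
    return {"query": {"match": {field: text}}}

body = match_query("first_name", "Jane")
# The serialized string is what would be passed to curl's -d option.
print(json.dumps(body, indent=2))
```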

More complex searches

We add a filter to the query so that only employees older than 30 are kept:

curl -XGET https://es_endpoint/corporation/employee/_search -d '
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "age": { "gt": 30 }
        }
      },
      "must": {
        "match": {
          "last_name": "smith"
        }
      }
    }
  }
}' -H 'Content-Type: application/json'

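Here we use a filter to keep only documents whose age is greater than 30, and then match the documents whose last_name is smith. The filter clause only includes or excludes documents; it does not contribute to the relevance score.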
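
To make the combined effect of the two clauses concrete, here is a small in-memory imitation in Python. It does not call Elasticsearch; the three employee records are taken from the responses shown in this article, and the lowercase whitespace split is only a stand-in for real text analysis.

```python
# Employee records as they appear in the responses in this article.
EMPLOYEES = [
    {"first_name": "John", "last_name": "Smith", "age": 25},
    {"first_name": "Jane", "last_name": "Smith", "age": 32},
    {"first_name": "Douglas", "last_name": "Fir", "age": 35},
]

def matches(doc: dict, min_age_exclusive: int, last_name: str) -> bool:
    """Imitate bool { filter: range age gt N, must: match last_name }."""
    passes_filter = doc["age"] > min_age_exclusive  # "gt" is strictly greater than
    # Simplified analysis: lowercase both sides and compare whole words.
    passes_must = last_name.lower() in doc["last_name"].lower().split()
    return passes_filter and passes_must

result = [doc["first_name"] for doc in EMPLOYEES if matches(doc, 30, "smith")]
print(result)  # ['Jane']
```

John is excluded by the age filter and Douglas by the last_name match, so only Jane remains.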

Full-text search

In a full-text search, we can search the text of any field in a document, for example:

curl -XGET https://es_endpoint/corporation/employee/_search -d '
{
  "query": {
    "match": {
      "about": "rock climbing"
    }
  }
}' -H 'Content-Type: application/json'

The result is:

{
  "took": 9,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 2, "relation": "eq" },
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "corporation",
        "_type": "employee",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "first_name": "John",
          "last_name": "Smith",
          "age": 25,
          "about": "I love to go rock climbing",
          "interests": [ "sports", "music" ]
        }
      },
      {
        "_index": "corporation",
        "_type": "employee",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "first_name": "Jane",
          "last_name": "Smith",
          "age": 32,
          "about": "I like to collect rock albums",
          "interests": [ "music" ]
        }
      }
    ]
  }
}

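Both returned documents have a _score field, which indicates how relevant the document is to the query. Results are sorted by relevance in descending order. Our query was rock climbing, yet the second document, which contains only rock, was also returned, with a lower score than the first.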
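
The ordering can be illustrated with a deliberately simplified score: the number of query terms a document contains. This is not the formula Elasticsearch uses to compute _score; it is a toy model, with about texts copied from the responses in this article, that only shows why a document matching both terms ranks above one matching a single term.

```python
def tokenize(text: str) -> list:
    """Lowercase and split on whitespace (a stand-in for real analysis)."""
    return text.lower().split()

def toy_score(query: str, text: str) -> int:
    """Count how many query terms occur in the text."""
    doc_terms = set(tokenize(text))
    return sum(1 for term in tokenize(query) if term in doc_terms)

# "about" field of each document id, as shown in this article.
ABOUT = {
    "1": "I love to go rock climbing",
    "2": "I like to collect rock albums",
    "3": "I like to build cabinets",
}

scored = [(toy_score("rock climbing", text), doc_id) for doc_id, text in ABOUT.items()]
# Keep only documents that match at least one term, best match first.
ranked = sorted((pair for pair in scored if pair[0] > 0), reverse=True)
print(ranked)  # [(2, '1'), (1, '2')]
```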

Phrase search

The search above matched documents containing either rock or climbing. To match the exact phrase instead, change match to match_phrase:

curl -XGET https://es_endpoint/corporation/employee/_search -d '
{
  "query": {
    "match_phrase": {
      "about": "rock climbing"
    }
  }
}' -H 'Content-Type: application/json'
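
The difference between the two query types can be sketched in Python without a cluster: a phrase match requires the terms to appear next to each other and in order. The functions below are a simplified illustration under that assumption, not Elasticsearch's implementation.

```python
def tokenize(text: str) -> list:
    """Lowercase and split on whitespace (a stand-in for real analysis)."""
    return text.lower().split()

def contains_any_term(query: str, text: str) -> bool:
    """Roughly what match does by default: any query term may match."""
    doc_terms = set(tokenize(text))
    return any(term in doc_terms for term in tokenize(query))

def contains_phrase(phrase: str, text: str) -> bool:
    """Roughly what match_phrase does: terms must be adjacent and in order."""
    wanted, tokens = tokenize(phrase), tokenize(text)
    return any(
        tokens[i:i + len(wanted)] == wanted
        for i in range(len(tokens) - len(wanted) + 1)
    )

for about in ("I love to go rock climbing", "I like to collect rock albums"):
    print(about, contains_any_term("rock climbing", about), contains_phrase("rock climbing", about))
```

Both texts pass the any-term check, but only the first one contains the phrase.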

Highlighting

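Many applications highlight the keywords matched by a search so that users can see at a glance why a result matched. Elasticsearch supports this directly: add a highlight parameter to the request, for example: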

curl -XGET https://search-tangaws-5grg7m53kinfqf2mip6oq6woqm.cn-north-1.es.amazonaws.com.cn/corporation/employee/_search -d '
{
  "query": {
    "match_phrase": {
      "about": "rock climbing"
    }
  },
  "highlight": {
    "fields": {
      "about": {}
    }
  }
}' -H 'Content-Type: application/json'

The result is:

{
  "took": 44,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "corporation",
        "_type": "employee",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "first_name": "John",
          "last_name": "Smith",
          "age": 25,
          "about": "I love to go rock climbing",
          "interests": [ "sports", "music" ]
        },
        "highlight": {
          "about": [ "I love to go <em>rock</em> <em>climbing</em>" ]
        }
      }
    ]
  }
}

The response now contains a new field, highlight, which holds the matching text from the about field, with the matched words wrapped in <em></em> tags.
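
To show what the highlighter produces, here is a minimal Python imitation that wraps matched words in tags. Elasticsearch's default highlight tags are <em> and </em>; the function itself is a hypothetical simplification and ignores analysis, phrase positions, and fragmenting.

```python
import re

def highlight(text: str, query: str, pre: str = "<em>", post: str = "</em>") -> str:
    """Wrap every word of text that equals a query term (case-insensitive)."""
    terms = {term.lower() for term in query.split()}

    def wrap(match: re.Match) -> str:
        word = match.group(0)
        return f"{pre}{word}{post}" if word.lower() in terms else word

    return re.sub(r"\w+", wrap, text)

print(highlight("I love to go rock climbing", "rock climbing"))
# I love to go <em>rock</em> <em>climbing</em>
```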

3. Aggregations

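In data analysis scenarios, we need to compute statistics over documents. Elasticsearch provides a feature called aggregations, which lets us generate sophisticated analytics over the data. It is similar to GROUP BY in SQL, but more powerful.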

For example, to find the most popular interests among all employees:

curl -XGET https://search-tangaws-5grg7m53kinfqf2mip6oq6woqm.cn-north-1.es.amazonaws.com.cn/corporation/employee/_search -d '
{
  "aggs": {
    "all_interests": {
      "terms": { "field": "interests.keyword" }
    }
  }
}' -H 'Content-Type: application/json'

The result is as follows (the earlier part of the response is omitted; only the aggregation section is shown):

"aggregations": {
  "all_interests": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      { "key": "music", "doc_count": 2 },
      { "key": "forestry", "doc_count": 1 },
      { "key": "sports", "doc_count": 1 }
    ]
  }
}

We can see that two employees list music as an interest, while forestry and sports each have only one interested employee.
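
The same grouping can be reproduced locally to see what a terms aggregation counts. The Python sketch below is an in-memory analogy, not Elasticsearch code; the interests lists are taken from the three employee documents shown in this article, and ties are ordered by key here to mirror the bucket order in the response above.

```python
from collections import Counter

# "interests" field of each employee document in this article.
INTERESTS = [
    ["sports", "music"],  # John Smith
    ["music"],            # Jane Smith
    ["forestry"],         # Douglas Fir
]

# Count each interest once per document, like doc_count in a terms bucket.
counts = Counter(interest for doc in INTERESTS for interest in set(doc))

# Highest count first; equal counts ordered alphabetically by key.
buckets = [
    {"key": key, "doc_count": count}
    for key, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
]
print(buckets)
```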

References:

https://www.elastic.co/guide/en/elasticsearch/guide/master/index.html