Getting Started with Elasticsearch

1. Terminology

In Elasticsearch, the act of storing a document is called indexing. Compared with a traditional relational database, the analogy is:

Relational DB  -> Databases -> Tables -> Rows      -> Columns
Elasticsearch  -> Indices   -> Types  -> Documents -> Fields

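In other words, Elasticsearch contains multiple indices (databases); each index can contain multiple types (tables); each type holds multiple documents (rows); and each document has multiple fields (columns). Note that mapping types were deprecated in Elasticsearch 7.0 and removed in 8.0, so this analogy and the typed URLs in this article apply to older versions.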

2. Write and Search Operations

Writing data

Let's look at an example.

We put one document into ES:

curl -XPOST https://es_endpoint/corporation/employee/1 -d '

{

    "first_name" : "John",

    "last_name" :  "Smith",

    "age" :        25,

    "about" :      "I love to go rock climbing",

    "interests": [ "sports", "music" ]

}' -H 'Content-Type: application/json'

Here es_endpoint is the Elasticsearch endpoint, corporation is the index, employee is the type, and 1 is the id.
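To make the URL layout concrete, here is a minimal Python sketch that assembles and then takes apart a document URL. The helper name document_url is hypothetical (it is not part of any Elasticsearch client), and es_endpoint is the same placeholder used in the curl command above.

```python
from urllib.parse import quote, urlsplit

def document_url(endpoint, index, doc_type, doc_id):
    """Join endpoint, index, type and id into a document URL (hypothetical helper)."""
    # Percent-encode each path segment so unusual ids cannot break the path.
    parts = [quote(str(p), safe="") for p in (index, doc_type, doc_id)]
    return endpoint.rstrip("/") + "/" + "/".join(parts)

url = document_url("https://es_endpoint", "corporation", "employee", 1)
print(url)  # https://es_endpoint/corporation/employee/1

# Going the other way: the three path segments are index, type and id.
index, doc_type, doc_id = urlsplit(url).path.strip("/").split("/")
print(index, doc_type, doc_id)  # corporation employee 1
```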

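Once the data is in ES, we can fetch it with the GET method. Note that the sample output below shows "_version":3, meaning the document with id 1 had been written several times by the time the output was captured, which would explain why its _source differs from the document written above: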

curl -XGET https://es_endpoint/corporation/employee/1

{"_index":"corporation",

"_type":"employee",

"_id":"1",

"_version":3,

"_seq_no":2,

"_primary_term":1,

"found":true,

"_source":

{

    "first_name" :  "Douglas",

    "last_name" :   "Fir",

    "age" :         35,

    "about":        "I like to build cabinets",

    "interests":  [ "forestry" ]

}}

Elasticsearch operations are expressed as HTTP methods: GET retrieves a document, POST or PUT writes (or updates) a document, DELETE deletes a document, and HEAD checks whether a document exists.
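The mapping from HTTP verbs to document operations can be sketched with Python's standard library. This is only an illustration: the requests are built but never sent, the helper build_request is a hypothetical name, and es_endpoint remains a placeholder host.

```python
import json
from urllib.request import Request

BASE = "https://es_endpoint/corporation/employee"  # placeholder endpoint, as in the article

def build_request(method, doc_id, body=None):
    """Build (but do not send) an HTTP request for one document (hypothetical helper)."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return Request(f"{BASE}/{doc_id}", data=data, headers=headers, method=method)

employee = {
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": ["sports", "music"],
}

# One request per operation described above.
requests_by_action = {
    "index": build_request("PUT", 1, employee),
    "retrieve": build_request("GET", 1),
    "exists": build_request("HEAD", 1),
    "delete": build_request("DELETE", 1),
}

for action, req in requests_by_action.items():
    print(action, req.get_method(), req.full_url)
```

Passing any of these objects to urllib.request.urlopen would send it; that step is left out here because it needs a reachable cluster.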

Retrieving data

The GET method retrieves a single document by its id. To search for documents instead, replace the id with _search:

curl -XGET https://es_endpoint/corporation/employee/_search

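Searching data

The request above matches every document of type employee (by default only the first 10 hits are returned). For a conditional search, use: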

curl -XGET 'https://es_endpoint/corporation/employee/_search?q=first_name:Jane'

The result (assuming a second employee, Jane Smith, has already been indexed with id 2) is:

{"took":5,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":0.2876821,"hits":[{"_index":"corporation","_type":"employee","_id":"2","_score":0.2876821,"_source":

{

    "first_name" :  "Jane",

    "last_name" :   "Smith",

    "age" :         32,

    "about" :       "I like to collect rock albums",

    "interests":  [ "music" ]

}}]}}
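The q= parameter is an ordinary URL query string, so it can be built programmatically. The sketch below uses Python's urllib.parse; the function name uri_search_url is hypothetical. Note that urlencode percent-encodes the colon, which decodes back to the same field:value expression.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def uri_search_url(base, field, value):
    """Build a query-string search URL of the form .../_search?q=field:value."""
    return f"{base}/_search?" + urlencode({"q": f"{field}:{value}"})

url = uri_search_url("https://es_endpoint/corporation/employee", "first_name", "Jane")
print(url)  # https://es_endpoint/corporation/employee/_search?q=first_name%3AJane

# Decoding the query string recovers the original expression.
decoded = parse_qs(urlsplit(url).query)
print(decoded)  # {'q': ['first_name:Jane']}
```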

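DSL search

The query-string form above only suits simple searches. Elasticsearch also provides a richer and more flexible query language, the query DSL (Domain Specific Language), in which requests are expressed as JSON. The previous simple query can be rewritten as: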

curl -XGET https://es_endpoint/corporation/employee/_search -d '

{

    "query" : {

         "match" : {

             "first_name" : "Jane"

         }

    }

} ' -H 'Content-Type: application/json'

The result is:

{"took":3,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":0.2876821,"hits":[{"_index":"corporation","_type":"employee","_id":"2","_score":0.2876821,"_source":

{

    "first_name" :  "Jane",

    "last_name" :   "Smith",

    "age" :         32,

    "about" :       "I like to collect rock albums",

    "interests":  [ "music" ]

}}]}}
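Because both the DSL request and the response are plain JSON, they map directly onto Python dictionaries. The sketch below builds the match body and reads the hit count and names from an abridged copy of the response shown above; match_query is a hypothetical helper name, and no request is actually sent.

```python
import json

def match_query(field, text):
    """Return a query DSL body for a single-field match query (hypothetical helper)."""
    return {"query": {"match": {field: text}}}

body = json.dumps(match_query("first_name", "Jane"))
print(body)  # {"query": {"match": {"first_name": "Jane"}}}

# Abridged copy of the response shown above, used here instead of a live call.
response_text = (
    '{"took":3,"timed_out":false,'
    '"hits":{"total":{"value":1,"relation":"eq"},"max_score":0.2876821,'
    '"hits":[{"_index":"corporation","_type":"employee","_id":"2",'
    '"_score":0.2876821,'
    '"_source":{"first_name":"Jane","last_name":"Smith","age":32}}]}}'
)
response = json.loads(response_text)

total = response["hits"]["total"]["value"]
names = [hit["_source"]["first_name"] for hit in response["hits"]["hits"]]
print(total, names)  # 1 ['Jane']
```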

More complex searches

Let's add a filter to the query so that only employees older than 30 are kept:

curl -XGET https://es_endpoint/corporation/employee/_search -d '

{

    "query" : {

        "bool" : {

            "filter" : {

                "range" : {

                    "age" : { "gt" : 30 }

                }

            },

            "must" : {

                "match" : {

                    "last_name" : "smith"

                }

            }

        }

    }

} ' -H 'Content-Type: application/json'

Here we use a filter to keep only the documents whose age is greater than 30, and then match the documents whose last_name is smith.
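The combined effect of filter and must can be illustrated by evaluating the same query body against the three sample employees that appear in this article's outputs. This is a toy, in-memory emulation written for illustration only: it handles just the gt range and a lowercase whole-word match, and it says nothing about how Elasticsearch analyzes or scores text internally.

```python
EMPLOYEES = [
    {"first_name": "John", "last_name": "Smith", "age": 25},
    {"first_name": "Jane", "last_name": "Smith", "age": 32},
    {"first_name": "Douglas", "last_name": "Fir", "age": 35},
]

QUERY = {
    "query": {
        "bool": {
            "filter": {"range": {"age": {"gt": 30}}},
            "must": {"match": {"last_name": "smith"}},
        }
    }
}

def passes_range(doc, clause):
    """Toy check for a single-field range clause that only supports 'gt'."""
    (field, bounds), = clause.items()
    return doc[field] > bounds["gt"]

def passes_match(doc, clause):
    """Toy check: case-insensitive whole-word match on a single field."""
    (field, text), = clause.items()
    return text.lower() in doc[field].lower().split()

def run_bool(docs, query):
    """Keep documents that satisfy both the filter and the must clause."""
    bool_q = query["query"]["bool"]
    return [
        d for d in docs
        if passes_range(d, bool_q["filter"]["range"])
        and passes_match(d, bool_q["must"]["match"])
    ]

matches = run_bool(EMPLOYEES, QUERY)
print([d["first_name"] for d in matches])  # ['Jane']
```

John is excluded by the age filter and Douglas by the last_name match, so only Jane remains.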

Full-text search

In a full-text search we can query the text of any field in the document, for example:

curl https://es_endpoint/corporation/employee/_search -d '

{

    "query" : {

        "match" : {

            "about" : "rock climbing"

        }

    }

} ' -H 'Content-Type: application/json'

The result is:

{"took":9,

"timed_out":false,

"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},

"hits":{"total":{"value":2,"relation":"eq"},

"max_score":0.5753642,

"hits":[

{"_index":"corporation",

 "_type":"employee",

 "_id":"1",

 "_score":0.5753642,

 "_source":

{

    "first_name" : "John",

    "last_name" :  "Smith",

    "age" :        25,

    "about" :      "I love to go rock climbing",

    "interests": [ "sports", "music" ]

}},

{"_index":"corporation",

 "_type":"employee",

 "_id":"2",

 "_score":0.2876821,

 "_source":

{

    "first_name" :  "Jane",

    "last_name" :   "Smith",

    "age" :         32,

    "about" :       "I like to collect rock albums",

    "interests":  [ "music" ]

}}]}}

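Both returned documents carry a _score field, which expresses how relevant the document is to the query. Results are sorted by relevance in descending order. Although the query was rock climbing, the second document, which contains only rock, is also returned, but with a lower relevance score than the first.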
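The ordering can be pictured with a toy term-overlap ranking over the about fields of the sample employees. This is deliberately not the scoring formula Elasticsearch uses; it only counts how many query terms each text contains, which is enough to reproduce the order of the two hits above.

```python
ABOUT = {
    "John": "I love to go rock climbing",
    "Jane": "I like to collect rock albums",
    "Douglas": "I like to build cabinets",
}

def tokenize(text):
    """Very rough stand-in for text analysis: lowercase and split on whitespace."""
    return text.lower().split()

def matched_terms(query_text, field_text):
    """Return the query terms that also occur in the field text."""
    field_tokens = set(tokenize(field_text))
    return [t for t in tokenize(query_text) if t in field_tokens]

# Rank by number of matched terms, highest first, and drop non-matching documents.
ranked = sorted(
    ((name, matched_terms("rock climbing", about)) for name, about in ABOUT.items()),
    key=lambda pair: len(pair[1]),
    reverse=True,
)
hits = [(name, terms) for name, terms in ranked if terms]
print(hits)  # [('John', ['rock', 'climbing']), ('Jane', ['rock'])]
```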

Phrase search

The search above matched rock climbing loosely, term by term. To match the exact phrase instead, change match to match_phrase:

curl -XGET https://es_endpoint/corporation/employee/_search -d '

{

    "query" : {

        "match_phrase" : {

            "about" : "rock climbing"

        }

    }

} ' -H 'Content-Type: application/json'
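The difference from match can be sketched as a check that the query words appear next to each other and in order. The function below is a simplified illustration (whitespace tokenization, no analysis), not the way Elasticsearch implements match_phrase.

```python
def contains_phrase(phrase, text):
    """Return True if the words of `phrase` occur consecutively, in order, in `text`."""
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    # Slide a window of n words across the text and compare it with the phrase.
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

print(contains_phrase("rock climbing", "I love to go rock climbing"))     # True
print(contains_phrase("rock climbing", "I like to collect rock albums"))  # False
```

Under this check only the first sample document qualifies, which is why a phrase query is expected to return a single hit where match returned two.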

Highlighted search

Many applications need to highlight the keywords matched by a search so that users can see at a glance where the query matched. Elasticsearch supports this directly: add a highlight parameter to the request, for example:

curl -XGET https://es_endpoint/corporation/employee/_search -d '

{

    "query" : {

        "match_phrase" : {

            "about" : "rock climbing"

        }

    },

    "highlight": {

        "fields" : {

            "about" : {}

        }

    }

}' -H 'Content-Type: application/json'

The result is:

{"took":44,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":0.5753642,"hits":[{"_index":"corporation","_type":"employee","_id":"1","_score":0.5753642,"_source":

{

    "first_name" : "John",

    "last_name" :  "Smith",

    "age" :        25,

    "about" :      "I love to go rock climbing",

    "interests": [ "sports", "music" ]

},"highlight":{"about":["I love to go <em>rock</em> <em>climbing</em>"]}}]}}

The result contains a new "highlight" section, which holds the matched text from the about field, with each matched word wrapped in <em></em> tags.
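The shape of the highlighted fragment can be reproduced with a small regular-expression sketch. It is only meant to show what wrapping each matched word looks like; the function name highlight and its pre/post parameters are hypothetical, and the real highlighter works on analyzed text rather than on a regex.

```python
import re

def highlight(text, query_text, pre="<em>", post="</em>"):
    """Wrap every whole-word occurrence of a query term in pre/post tags."""
    terms = [re.escape(t) for t in query_text.split()]
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{pre}{m.group(0)}{post}", text)

fragment = highlight("I love to go rock climbing", "rock climbing")
print(fragment)  # I love to go <em>rock</em> <em>climbing</em>
```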

3. Aggregations

In data analysis scenarios we often need statistics over documents. Elasticsearch provides aggregations for this, which let us build complex analytics over the data. The feature is similar to GROUP BY in SQL, but more powerful.

For example, to find the most common interests among all employees:

curl -XGET https://es_endpoint/corporation/employee/_search -d '

{

  "aggs": {

    "all_interests": {

      "terms": { "field": "interests.keyword" }

    }

  }

}' -H 'Content-Type: application/json'

The result is as follows (the leading part of the response is omitted; only the aggregation section is shown):

"aggregations": {

            "all_interests": {

                  "doc_count_error_upper_bound": 0,

                  "sum_other_doc_count": 0,

                  "buckets": [{

                        "key": "music",

                        "doc_count": 2

                  }, {

                        "key": "forestry",

                        "doc_count": 1

                  }, {

                        "key": "sports",

                        "doc_count": 1

                  }]

            }

      }

Two employees list music as an interest, while forestry and sports each have only one interested employee.
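The bucket counts above can be reproduced locally with collections.Counter, which makes the GROUP BY analogy concrete. This is a sketch over the three sample employees from this article, not a description of how the terms aggregation is computed; ties are ordered alphabetically here simply to match the output shown above.

```python
from collections import Counter

INTERESTS = {
    "John": ["sports", "music"],
    "Jane": ["music"],
    "Douglas": ["forestry"],
}

def terms_buckets(docs):
    """Count, per interest, how many documents mention it, most common first."""
    counts = Counter()
    for interests in docs.values():
        # Use a set so a document counts at most once per key.
        counts.update(set(interests))
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered]

buckets = terms_buckets(INTERESTS)
for bucket in buckets:
    print(bucket)
# {'key': 'music', 'doc_count': 2}
# {'key': 'forestry', 'doc_count': 1}
# {'key': 'sports', 'doc_count': 1}
```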

References:

https://www.elastic.co/guide/en/elasticsearch/guide/master/index.html

Original article: https://www.cnblogs.com/zackstang/p/12021845.html