elasticsearch搜索语法梳理

前言

elasticsearch的核心在搜索，搜索的核心在搜索语法，所以今天我们来梳理下elasticsearch的一些搜索语法，今天主要探讨搜索，主要包括两方面的内容，一方面是普通的query，也就是数据的检索，另一方面就是内容的聚合，也就是传统sql中的分组。

好了，下面我们就来看下这两种搜索查询的具体操作吧。

搜索语法

全文搜索

全文搜索的规则是会匹配凡是包括我们检索内容的任一单词，都会将结果予以展示，它并不关系单词顺序。对于下面我们的搜索表达式，它会搜索about中包括go和reading的所有内容。

搜索语法：

curl -X GET "localhost:9200/megacorp/employee/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "query" : {
        "match" : {
            "about" : "go reading"
        }
    }
}
'

返回结果：

7{
  "took" : 220,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 7,
      "relation" : "eq"
    },
    "max_score" : 0.9161128,
    "hits" : [
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 0.9161128,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 15,
          "about" : "I love to go reading",
          "interests" : [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "6",
        "_score" : 0.9161128,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to reading go",
          "interests" : [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "7",
        "_score" : 0.8500352,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to reading and go",
          "interests" : [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "8",
        "_score" : 0.8500352,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go and reading",
          "interests" : [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "5",
        "_score" : 0.6706225,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to reading",
          "interests" : [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "2",
        "_score" : 0.2761543,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to climbing go rock",
          "interests" : [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "3",
        "_score" : 0.2761543,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ]
        }
      }
    ]
  }
}

从结果中，我们可以看到，go reading和reading go这样的内容都被匹配到了，同时他们的匹配度都是相同的，当然他们也是匹配度最高的。而且检索内容还匹配到了包含go和reading的内容，但是包含reading匹配项的匹配度比包含go匹配项的匹配度要高。

目前还不清楚它这个匹配度是如何计算的，现在只要知道_score越高，表示匹配度越高就行了。

短语搜索

相比于全文搜索，短语搜索属于更精确的搜索。我们前面刚说过，全文搜索会将内容拆分之后进行搜索，属于更模糊的搜索，但是短语搜索必须匹配到完整的短语才算。所以对于下面的检索语句，它只会匹配包含go reading的内容：

curl -X GET "localhost:9200/megacorp/employee/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "query" : {
        "match_phrase" : {
            "about" : "go reading"
        }
    }
}
'

返回结果：

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.1097687,
    "hits" : [
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 1.1097687,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 15,
          "about" : "I love to go reading",
          "interests" : [
            "sports",
            "music"
          ]
        }
      }
    ]
  }
}

从结果我们可以看出来，最终结果只匹配到了包括go reading这个短语的内容，包含其中任一单词的并没有被匹配到。这也说明，短语匹配必须匹配完整短语

高亮搜索

高亮搜索简单来说，就是将我们检索到的内容进行高亮处理，默认情况下，会将检索到的内容加上em标签：

curl -X GET "localhost:9200/megacorp/employee/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    },
    "highlight": {
        "fields" : {
            "about" : {}
        }
    }
}
'

返回结果：

{
  "took" : 116,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.8434994,
    "hits" : [
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "4",
        "_score" : 1.8434994,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to rock climbing",
          "interests" : [
            "sports",
            "music"
          ]
        },
        "highlight" : {
          "about" : [
            "I love to <em>rock</em> <em>climbing</em>"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "3",
        "_score" : 1.7099125,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ]
        },
        "highlight" : {
          "about" : [
            "I love to go <em>rock</em> <em>climbing</em>"
          ]
        }
      }
    ]
  }
}

当然，高亮样式是支持自定义的：

curl -X GET "localhost:9200/megacorp/employee/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    },
    "highlight": {
         "pre_tags" : ["<p style="color:red">"],
        "post_tags" : ["</p>"],
        "fields" : {
            "about" : {}
        }
    }
}
'

返回结果：

{
 ...
  "hits" : {
    ...
    "hits" : [
      {
       ...
        },
        "highlight" : {
          "about" : [
            "I love to <p style="color:red">rock</p> <p style="color:red">climbing</p>"
          ]
        }
      },
      {
       ...
        },
        "highlight" : {
          "about" : [
            "I love to go <p style="color:red">rock</p> <p style="color:red">climbing</p>"
          ]
        }
      }
    ]
  }
}

数据分析

下面我们就来看下简单的数据统计，下面的表达式是按age进行分组，统计数据。这在es的专业术语叫聚合查询，类似于传统SQL中的group by，和传统的SQL很像：

curl -X GET "localhost:9200/megacorp/employee/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "all_interests": {
      "terms": { "field": "age" }
    }
  }
}
'

返回结果：

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 9,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 15,
          "about" : "I love to go reading",
          "interests" : [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to climbing go rock",
          "interests" : [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "4",
        "_score" : 1.0,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to rock climbing",
          "interests" : [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "5",
        "_score" : 1.0,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to reading",
          "interests" : [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "6",
        "_score" : 1.0,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to reading go",
          "interests" : [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "7",
        "_score" : 1.0,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to reading and go",
          "interests" : [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "8",
        "_score" : 1.0,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go and reading",
          "interests" : [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "9",
        "_score" : 1.0,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to do something",
          "interests" : [
            "sports",
            "music"
          ]
        }
      }
    ]
  },
  "aggregations" : {
    "all_interests" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 25,
          "doc_count" : 8
        },
        {
          "key" : 15,
          "doc_count" : 1
        }
      ]
    }
  }
}

从返回结果中，我们可以看出，age 为25的数据有8条，age为15的数据有1条。

query和aggs组合使用

这里的聚合查询aggs是可以和query同时存在的,，就和select name, count(*) from user group by name一样：

curl -X GET "localhost:9200/megacorp/employee/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "query" : {
        "match" : {
            "about" : "go reading"
        }
    },
    "aggs": {
    "all_interests": {
      "terms": { "field": "age" }
    }
  }
}
'

返回结果：

{
  "took" : 10,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 7,
      "relation" : "eq"
    },
    "max_score" : 1.1097689,
    "hits" : [
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 1.1097689,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 15,
          "about" : "I love to go reading",
          "interests" : [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "6",
        "_score" : 1.1097689,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to reading go",
          "interests" : [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "7",
        "_score" : 1.0293508,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to reading and go",
          "interests" : [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "8",
        "_score" : 1.0293508,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go and reading",
          "interests" : [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "5",
        "_score" : 0.7753851,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to reading",
          "interests" : [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "2",
        "_score" : 0.36634043,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to climbing go rock",
          "interests" : [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "3",
        "_score" : 0.36634043,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ]
        }
      }
    ]
  },
  "aggregations" : {
    "all_interests" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 25,
          "doc_count" : 6
        },
        {
          "key" : 15,
          "doc_count" : 1
        }
      ]
    }
  }
}

在返回结果中aggregations就是聚合查询的结果，buckets（桶）表示最后聚合结果，每个通（bucket）表示一条聚合结果。

总结

实话实说，elasticsearch的搜索语法确实还是很复杂的，截止到今天，按照官方文档内容的章节安排，搜索部分的入门内容已经结束了，但是我怎么觉得我像刚入门一样，还是对elasticsearch没有全面的认识。我看了下剩余的内容，咋感觉现在才算正式开始了——深入搜索、处理人类语言、聚合、地理位置、数据建模、监控等都没开始学习呢？

好吧，我承认以前对elasticsearch的认知太浅显了，它确实是一套独立的知识体系（而不是一门语言），从某种程度上说，elasticsearch重新定义了搜索，所以还是好好学习吧，干就对了！

今天周末，稍微放纵了下，刷了半天的剧，然后快到晚上才开始梳理相关内容，不过某种程度上我觉得我们还是比较自觉的，任务也算顺利完成了，明天得早点开始了。加油吧，少年！

最后，预告下明天要更新的内容，除了我们既定的elasticsearch之外，我还要完成7月份内容的更新，所以明天的内容还是比较多的。

好了，大家伙晚安吧！