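Common Elasticsearch Aggregation Operations

Introduction to Aggregations

Note: this article uses Elasticsearch and Kibana 7.10.1. If you run a different version, check the official documentation, since behavior may differ.

The aggregations framework provides aggregated data based on a search query. It is built from simple building blocks called aggregations, which can be composed to produce complex summaries of the data.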

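Elasticsearch groups aggregations into three categories:

  • Metric (metric aggregations)

    Aggregations that compute metrics from field values, such as the maximum, minimum, sum, and average.

  • Bucket (bucket aggregations)

    Aggregations that group documents into buckets (also called bins) based on field values, ranges, or other criteria, similar to GROUP BY in a relational database.

  • Pipeline (pipeline aggregations)

    Aggregations that take their input from other aggregations instead of from documents or fields.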
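
As a rough illustration of how the three categories relate, here is a minimal Python sketch over a small in-memory list. The documents and variable names are invented for this example and are not part of Elasticsearch; real aggregations run inside the cluster.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical in-memory documents (not the tutorial's data set).
docs = [
    {"city": "A", "age": 20},
    {"city": "A", "age": 30},
    {"city": "B", "age": 28},
]

# Metric aggregation: compute one number from field values.
avg_age = mean(d["age"] for d in docs)

# Bucket aggregation: group documents by a field value, like GROUP BY.
buckets = defaultdict(list)
for d in docs:
    buckets[d["city"]].append(d)

# A metric computed inside each bucket (a sub-aggregation).
avg_per_city = {city: mean(d["age"] for d in members)
                for city, members in buckets.items()}

# Pipeline-style step: the input is the bucket results, not the documents.
max_city_avg = max(avg_per_city.values())

print(avg_age, avg_per_city, max_city_avg)
```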

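Aggregations summarize data as metrics, statistics, or other analytics. They help answer questions such as:

  • What is the average load time of my website?
  • Who are my most valuable customers based on transaction volume?
  • What would be considered a large file on my network?
  • How many products are in each product category?

Data Preparation

Create the index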

DELETE twitter

PUT twitter
{
    "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 1
    }, 
    "mappings": {
        "properties": {
            "birthday": {
                "type": "date"
            },
            "address": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            },
            "age": {
                "type": "long"
            },
            "city": {
                "type": "keyword"
            },
            "country": {
                "type": "keyword"
            },
            "location": {
                "type": "geo_point"
            },
            "message": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            },
            "province": {
                "type": "keyword"
            },
            "uid": {
                "type": "long"
            },
            "user": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            }
        }
    }
}

Import the data

Use the Bulk API to import the data into Elasticsearch:

POST _bulk
{"index":{"_index":"twitter","_id":1}}
{"user":"张三","message":"今儿天气不错啊,出去转转去","uid":2,"age":20,"city":"北京","province":"北京","country":"中国","address":"中国北京市海淀区","location":{"lat":"39.970718","lon":"116.325747"}, "birthday": "1999-04-01"}
{"index":{"_index":"twitter","_id":2}}
{"user":"老刘","message":"出发,下一站云南!","uid":3,"age":22,"city":"北京","province":"北京","country":"中国","address":"中国北京市东城区台基厂三条3号","location":{"lat":"39.904313","lon":"116.412754"}, "birthday": "1997-04-01"}
{"index":{"_index":"twitter","_id":3}}
{"user":"李四","message":"happy birthday!","uid":4,"age":25,"city":"北京","province":"北京","country":"中国","address":"中国北京市东城区","location":{"lat":"39.893801","lon":"116.408986"}, "birthday": "1994-04-01"}
{"index":{"_index":"twitter","_id":4}}
{"user":"老贾","message":"123,gogogo","uid":5,"age":30,"city":"北京","province":"北京","country":"中国","address":"中国北京市朝阳区建国门","location":{"lat":"39.718256","lon":"116.367910"}, "birthday": "1989-04-01"}
{"index":{"_index":"twitter","_id":5}}
{"user":"老王","message":"Happy BirthDay My Friend!","uid":6,"age":26,"city":"北京","province":"北京","country":"中国","address":"中国北京市朝阳区国贸","location":{"lat":"39.918256","lon":"116.467910"}, "birthday": "1993-04-01"}
{"index":{"_index":"twitter","_id":6}}
{"user":"老吴","message":"好友来了都今天我生日,好友来了,什么 birthday happy 就成!","uid":7,"age":28,"city":"上海","province":"上海","country":"中国","address":"中国上海市闵行区","location":{"lat":"31.175927","lon":"121.383328"}, "birthday": "1991-04-01"}
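
The Bulk API body is newline-delimited JSON: an action line followed by a source line for each document, ending with a newline. A minimal sketch of generating such a body in Python follows; the helper name build_bulk_body is hypothetical, and only two abbreviated documents are used.

```python
import json

def build_bulk_body(index, docs):
    """Return an NDJSON string: one action line plus one source line per doc."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        # ensure_ascii=False keeps the Chinese text readable in the payload.
        lines.append(json.dumps(source, ensure_ascii=False))
    # The Bulk API expects the body to end with a newline character.
    return "\n".join(lines) + "\n"

body = build_bulk_body("twitter", [
    (1, {"user": "张三", "age": 20, "city": "北京"}),
    (6, {"user": "老吴", "age": 28, "city": "上海"}),
])
print(body)
```

The resulting string could be sent as the request body of POST _bulk with the Content-Type header set to application/x-ndjson.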

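Note: not every field can be used in an aggregation. Generally, fields of type keyword, numeric, date, and similar types are aggregatable, while text fields are not aggregatable by default.

We can use the _field_caps API to check whether a field can be aggregated: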

GET twitter/_field_caps?fields=message,age,province,city.keyword

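The result shows that all four fields are searchable, but only age and city.keyword are aggregatable. (This output appears to come from an index in which province and city were mapped as text with a keyword sub-field; with the explicit mapping above, province and city are keyword fields and are aggregatable themselves.)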

{
  "indices" : [
    "twitter"
  ],
  "fields" : {
    "province" : {
      "text" : {
        "type" : "text",
        "searchable" : true,
        "aggregatable" : false
      }
    },
    "message" : {
      "text" : {
        "type" : "text",
        "searchable" : true,
        "aggregatable" : false
      }
    },
    "city.keyword" : {
      "keyword" : {
        "type" : "keyword",
        "searchable" : true,
        "aggregatable" : true
      }
    },
    "age" : {
      "long" : {
        "type" : "long",
        "searchable" : true,
        "aggregatable" : true
      }
    }
  }
}

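searchable

Whether this field is indexed for search on all indices.

aggregatable

Whether this field can be aggregated on all indices.

indices

The list of indices where this field has the same type family, or null if all indices have the same type family for the field.

non_searchable_indices

The list of indices where this field is not searchable, or null if all indices have the same definition for the field.

non_aggregatable_indices

The list of indices where this field is not aggregatable, or null if all indices have the same definition for the field.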
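
To pull the aggregatable field names out of a _field_caps response programmatically, something like the following sketch could work. The helper name aggregatable_fields is hypothetical, and the response dict is a trimmed copy of the output shown earlier.

```python
def aggregatable_fields(field_caps_response):
    """Return the names of fields that are aggregatable under every reported type."""
    names = []
    for name, caps_by_type in field_caps_response["fields"].items():
        if all(caps.get("aggregatable") for caps in caps_by_type.values()):
            names.append(name)
    return sorted(names)

# Trimmed copy of the _field_caps output from this article.
response = {
    "indices": ["twitter"],
    "fields": {
        "province": {"text": {"type": "text", "searchable": True, "aggregatable": False}},
        "message": {"text": {"type": "text", "searchable": True, "aggregatable": False}},
        "city.keyword": {"keyword": {"type": "keyword", "searchable": True, "aggregatable": True}},
        "age": {"long": {"type": "long", "searchable": True, "aggregatable": True}},
    },
}

print(aggregatable_fields(response))
```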

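Aggregation Syntax

"aggregations" : {
    "<aggregation_name>" : { <!-- name of the aggregation -->
        "<aggregation_type>" : { <!-- type of the aggregation -->
            <aggregation_body> <!-- aggregation body: which fields to aggregate and how -->
        }
        [,"meta" : {  [<meta_data_body>] } ]? <!-- optional metadata -->
        [,"aggregations" : { [<sub_aggregation>]+ } ]? <!-- optional sub-aggregations defined inside this aggregation -->
    }
    [,"<aggregation_name_2>" : { ... } ]* <!-- additional aggregations at the same level -->
}

The aggregations key can be abbreviated as aggs.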
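
The nesting described by this syntax can be built up programmatically. The sketch below uses a hypothetical helper named agg to assemble a request body as a plain Python dict; it only constructs JSON and does not contact a cluster.

```python
import json

def agg(agg_type, body, sub_aggs=None):
    """Build one aggregation node: {<type>: <body>} plus optional nested aggs."""
    node = {agg_type: body}
    if sub_aggs:
        node["aggs"] = sub_aggs
    return node

request = {
    "size": 0,  # return only aggregation results, no hits
    "aggs": {
        "age_range": agg(
            "range",
            {"field": "age", "ranges": [{"from": 20, "to": 25}]},
            sub_aggs={"age_avg": agg("avg", {"field": "age"})},
        )
    },
}

print(json.dumps(request, indent=2))
```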

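Metric Aggregations

Avg, Sum, Max, and Min aggregations

Avg Aggregation: a single-value metrics aggregation that computes the average of the numeric values extracted from the aggregated documents.

Sum Aggregation: a single-value metrics aggregation that sums up the numeric values extracted from the aggregated documents.

Max Aggregation: a single-value metrics aggregation that keeps track of and returns the maximum of the numeric values extracted from the aggregated documents.

Min Aggregation: a single-value metrics aggregation that keeps track of and returns the minimum of the numeric values extracted from the aggregated documents.

These values can be extracted from a specific numeric field in the documents or generated by a provided script.

Compute the average, sum, maximum, and minimum of age over the documents in the twitter index: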

GET twitter/_search?size=0
{
  "aggs": {
    "age_avg": {
      "avg": {
        "field": "age"
      }
    },
    "age_sum":{
      "sum": {
        "field": "age"
      }
    },
    "age_max":{
      "max": {
        "field": "age"
      }
    },
    "age_min":{
      "min": {
        "field": "age"
      }
    }
  }
}

Response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 6,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "age_sum" : {
      "value" : 151.0
    },
    "age_min" : {
      "value" : 20.0
    },
    "age_avg" : {
      "value" : 25.166666666666668
    },
    "age_max" : {
      "value" : 30.0
    }
  }
}

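Stats aggregation

A multi-value metrics aggregation that computes statistics over the numeric values extracted from the aggregated documents.

The returned statistics are count, min, max, avg, and sum.

Compute age statistics for the documents whose city is 北京: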

GET twitter/_search?size=0
{
  "query": {
    "match": {
      "city": "北京"
    }
  }, 
  "aggs": {
    "age_stats": {
      "stats": {
        "field": "age"
      }
    }
  }
}

Response:

{
    "took" : 0,
    "timed_out" : false,
    "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
    },
    "hits" : {
        "total" : {
            "value" : 5,
            "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
    },
    "aggregations" : {
        "age_stats" : {
            "count" : 5,
            "min" : 20.0,
            "max" : 30.0,
            "avg" : 24.6,
            "sum" : 123.0
        }
    }
}
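
The numbers in this response can be cross-checked by hand. The following sketch recomputes the same five statistics in plain Python from the (city, age) pairs of the sample data; the function name stats is just an illustrative choice.

```python
# (city, age) pairs taken from the six sample documents.
docs = [("北京", 20), ("北京", 22), ("北京", 25),
        ("北京", 30), ("北京", 26), ("上海", 28)]

def stats(values):
    """Compute count, min, max, avg and sum, mirroring the stats aggregation output."""
    values = list(values)
    total = sum(values)
    return {
        "count": len(values),
        "min": float(min(values)),
        "max": float(max(values)),
        "avg": total / len(values),
        "sum": float(total),
    }

# Equivalent of the match query on city followed by the stats aggregation.
age_stats = stats(age for city, age in docs if city == "北京")
print(age_stats)
```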

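Bucket Aggregations

Range aggregation (multi-bucket)

A multi-bucket value source based aggregation that lets you define a set of ranges, each representing a bucket. During aggregation, the value extracted from each document is checked against each range, and the document is placed in every bucket whose range it matches.

Note: for each range, this aggregation includes the from value and excludes the to value.

Split age into segments and count the users in each age range: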

GET twitter/_search
{
    "size": 0, 
    "aggs": {
        "age_range": {
            "range": {
                "field": "age",
                "ranges": [
                    {
                        "from": 20,
                        "to": 22
                    },
                    {
                        "from": 22,
                        "to": 25
                    },
                    {
                        "from": 25,
                        "to": 30
                    }
                ]
            }
        }
    }
}

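The query above uses a range aggregation to define the age segments and returns one bucket per segment. Because we only care about the aggregation, not the matching documents, we set size to 0 to suppress the hits. Note that the document with age 30 falls into no bucket, because the to value of the last range is exclusive. The output is: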

{
    "took" : 2,
    "timed_out" : false,
    "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
    },
    "hits" : {
        "total" : {
            "value" : 6,
            "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
    },
    "aggregations" : {
        "age_range" : {
            "buckets" : [
                {
                    "key" : "20.0-22.0",
                    "from" : 20.0,
                    "to" : 22.0,
                    "doc_count" : 1
                },
                {
                    "key" : "22.0-25.0",
                    "from" : 22.0,
                    "to" : 25.0,
                    "doc_count" : 1
                },
                {
                    "key" : "25.0-30.0",
                    "from" : 25.0,
                    "to" : 30.0,
                    "doc_count" : 3
                }
            ]
        }
    }
}
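
The half-open behavior (from included, to excluded) is easy to reproduce locally. This sketch, with the hypothetical function name range_buckets, applies the same three ranges to the six sample ages and also lists the values that land in no bucket.

```python
ages = [20, 22, 25, 30, 26, 28]
ranges = [(20, 22), (22, 25), (25, 30)]

def range_buckets(values, ranges):
    """Count values per range using from <= value < to."""
    buckets = []
    for lo, hi in ranges:
        count = sum(1 for v in values if lo <= v < hi)
        buckets.append({"key": f"{float(lo)}-{float(hi)}", "doc_count": count})
    return buckets

buckets = range_buckets(ages, ranges)

# Values not covered by any range; 30 is excluded by the last range's upper bound.
unbucketed = [v for v in ages if not any(lo <= v < hi for lo, hi in ranges)]

print(buckets, unbucketed)
```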

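Sub-aggregation

A sub-aggregation is an aggregation nested inside another aggregation.

Inside the range aggregation we can add sub-aggregations that compute the average, minimum, and maximum age of each bucket: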

GET twitter/_search
{
    "size": 0, 
    "aggs": {
        "age_range": {
            "range": {
                "field": "age",
                "ranges": [
                    {
                        "from": 20,
                        "to": 22
                    },
                    {
                        "from": 22,
                        "to": 25
                    },
                    {
                        "from": 25,
                        "to": 30
                    }
                ]
            },
            "aggs": {
                "age_avg": {
                    "avg": {
                        "field": "age"
                    }
                },
                "age_min":{
                    "min": {
                        "field": "age"
                    }
                },
                "age_max":{
                    "max": {
                        "field": "age"
                    }
                }
            }
        }
    }
}

The result of the query above is:

{
    "took" : 1,
    "timed_out" : false,
    "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
    },
    "hits" : {
        "total" : {
            "value" : 6,
            "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
    },
    "aggregations" : {
        "age_range" : {
            "buckets" : [
                {
                    "key" : "20.0-22.0",
                    "from" : 20.0,
                    "to" : 22.0,
                    "doc_count" : 1,
                    "age_min" : {
                        "value" : 20.0
                    },
                    "age_avg" : {
                        "value" : 20.0
                    },
                    "age_max" : {
                        "value" : 20.0
                    }
                },
                {
                    "key" : "22.0-25.0",
                    "from" : 22.0,
                    "to" : 25.0,
                    "doc_count" : 1,
                    "age_min" : {
                        "value" : 22.0
                    },
                    "age_avg" : {
                        "value" : 22.0
                    },
                    "age_max" : {
                        "value" : 22.0
                    }
                },
                {
                    "key" : "25.0-30.0",
                    "from" : 25.0,
                    "to" : 30.0,
                    "doc_count" : 3,
                    "age_min" : {
                        "value" : 25.0
                    },
                    "age_avg" : {
                        "value" : 26.333333333333332
                    },
                    "age_max" : {
                        "value" : 28.0
                    }
                }
            ]
        }
    }
}
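
Conceptually, a sub-aggregation runs once per parent bucket, over only that bucket's documents. A small Python sketch of that idea, using the sample ages and a hypothetical function name range_with_metrics:

```python
from statistics import mean

ages = [20, 22, 25, 30, 26, 28]
ranges = [(20, 22), (22, 25), (25, 30)]

def range_with_metrics(values, ranges):
    """Bucket values by range, then compute min/avg/max inside each bucket."""
    result = []
    for lo, hi in ranges:
        members = [v for v in values if lo <= v < hi]
        bucket = {"from": lo, "to": hi, "doc_count": len(members)}
        if members:  # metrics are only meaningful for non-empty buckets
            bucket.update(
                age_min=min(members),
                age_avg=mean(members),
                age_max=max(members),
            )
        result.append(bucket)
    return result

for bucket in range_with_metrics(ages, ranges):
    print(bucket)
```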

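Filters aggregation (multi-bucket)

The filters aggregation defines a multi-bucket aggregation in which each bucket is associated with a filter. Each bucket collects all documents that match its filter.

Above we used range to split the data into buckets, but that approach only suits numeric fields. We can use the filters aggregation to build buckets on non-numeric fields.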

GET twitter/_search
{
    "size": 0,
    "aggs": {
        "city_filters": {
            "filters": {
                "filters": {
                    "beijing": {
                        "match":{
                            "city":"北京"
                        }
                    },
                    "shanghai":{
                        "match":{
                            "city":"上海"
                        }
                    }
                }
            }
        }
    }
}

The result shows five documents for 北京 and one for 上海, and each bucket is keyed by the name of its filter:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 6,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "city_filters" : {
      "buckets" : {
        "beijing" : {
          "doc_count" : 5
        },
        "shanghai" : {
          "doc_count" : 1
        }
      }
    }
  }
}
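
One way to picture the filters aggregation is as a dictionary of named predicates, each producing its own bucket. The sketch below mimics that with Python lambdas; the function name filters_agg and the predicate representation are assumptions made for illustration only.

```python
# City of each of the six sample documents.
docs = [{"city": c} for c in ["北京", "北京", "北京", "北京", "北京", "上海"]]

# Named filters, analogous to the "beijing" and "shanghai" entries in the request.
filters = {
    "beijing": lambda doc: doc["city"] == "北京",
    "shanghai": lambda doc: doc["city"] == "上海",
}

def filters_agg(docs, filters):
    """Return one named bucket per filter with the count of matching docs."""
    return {
        name: {"doc_count": sum(1 for d in docs if predicate(d))}
        for name, predicate in filters.items()
    }

buckets = filters_agg(docs, filters)
print(buckets)
```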

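Filter aggregation (single-bucket)

Defines a single bucket of all the documents in the current document set context that match a specified filter. It is often used to narrow the current aggregation context down to a specific set of documents.

Select the documents whose city is 北京 and compute their average, maximum, and minimum age: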

GET twitter/_search
{
  "size":0,
  "aggs": {
    "agg_filter": {
      "filter": {
        "match":{
          "city":"北京"
        }
      },
      "aggs": {
        "age_avg": {
          "avg": {
            "field": "age"
          }
        },
        "avg_max":{
          "max": {
            "field": "age"
          }
        },
        "avg_min":{
          "min": {
            "field": "age"
          }
        }
      }
    }
  }
}

The result is:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 6,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "agg_filter" : {
      "doc_count" : 5,
      "avg_min" : {
        "value" : 20.0
      },
      "avg_max" : {
        "value" : 30.0
      },
      "age_avg" : {
        "value" : 24.6
      }
    }
  }
}

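Date Range aggregation (multi-bucket)

A range aggregation dedicated to date values. The main difference from the normal range aggregation is that the from and to values can be expressed as Date Math expressions, and that a date format can be specified for the from and to fields in the response.

Note: for each range, this aggregation includes the from value and excludes the to value.

Bucket the documents by birthday range: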

GET twitter/_search
{
  "size": 0,
  "aggs": {
    "birthday_range": {
      "date_range": {
        "field": "birthday",
        "format": "yyyy-MM-dd", 
        "ranges": [
          {
            "from": "1989-04-01",
            "to": "1997-04-01"
          },
          {
            "from": "1994-04-01",
            "to": "1999-04-01"
          }
        ]
      }
    }
  }
}

Response (the two ranges overlap, so the document with birthday 1994-04-01 is counted in both buckets):

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 6,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "birthday_range" : {
      "buckets" : [
        {
          "key" : "1989-04-01-1997-04-01",
          "from" : 6.07392E11,
          "from_as_string" : "1989-04-01",
          "to" : 8.598528E11,
          "to_as_string" : "1997-04-01",
          "doc_count" : 4
        },
        {
          "key" : "1994-04-01-1999-04-01",
          "from" : 7.651584E11,
          "from_as_string" : "1994-04-01",
          "to" : 9.229248E11,
          "to_as_string" : "1999-04-01",
          "doc_count" : 2
        }
      ]
    }
  }
}
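
The bucket counts and the numeric from value can be verified offline. The following sketch applies the same two ranges to the sample birthdays with Python's datetime module; date_range_buckets is a hypothetical name, and the epoch conversion assumes UTC.

```python
from datetime import date, datetime, timezone

# Birthdays of the six sample documents.
birthdays = [date(1999, 4, 1), date(1997, 4, 1), date(1994, 4, 1),
             date(1989, 4, 1), date(1993, 4, 1), date(1991, 4, 1)]
ranges = [(date(1989, 4, 1), date(1997, 4, 1)),
          (date(1994, 4, 1), date(1999, 4, 1))]

def date_range_buckets(values, ranges):
    """Count dates per range using from <= value < to; ranges may overlap."""
    buckets = []
    for start, end in ranges:
        buckets.append({
            "key": f"{start.isoformat()}-{end.isoformat()}",
            "doc_count": sum(1 for v in values if start <= v < end),
        })
    return buckets

buckets = date_range_buckets(birthdays, ranges)

# The numeric "from" in the response is epoch milliseconds.
from_millis = int(datetime(1989, 4, 1, tzinfo=timezone.utc).timestamp() * 1000)

print(buckets, from_millis)
```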

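Terms aggregation (multi-bucket)

A multi-bucket value source based aggregation in which buckets are built dynamically, one per unique value.

A terms aggregation can count how often each value occurs. Below we search for documents whose message matches happy birthday and group the hits by city: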

GET twitter/_search
{
  "query": {
    "match": {
      "message": "happy birthday"
    }
  },
  "size": 0, 
  "aggs": {
    "city_terms": {
      "terms": {
        "field": "city.keyword",
        "size": 10,
        "order": {
          "_count": "asc"
        }
      }
    }
  }
}

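size: 10 limits the result to at most ten buckets, and the order setting sorts them by doc_count in ascending order. The field city.keyword assumes city is a text field with a keyword sub-field; with the explicit mapping above, where city is itself a keyword field, use city instead. The aggregation result is: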

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "city_terms" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "上海",
          "doc_count" : 1
        },
        {
          "key" : "北京",
          "doc_count" : 2
        }
      ]
    }
  }
}
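
The same grouping can be approximated locally with collections.Counter. The matches function below is a deliberately crude stand-in for the match query (lowercase word tokens, any term may match) and is not how the Elasticsearch analyzer actually works; it is only an assumption good enough for this sample data.

```python
import re
from collections import Counter

# City and message of the six sample documents.
docs = [
    {"city": "北京", "message": "今儿天气不错啊,出去转转去"},
    {"city": "北京", "message": "出发,下一站云南!"},
    {"city": "北京", "message": "happy birthday!"},
    {"city": "北京", "message": "123,gogogo"},
    {"city": "北京", "message": "Happy BirthDay My Friend!"},
    {"city": "上海", "message": "好友来了都今天我生日,好友来了,什么 birthday happy 就成!"},
]

def matches(message, query):
    """Crude match: True if any query term appears as a lowercase word token."""
    tokens = set(re.findall(r"[a-z0-9]+", message.lower()))
    return any(term in tokens for term in query.lower().split())

# One bucket per unique city among the matching documents.
counts = Counter(d["city"] for d in docs if matches(d["message"], "happy birthday"))

# Ascending by count, at most ten buckets (size: 10, order: _count asc).
buckets = sorted(counts.items(), key=lambda item: item[1])[:10]
print(buckets)
```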

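Histogram aggregation

A multi-bucket value source based aggregation that can be applied to numeric values or numeric range values extracted from the documents. It dynamically builds fixed-size (also called interval) buckets over the values.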

GET twitter/_search
{
  "size": 0,
  "aggs": {
    "age_histogram": {
      "histogram": {
        "field": "age",
        "interval": 2
      }
    }
  }
}

  • interval: the width of each bucket, here 2

Response:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 6,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "age_histogram" : {
      "buckets" : [
        {
          "key" : 20.0,
          "doc_count" : 1
        },
        {
          "key" : 22.0,
          "doc_count" : 1
        },
        {
          "key" : 24.0,
          "doc_count" : 1
        },
        {
          "key" : 26.0,
          "doc_count" : 1
        },
        {
          "key" : 28.0,
          "doc_count" : 1
        },
        {
          "key" : 30.0,
          "doc_count" : 1
        }
      ]
    }
  }
}
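
Each value is assigned to the bucket whose key is the value rounded down to a multiple of the interval. A minimal sketch of that rounding in Python follows; it assumes an offset of 0 and, unlike Elasticsearch's default behavior, does not emit empty buckets for gaps (there are none in this data).

```python
import math
from collections import Counter

ages = [20, 22, 25, 30, 26, 28]

def histogram(values, interval):
    """Group values into fixed-width buckets keyed by floor(value / interval) * interval."""
    counts = Counter(math.floor(v / interval) * interval for v in values)
    return [{"key": float(k), "doc_count": counts[k]} for k in sorted(counts)]

buckets = histogram(ages, 2)
for bucket in buckets:
    print(bucket)
```

With interval 2, age 25 rounds down to key 24, which is why the response contains a 24.0 bucket even though no document has that exact age.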
If you’re going to reuse code, you need to understand that code!
Original article: https://www.cnblogs.com/leizzige/p/14822038.html