Day 3: Elasticsearch aggregation queries

Thanks to the blog author for this post: https://juejin.im/post/6844904032398475278#heading-1

Aggregation basics:
https://juejin.im/post/6844904032398475278#heading-1

Further reading on aggregations:
Elasticsearch: Introduction to aggregation
Elasticsearch: Introduction to pipeline aggregation
Elasticsearch: A thorough understanding of Bucket aggregation in Elasticsearch

Find how many users fall into each age range:

GET twitter/_search

{
	"size": 0,
	"aggs": {
		"age": {
			"range": {
				"field": "age",
				"ranges": [{
						"from": 20,
						"to": 30
					},
					{
						"from": 30,
						"to": 40
					},
					{
						"from": 40,
						"to": 50
					}
				]
			}
		}
	}
}

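This uses a range-type aggregation. Each range includes its from value and excludes its to value.

Above we defined several age ranges, and the query returns one bucket per range. With a non-zero size, the matching documents would appear as individual objects in the hits.hits list; here size is 0, so hits.hits is empty and the per-range document counts are found under aggregations.age.buckets: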

{
	"took": 4,
	"timed_out": false,
	"_shards": {
		"total": 1,
		"successful": 1,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": {
			"value": 5,
			"relation": "eq"
		},
		"max_score": null,
		"hits": []
	},
	"aggregations": {
		"age": {
			"buckets": [{
					"key": "20.0-30.0",
					"from": 20.0,
					"to": 30.0,
					"doc_count": 0
				},
				{
					"key": "30.0-40.0",
					"from": 30.0,
					"to": 40.0,
					"doc_count": 3
				},
				{
					"key": "40.0-50.0",
					"from": 40.0,
					"to": 50.0,
					"doc_count": 0
				}
			]
		}
	}
}
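
To make the bucket boundaries concrete, here is a minimal Python sketch that mimics how a range aggregation assigns values to buckets, assuming each range includes from and excludes to. The function name range_buckets and the sample ages are hypothetical, chosen only so that the counts resemble the response above; they are not the actual contents of the twitter index.

```python
def range_buckets(values, ranges):
    """Count values per range; each range is treated as [from, to)."""
    buckets = []
    for r in ranges:
        lo, hi = r["from"], r["to"]
        count = sum(1 for v in values if lo <= v < hi)
        buckets.append({
            "key": f"{float(lo)}-{float(hi)}",
            "from": float(lo),
            "to": float(hi),
            "doc_count": count,
        })
    return buckets


# Hypothetical ages (illustration only, not real index data).
ages = [30, 35, 39, 55, 60]
ranges = [{"from": 20, "to": 30}, {"from": 30, "to": 40}, {"from": 40, "to": 50}]
result = range_buckets(ages, ranges)
for bucket in result:
    print(bucket["key"], bucket["doc_count"])
```

Note that the age 30 lands in the 30-40 bucket, not in 20-30, and that documents outside every range (55 and 60 here) are counted in hits.total but in no bucket.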

Count how often each keyword value occurs:
Built-in keywords used: aggs, terms, field, keyword
curl -H 'Content-type: application/json' -XGET 'http://localhost:10290/apollo/_search?pretty' -d '{"aggs":{"number_of_cities":{"terms":{"field":"city.keyword"}}}, "size":0}'

{
	"aggs": {
		"number_of_cities": {
			"terms": {
				"field": "city.keyword"
			}
		}
	},
	"size": 0
}

The result:

{
	"took": 3,
	"timed_out": false,
	"_shards": {
		"total": 5,
		"successful": 5,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": 71150,
		"max_score": 0.0,
		"hits": []
	},
	"aggregations": {
		"number_of_cities": {
			"doc_count_error_upper_bound": 116,
			"sum_other_doc_count": 16983,
			"buckets": [{
					"key": "合肥",
					"doc_count": 30017
				},
				{
					"key": "",
					"doc_count": 16761
				},
				{
					"key": "columbia",
					"doc_count": 1546
				}
			]
		}
	}
}
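
As a rough illustration of what the terms aggregation reports, the following Python sketch counts values, keeps the top buckets, and sums the remainder into sum_other_doc_count. The function terms_agg and the city list are hypothetical. The sketch runs on a single in-memory list, so it does not model the per-shard counting that gives rise to doc_count_error_upper_bound in a real cluster.

```python
from collections import Counter


def terms_agg(values, size=10):
    """Return the `size` most frequent values plus the count of everything else."""
    counts = Counter(values)
    top = counts.most_common(size)
    buckets = [{"key": key, "doc_count": count} for key, count in top]
    other = sum(counts.values()) - sum(count for _, count in top)
    return {"sum_other_doc_count": other, "buckets": buckets}


# Hypothetical city values; the empty string is a value like any other.
cities = ["合肥"] * 5 + [""] * 3 + ["columbia"] * 2 + ["beijing"]
result = terms_agg(cities, size=3)
for bucket in result["buckets"]:
    print(repr(bucket["key"]), bucket["doc_count"])
print("other:", result["sum_other_doc_count"])
```

The empty-string bucket in the response above arises the same way: documents whose city field holds an empty string are grouped under the key "".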

Count the number of distinct cities:
To find out how many different cities there are, use the built-in keyword cardinality
GET _search { "size": 0, "aggs": { "number_of_cities": { "cardinality": { "field": "city.keyword" } } } }

{
	"size": 0,
	"aggs": {
		"number_of_cities": {
			"cardinality": {
				"field": "city.keyword"
			}
		}
	}
}
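
Conceptually, cardinality is a distinct count. A minimal Python sketch, with a hypothetical helper name and made-up data, is shown below. It computes an exact count with a set, whereas Elasticsearch uses an approximate algorithm (HyperLogLog++), so results on large data sets may differ slightly from an exact count.

```python
def cardinality(values):
    """Exact number of distinct values (Elasticsearch itself approximates this)."""
    return len(set(values))


# Hypothetical city values, for illustration only.
cities = ["合肥", "合肥", "", "columbia", "columbia", "beijing"]
distinct = cardinality(cities)
print(distinct)
```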

Compute the average user age:
Built-in aggregation: avg
GET twitter/_search { "size": 0, "aggs": { "average_age": { "avg": { "field": "age" } } } }

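Compute the average score with avg, the highest score with max, the lowest score with min, and the total with sum. The example below uses avg: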
curl -H 'Content-type: application/json' -XGET 'http://localhost:10290/apollo/_search?pretty' -d '{"aggs":{"average_score":{"avg":{"field":"os_score"}}}, "size":0}'

{
	"aggs": {
		"average_score": {
			"avg": {
				"field": "os_score"
			}
		}
	},
	"size": 0
}
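
Since avg, max, min and sum share the same request shape, several of them can be placed side by side in one aggs object. The Python sketch below builds such a request body; the helper metric_aggs and the aggregation names (avg_os_score and so on) are hypothetical, and the printed JSON would be sent to the same _search endpoint as in the curl command above.

```python
import json


def metric_aggs(field, metrics=("avg", "max", "min", "sum")):
    """Build a search body with one single-value metric aggregation per metric."""
    return {
        "size": 0,  # return no documents, only aggregation results
        "aggs": {f"{m}_{field}": {m: {"field": field}} for m in metrics},
    }


body = metric_aggs("os_score")
print(json.dumps(body, indent=2))
```

Elasticsearch also provides a stats aggregation that returns count, min, max, avg and sum in one aggregation, which may be simpler when all of them are needed.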

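Use a script to recompute the aggregation result:
The script is applied to each field value (_value) before the values are aggregated. The example below multiplies the score by 0.8 with * and then takes the maximum; use / to divide (for example by 2), + to add a number, and - to subtract one.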
curl -H 'Content-type: application/json' -XGET 'http://localhost:10290/apollo/_search?pretty' -d '{"aggs":{"average_score":{"max":{"field":"os_score", "script":{"source":"_value * params.correction", "params":{"correction": 0.8}}}}}, "size":0}'

{
	"size": 0,
	"aggs": {
		"average_score": {
			"max": {
				"field": "os_score",
				"script": {
					"source": "_value * params.correction",
					"params": {
						"correction": 0.8
					}
				}
			}
		}
	}
}
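
The effect of the value script can be sketched in a few lines of Python: every value is transformed first, and the metric is computed over the transformed values. The function name and the sample scores are hypothetical.

```python
def max_with_value_script(values, correction):
    """Apply `_value * correction` to each value, then take the maximum."""
    return max(v * correction for v in values)


# Hypothetical scores, for illustration only.
scores = [90.0, 92.0, 100.0]
result = max_with_value_script(scores, 0.8)
print(result)
```

For a positive multiplier this gives the same number as scaling the plain maximum, but with a negative multiplier or a non-monotonic expression the two would differ, because the script runs per value rather than on the final result.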

Aggregate with a script directly, without field:
This should be equivalent to the approach above, but my attempt did not succeed.
GET twitter/_search

{
	"size": 0,
	"aggs": {
		"average_2_times_os_score": {
			"avg": {
				"script": {
					"source": "doc['os_score'].value * params.times",
					"params": {
						"times": 2.0
					}
				}
			}
		}
	}
}

Percentile aggregation
The percentiles aggregation can help spot outliers in os_score. The query below returns the score at the 25th, 50th, 75th and 100th percentiles.

{
	"size": 0,
	"aggs": {
		"os_score_quartiles": {
			"percentiles": {
				"field": "os_score",
				"percents": [
					25,
					50,
					75,
					100
				]
			}
		}
	}
}

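The results are shown below. We can see that:
25% of the scores are 90 or below
50% of the scores are 92 or below
75% of the scores are 100 or below
the highest score is 100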

{
	"took": 8,
	"timed_out": false,
	"_shards": {
		"total": 5,
		"successful": 5,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": 71150,
		"max_score": 0.0,
		"hits": []
	},
	"aggregations": {
		"os_score_quartiles": {
			"values": {
				"25.0": 90.0,
				"50.0": 92.0,
				"75.0": 100.0,
				"100.0": 100.0
			}
		}
	}
}
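
To show how such percentile values relate to the underlying data, here is a minimal Python sketch using the nearest-rank definition of a percentile. The function and the eight sample scores are hypothetical, picked so that the output resembles the response above. Elasticsearch computes percentiles approximately (TDigest by default), so its numbers can differ from an exact calculation like this one.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    if not sorted_values:
        raise ValueError("percentile of an empty list is undefined")
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


# Hypothetical scores, for illustration only.
scores = sorted([85.0, 90.0, 91.0, 92.0, 95.0, 100.0, 100.0, 100.0])
result = {str(float(p)): percentile(scores, p) for p in (25, 50, 75, 100)}
print(result)
```

When the 75th percentile already equals the maximum, as here, at least a quarter of the values sit at the top score.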

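analyzer

One reason Elasticsearch can return search results within seconds: documents are indexed when they are stored. During indexing, an analyzer splits text into tokens; the _analyze API shows the tokens a given analyzer produces.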

curl -H 'Content-type: application/json' -XGET 'http://localhost:10290/apollo/_analyze?pretty' -d '{"text":["我是一个兵"], "analyzer":"standard"}'

{
	"text": ["我是一个兵"],
	"analyzer": "standard"
}

The result is shown below, with five tokens:

{
	"tokens": [{
			"token": "我",
			"start_offset": 0,
			"end_offset": 1,
			"type": "<IDEOGRAPHIC>",
			"position": 0
		},
		{
			"token": "是",
			"start_offset": 1,
			"end_offset": 2,
			"type": "<IDEOGRAPHIC>",
			"position": 1
		},
		{
			"token": "一",
			"start_offset": 2,
			"end_offset": 3,
			"type": "<IDEOGRAPHIC>",
			"position": 2
		},
		{
			"token": "个",
			"start_offset": 3,
			"end_offset": 4,
			"type": "<IDEOGRAPHIC>",
			"position": 3
		},
		{
			"token": "兵",
			"start_offset": 4,
			"end_offset": 5,
			"type": "<IDEOGRAPHIC>",
			"position": 4
		}
	]
}
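
The response above shows the standard analyzer emitting one token per Chinese character. The Python sketch below reproduces that shape for this specific case only. It assumes the input consists purely of CJK ideographs; it is a hypothetical helper, not an implementation of the standard analyzer, which also handles Latin words, digits, punctuation and lowercasing.

```python
def tokenize_ideographs(text):
    """Emit one token per character, mirroring the _analyze output for pure CJK text."""
    return [
        {
            "token": ch,
            "start_offset": i,
            "end_offset": i + 1,
            "type": "<IDEOGRAPHIC>",
            "position": i,
        }
        for i, ch in enumerate(text)
    ]


tokens = tokenize_ideographs("我是一个兵")
for t in tokens:
    print(t["position"], t["token"], t["start_offset"], t["end_offset"])
```

Because every character becomes its own token, a search for a multi-character word matches on individual characters; a dedicated Chinese analyzer plugin is commonly used when word-level tokens are wanted.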