Deep Dive | Understanding Elasticsearch Aggregations

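When using Elasticsearch, most of us end up computing statistics in addition to running full-text searches, and statistics in Elasticsearch means aggregations.

The terms aggregation, which resembles GROUP BY in MySQL, is the one used most often. Once the requirements become more complex, though, it is easy to get stuck, which is why aggregation questions come up in the community almost every day.

Based on the official documentation, this article works through the core questions below, with one goal: to explain Elasticsearch aggregations thoroughly using practical scenarios.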

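1. Elasticsearch aggregations at a glance

Full-text search looks documents up through the key-value structure of the inverted index; aggregations instead summarize the data. Two examples:

[Figure: aggregated statistics for a specific category]

[Figure: aggregated statistics by month]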

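2. What an Elasticsearch aggregation is

Aggregations are the feature ES provides, alongside search, for statistical analysis of the data stored in ES.

The search part of a search engine focuses on filtering and finding documents, whereas aggregations focus on statistics and analysis.

The basic syntax is: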

"aggregations" : {
    "<aggregation_name>" : {
        "<aggregation_type>" : {
            <aggregation_body>
        }
        [,"meta" : {  [<meta_data_body>] } ]?
        [,"aggregations" : { [<sub_aggregation>]+ } ]?
    }
    [,"<aggregation_name_2>" : { ... } ]*
}
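To make the grammar above concrete, here is a minimal Python sketch that checks whether a dictionary has the shape it describes. The helper name `follows_agg_grammar` is made up for illustration and is not part of any Elasticsearch client. It only checks structure, not whether the aggregation type actually exists.

```python
RESERVED_KEYS = {"meta", "aggs", "aggregations"}  # "aggs" is the short alias of "aggregations"


def follows_agg_grammar(aggs):
    """Return True if `aggs` maps names to {<type>: {...}, [meta], [sub-aggregations]}."""
    if not isinstance(aggs, dict) or not aggs:
        return False
    for name, definition in aggs.items():
        if not isinstance(name, str) or not isinstance(definition, dict):
            return False
        # Exactly one key must be the aggregation type; its value is the aggregation body.
        type_keys = [key for key in definition if key not in RESERVED_KEYS]
        if len(type_keys) != 1 or not isinstance(definition[type_keys[0]], dict):
            return False
        # Optional sub-aggregations follow the same grammar recursively.
        for sub_key in ("aggs", "aggregations"):
            if sub_key in definition and not follows_agg_grammar(definition[sub_key]):
                return False
    return True


valid = {
    "by_color": {
        "terms": {"field": "color"},
        "aggs": {"avg_price": {"avg": {"field": "price"}}},
    }
}
# Two aggregation types under one name violates the grammar.
invalid = {"by_color": {"terms": {"field": "color"}, "avg": {"field": "price"}}}

print(follows_agg_grammar(valid))    # True
print(follows_agg_grammar(invalid))  # False
```

The second dictionary fails because a named aggregation may carry only one aggregation type; a second computation has to go either under its own name or into a sub-aggregation.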

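3. Types of Elasticsearch aggregations

3.1 Category 1: Metric aggregations

A metric aggregation computes values over a set of documents: either all documents in the search result set, or the documents in each logical group.

The MySQL analogues are MIN(), MAX(), STDDEV(), and SUM().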

       single-value metric
                |
                v
SELECT AVG(price) FROM products


         multi-value metric
          |          |
          v          v
SELECT MIN(price), MAX(price) FROM products

The DSL equivalent of the AVG query as a metric aggregation:

{
    "aggs":{
        "avg_price":{
            "avg":{
                "field":"price"
            }
        }
    }
}
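The difference between single-value and multi-value metrics can be sketched locally in Python. This is not Elasticsearch code: the two functions below only mimic the result shape of the `avg` aggregation (one `value`) and the `stats` aggregation (count, min, max, avg, sum), and the price list is hypothetical.

```python
from statistics import mean

prices = [10.0, 20.0, 60.0]  # hypothetical price values


def avg_metric(values):
    """Single-value metric: one number, shaped like the avg aggregation's result."""
    return {"value": mean(values)}


def stats_metric(values):
    """Multi-value metric: several numbers at once, like the stats aggregation."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": mean(values),
        "sum": sum(values),
    }


print(avg_metric(prices))    # {'value': 30.0}
print(stats_metric(prices))  # {'count': 3, 'min': 10.0, 'max': 60.0, 'avg': 30.0, 'sum': 90.0}
```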

Metric aggregation comparison:

Aggregation	Elasticsearch	MySQL
Avg	Yes	Yes
Cardinality (distinct value count)	Yes (approximate, HyperLogLog++)	Yes (Exact), similar to DISTINCT
Extended Stats	Yes	StdDev bounds missing
Geo Bounds	Yes	for future blog post
Geo Centroid	Yes	for future blog post
Max	Yes	Yes
Percentiles	Yes	Complex SQL or UDF
Percentile Ranks	Yes	Complex SQL or UDF
Scripted	Yes	No
Stats	Yes	Yes
Top Hits (important and easily overlooked)	Yes	Complex
Value Count	Yes	Yes

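Of these, the top_hits sub-aggregation returns the top X matching documents within each group, and supports _source filtering to return only selected fields.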
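As a sketch of what that looks like in practice, the request body below nests `top_hits` under a `terms` aggregation, and the small function after it reproduces the same idea locally. The field names, the aggregation names, the helper `top_hits_locally`, and the sample documents are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Request body: group by brand, keep the single most expensive document per brand,
# and return only the name and price fields of each hit.
top_hits_request = {
    "size": 0,
    "aggs": {
        "by_name": {
            "terms": {"field": "name.keyword"},
            "aggs": {
                "top_priced": {
                    "top_hits": {
                        "size": 1,
                        "sort": [{"price": {"order": "desc"}}],
                        "_source": {"includes": ["name", "price"]},
                    }
                }
            },
        }
    },
}


def top_hits_locally(docs, group_field, sort_field, size, source_fields):
    """Local stand-in: top `size` docs per group by `sort_field`, trimmed to `source_fields`."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[group_field]].append(doc)
    return {
        key: [
            {field: doc[field] for field in source_fields}
            for doc in sorted(members, key=lambda d: d[sort_field], reverse=True)[:size]
        ]
        for key, members in groups.items()
    }


docs = [  # hypothetical documents
    {"name": "bmw", "color": "red", "price": 30000},
    {"name": "bmw", "color": "blue", "price": 50000},
    {"name": "ford", "color": "red", "price": 20000},
]
print(top_hits_locally(docs, "name", "price", 1, ["name", "price"]))
# {'bmw': [{'name': 'bmw', 'price': 50000}], 'ford': [{'name': 'ford', 'price': 20000}]}
```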

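3.2 Category 2: Bucket aggregations

Bucket aggregations build logical groups of documents from the search results: documents that satisfy a given rule are placed in the same bucket, and each bucket is associated with a key.

This is analogous to GROUP BY in MySQL. A MySQL example: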

-- bucket rows by size
SELECT size, COUNT(*) FROM products GROUP BY size

+----------------------+
| size     |  COUNT(*) |
+----------------------+
| S        |   123     | <--- set of rows with size = S
| M        |   456     |
| ...      |  ...      |

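A comparable bucket aggregation in the DSL (here restricted to documents matching a query, and bucketed by two fields):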

{
  "query": {
    "match": {
      "title": "Beach"
    }
  },
  "aggs": {
    "by_size": {
      "terms": {
        "field": "size"
      }
    },
    "by_material": {
      "terms": {
        "field": "material"
      }
    }
  }
}
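What this request computes can be sketched locally in Python. The products below are hypothetical, the word test on the title is only a crude stand-in for the `match` query (no analysis or scoring), and `terms_buckets` is an illustrative helper that returns buckets in the same key/doc_count shape, largest bucket first.

```python
from collections import Counter

products = [  # hypothetical documents
    {"title": "Beach towel", "size": "M", "material": "cotton"},
    {"title": "Beach ball", "size": "S", "material": "plastic"},
    {"title": "Beach chair", "size": "M", "material": "plastic"},
    {"title": "Desk lamp", "size": "S", "material": "metal"},
]


def terms_buckets(docs, field):
    """Count documents per distinct value of `field`, largest bucket first."""
    counts = Counter(doc[field] for doc in docs)
    return [{"key": key, "doc_count": count} for key, count in counts.most_common()]


# Crude stand-in for the match query: keep documents whose title contains the word "beach".
matched = [doc for doc in products if "beach" in doc["title"].lower().split()]

# The two aggregations run side by side over the same matched documents.
by_size = terms_buckets(matched, "size")
by_material = terms_buckets(matched, "material")
print(by_size)      # [{'key': 'M', 'doc_count': 2}, {'key': 'S', 'doc_count': 1}]
print(by_material)  # [{'key': 'plastic', 'doc_count': 2}, {'key': 'cotton', 'doc_count': 1}]
```

Note that the query narrows the document set first, so the desk lamp never reaches either bucket list.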

Bucket aggregation comparison:

Aggregation	Elasticsearch	MySQL
Children (parent-child documents)	Yes	for future blog post
Date Histogram (buckets by time)	Yes	Complex
Date Range	Yes	Complex
Filter	Yes	n/a (yes)
Filters	Yes	n/a (yes)
Geo Distance	Yes	for future blog post
GeoHash grid	Yes	for future blog post
Global	Yes	n/a (yes)
Histogram	Yes	Complex
IPv4 Range	Yes	Complex
Missing	Yes	Yes
Nested	Yes	for future blog post
Range	Yes	Complex
Reverse Nested	Yes	for future blog post
Sampler	Yes	Complex
Significant Terms	Yes	No
Terms (most commonly used)	Yes	Yes

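3.3 Category 3: Pipeline aggregations

Pipeline aggregations operate on the results of other aggregations rather than on the original document set.

Imagine you run a daily-deals online shop and want to know the average price of all products, grouped by the date they went into stock.

In SQL you could write:

SELECT in_stock_since, AVG(price) FROM products GROUP BY in_stock_since

ES example: the demo in this article is more involved. It computes sales per month and then keeps only the months whose total sales exceed 200000.

The full DSL is given in the pipeline step of the complete example and is not repeated here.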

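3.4 Category 4: Matrix aggregations

Matrix aggregations operate on multiple fields and produce a matrix result (for example, matrix_stats).

According to the official ES 6.4 documentation, this feature is experimental and may be changed or removed completely in a future release.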

4. A complete Elasticsearch aggregation example

4.1 Step 1: Import the data, relying on dynamic mapping

POST _bulk
{"index":{"_index":"cars","_type":"doc","_id":"1"}}
{"name":"bmw","date":"2017-06-01", "color":"red", "price":30000}
{"index":{"_index":"cars","_type":"doc","_id":"2"}}
{"name":"bmw","date":"2017-06-30", "color":"blue", "price":50000}
{"index":{"_index":"cars","_type":"doc","_id":"3"}}
{"name":"bmw","date":"2017-08-11", "color":"red", "price":90000}
{"index":{"_index":"cars","_type":"doc","_id":"4"}}
{"name":"ford","date":"2017-07-15", "color":"red", "price":20000}
{"index":{"_index":"cars","_type":"doc","_id":"5"}}
{"name":"ford","date":"2017-07-01", "color":"blue", "price":40000}
{"index":{"_index":"cars","_type":"doc","_id":"6"}}
{"name":"bmw","date":"2017-08-01", "color":"green", "price":10000}
{"index":{"_index":"cars","_type":"doc","_id":"7"}}
{"name":"jeep","date":"2017-07-08", "color":"red", "price":110000}
{"index":{"_index":"cars","_type":"doc","_id":"8"}}
{"name":"jeep","date":"2017-08-25", "color":"red", "price":230000}

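4.2 Step 2: Check the mapping

GET cars/_mapping

With the default dynamic mapping, the string fields name and color are mapped as text with a keyword sub-field, which is why the bucket aggregations in this example target name.keyword and color.keyword.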

4.3 Step 3: Metric aggregation

Compute the average price of the cars.

POST cars/_search
{
  "size": 0,
  "aggs": {
    "avg_price": {
      "avg": {
        "field": "price"
      }
    }
  }
}
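As a sanity check, the expected result can be worked out by hand from the eight documents loaded in step 1. The short Python sketch below does only that arithmetic; it does not talk to Elasticsearch. The number it prints is what the `value` field of the avg aggregation in the response should contain.

```python
# Prices of the eight documents loaded in step 1, in _id order.
prices = [30000, 50000, 90000, 20000, 40000, 10000, 110000, 230000]

avg_price = sum(prices) / len(prices)
print(avg_price)  # 72500.0
```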

4.4 Step 4: Bucket aggregation with a sub-aggregation

Group by car brand, then group again by color within each brand.

POST cars/_search
{
  "size": 0,
  "aggs": {
    "name_aggs": {
      "terms": {
        "field": "name.keyword"
      },
      "aggs": {
        "color_aggs": {
          "terms": {
            "field": "color.keyword"
          }
        }
      }
    }
  }
}
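The bucket counts this request should return can be derived by hand from the step 1 data. The following Python sketch does that grouping locally as a cross-check; it is a simulation of the nested terms aggregation, not a call to Elasticsearch.

```python
from collections import Counter, defaultdict

# (name, color) of the eight documents loaded in step 1, in _id order.
cars = [
    ("bmw", "red"), ("bmw", "blue"), ("bmw", "red"), ("ford", "red"),
    ("ford", "blue"), ("bmw", "green"), ("jeep", "red"), ("jeep", "red"),
]

# Outer bucket: brand. Inner bucket: color within that brand.
colors_by_name = defaultdict(Counter)
for name, color in cars:
    colors_by_name[name][color] += 1

for name, colors in colors_by_name.items():
    print(name, sum(colors.values()), dict(colors))
# bmw 4 {'red': 2, 'blue': 1, 'green': 1}
# ford 2 {'red': 1, 'blue': 1}
# jeep 2 {'red': 2}
```

In the response, each outer bucket's doc_count corresponds to the middle number, and the inner buckets list the colors with their own doc_count.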

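4.5 Step 5: Pipeline aggregation

Compute sales per month, and keep only the months whose total sales exceed 200000. (On Elasticsearch 7.2 and later, the date_histogram interval parameter is deprecated in favor of calendar_interval.)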

POST /cars/_search
{
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      },
      "aggs": {
        "total_sales": {
          "sum": {
            "field": "price"
          }
        },
        "sales_bucket_filter": {
          "bucket_selector": {
            "buckets_path": {
              "totalSales": "total_sales"
            },
            "script": "params.totalSales > 200000"
          }
        }
      }
    }
  }
}
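The three stages of this request (monthly buckets, a sum per bucket, then a filter on those sums) can be traced locally with the step 1 data. The Python sketch below is a hand simulation under that reading of the request, not Elasticsearch code.

```python
from collections import defaultdict

# (date, price) of the eight documents loaded in step 1, in _id order.
sales = [
    ("2017-06-01", 30000), ("2017-06-30", 50000), ("2017-08-11", 90000),
    ("2017-07-15", 20000), ("2017-07-01", 40000), ("2017-08-01", 10000),
    ("2017-07-08", 110000), ("2017-08-25", 230000),
]

# date_histogram + sum: one bucket per calendar month with its total sales.
total_sales = defaultdict(int)
for date, price in sales:
    total_sales[date[:7]] += price  # "2017-06-01" -> "2017-06"

# bucket_selector: keep only the buckets for which the script condition holds.
selected = {month: total for month, total in sorted(total_sales.items()) if total > 200000}

print(dict(sorted(total_sales.items())))  # {'2017-06': 80000, '2017-07': 170000, '2017-08': 330000}
print(selected)                           # {'2017-08': 330000}
```

Only August survives the filter, which shows the defining trait of a pipeline aggregation: the condition is evaluated against the computed totals, not against individual documents.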

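5. Guidelines for using Elasticsearch aggregations

Starting point: Elasticsearch offers far more aggregation types, and can do far more with them, than MySQL. When you face an aggregation problem, work out which of the four categories it falls into and look up the corresponding API in the official documentation. For the most common scenarios:

Is it a GROUP BY operation? If so, use the terms aggregation (a bucket aggregation).
Is it grouping by time? If so, use the date_histogram aggregation (a bucket aggregation).
Is it grouping and then grouping again within each group? If so, nest a terms or top_hits sub-aggregation inside a terms aggregation.
Is it a maximum, minimum, average, or similar? If so, use the corresponding metric aggregation: max, min, avg, and so on.
Is it a condition applied to aggregated results before returning them? If so, combine a pipeline aggregation with other aggregations.

Experiment a lot, and verify your queries in Kibana Dev Tools.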

