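This post introduces two of Druid's commonly used query types: TopN and GroupBy.
Server configuration used in practice:
Per machine: 24-core CPU + 2 × 1.4 TB SSD flash cards + 256 GB RAM + gigabit NIC, × 15 machines
Data volume: 400 million+ records per day, 100 consecutive days of data, about 40 billion records in total
I. Data categories: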
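Timestamp | Data that is close in time is aggregated together, and queries specify a time range. Internally the timestamp is stored as absolute milliseconds; by default it is displayed in ISO-8601 format, yyyy-MM-ddTHH:mm:ss.SSSZ, where "Z" denotes the zero UTC offset. China, in UTC+8, is written "+08:00" |
Dimensions | The same as dimensions in OLAP. String-typed fields of a record can be treated as dimension columns, which are used to filter and group data and to identify what the statistics are broken down by, e.g. name, category |
Metrics | Metric (measure) columns, used for aggregation and computation, e.g. total visits, total amount |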
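To make the timestamp row concrete, here is a minimal Python sketch that renders absolute epoch milliseconds as ISO-8601 text, once at the zero offset ("Z") and once at "+08:00". The helper name `millis_to_iso` is hypothetical and only illustrates the formatting; it is not part of Druid.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def millis_to_iso(ms, offset_hours=0):
    """Format epoch milliseconds as ISO-8601 at a fixed UTC offset (hypothetical helper)."""
    tz = timezone(timedelta(hours=offset_hours))
    # Integer timedelta arithmetic avoids float rounding on the millisecond part.
    moment = (EPOCH + timedelta(milliseconds=ms)).astimezone(tz)
    text = moment.isoformat(timespec="milliseconds")
    # Python writes the zero offset as "+00:00"; ISO-8601 also allows the shorthand "Z".
    if text.endswith("+00:00"):
        text = text[:-6] + "Z"
    return text

ms = 1583020800000  # the same instant, shown at two offsets
print(millis_to_iso(ms))     # 2020-03-01T00:00:00.000Z
print(millis_to_iso(ms, 8))  # 2020-03-01T08:00:00.000+08:00
```

The stored value is the same in both cases; only the displayed offset changes.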
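II. Query types:
Timeseries | Returns an aggregated result set for the specified date/time interval. The query can set the granularity, filter conditions on dimensions, and the sort order, and it supports post-aggregations. It returns an array of JSON objects. |
TopN | Suited to single-dimension queries. Given a dimension and a ranking metric, it returns a result set; a TopN query can be viewed as a GroupBy over a single dimension with a fixed sort order, and it is much faster than the equivalent GroupBy query. The metric property is specific to TopN and names the metric to rank by. |
GroupBy | Suited to multi-dimension queries; the most flexible query type and also the most expensive. |
In terms of performance, Timeseries and TopN queries outperform GroupBy.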
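III. Parameters:
aggregations | Corresponds to the aggregate expressions in "SELECT SUM(X), SUM(Y) FROM": the metric columns to aggregate |
dimensions | Corresponds to "GROUP BY X,Y": the columns to group by |
filter | Corresponds to "WHERE X=1 AND|OR Y=2": the filter conditions |
granularity | The time granularity at which data is aggregated ("all" puts the whole interval into a single bucket) |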
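The mapping from these parameters to a native query body can be sketched in Python as follows. The helper names (`selector`, `and_filter`, `double_sum`, `base_query`) are hypothetical and do not come from any Druid client library; only the JSON shapes are taken from the queries in this post.

```python
import json

def selector(dimension, value):
    """filter: WHERE dimension = value"""
    return {"type": "selector", "dimension": dimension, "value": value}

def and_filter(*fields):
    """filter: combine several conditions with AND"""
    return {"type": "and", "fields": list(fields)}

def double_sum(name, field_name):
    """aggregations: SUM(field_name) AS name"""
    return {"type": "doubleSum", "name": name, "fieldName": field_name}

def base_query(query_type, data_source, intervals, filter_=None,
               aggregations=(), granularity="all"):
    """Assemble the fields shared by the TopN and GroupBy examples."""
    query = {
        "queryType": query_type,
        "dataSource": data_source,
        "granularity": granularity,
        "intervals": intervals,
        "aggregations": list(aggregations),
    }
    if filter_ is not None:
        query["filter"] = filter_
    return query

query = base_query(
    "groupBy",
    "user_recharge_record",
    "2020-01-01/2020-04-01",
    filter_=and_filter(selector("user_name", "Jom")),
    aggregations=[double_sum("recharge", "recharge")],
)
query["dimensions"] = ["user_name", "city"]  # GROUP BY user_name, city
print(json.dumps(query, indent=2))
```

Keeping the filter and aggregation builders separate mirrors the SQL clauses they stand for, which makes a query easier to compare with its SQL form.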
IV. TopN example:
1. Requirements:
- User name: "Jack"
- Dimension: "City"
- Date range: 2020-03-01 to 2020-03-31
- Sort by "recharge" and take the top three records
- SQL equivalent:
SELECT
city, SUM(recharge) AS recharge
FROM
user_recharge_record
WHERE
dt>='2020-03-01' AND dt<='2020-03-31' AND user_name='Jack'
GROUP BY
city
ORDER BY
recharge DESC
LIMIT 3
2. Query request:
{
"queryType": "topN",
"dataSource": "user_recharge_record",
"granularity": "all",
"filter": {
"type": "and",
"fields": [{
"type": "selector",
"dimension": "user_name",
"value": "Jack"
}]
},
"dimension": "city",
"threshold": 3, // number of top rows to return
"metric": "recharge", // must match the name of the aggregation below
"aggregations": [
{
"name": "recharge",
"type": "doubleSum",
"fieldName": "recharge"
}
],
"intervals": "2020-03-01/2020-04-01",
"context": {
"timeout": 28000
}
}
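A request like this is sent to a Broker as an HTTP POST with a JSON body. Below is a minimal sketch using only the Python standard library. The Broker address `http://localhost:8082/druid/v2/` is an assumption for a local setup, and the function names `build_request` and `run_query` are hypothetical; adjust both to your cluster.

```python
import json
import urllib.request

# Assumed Broker address for a local setup; replace with your own host and port.
BROKER_URL = "http://localhost:8082/druid/v2/"

TOPN_QUERY = {
    "queryType": "topN",
    "dataSource": "user_recharge_record",
    "granularity": "all",
    "filter": {
        "type": "and",
        "fields": [{"type": "selector", "dimension": "user_name", "value": "Jack"}],
    },
    "dimension": "city",
    "threshold": 3,
    "metric": "recharge",  # refers to the aggregation named "recharge"
    "aggregations": [{"name": "recharge", "type": "doubleSum", "fieldName": "recharge"}],
    "intervals": "2020-03-01/2020-04-01",
    "context": {"timeout": 28000},
}

def build_request(query, url=BROKER_URL):
    """Wrap a query dict in a POST request with a JSON body."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )

def run_query(query, url=BROKER_URL, timeout_s=30):
    """Send the query and decode the JSON response (needs a reachable Broker)."""
    with urllib.request.urlopen(build_request(query, url), timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))

request = build_request(TOPN_QUERY)
print(request.get_method(), request.full_url)
```

With a Broker running, `run_query(TOPN_QUERY)` would return the parsed result array. The client-side timeout of 30 s is chosen to sit just above the 28000 ms query timeout set in `context`.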
3. Results:
[
{
"timestamp": "2020-02-29T16:00:00.000Z",
"result": [
{
"recharge": 661.61,
"city": "北京"
},
{
"recharge": 503.48,
"city": "上海"
},
{
"recharge": 362.59,
"city": "廣州"
}
]
}
]
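The TopN response is an array of time buckets, each holding a ranked `result` list. A small Python sketch for turning it into flat rows follows; `flatten_topn` and `to_local` are hypothetical helper names. It also shows that the bucket timestamp 2020-02-29T16:00:00.000Z is the same instant as midnight on 2020-03-01 in UTC+8, which suggests the interval was evaluated in the "+08:00" zone.

```python
from datetime import datetime, timedelta, timezone

response = [
    {
        "timestamp": "2020-02-29T16:00:00.000Z",
        "result": [
            {"recharge": 661.61, "city": "北京"},
            {"recharge": 503.48, "city": "上海"},
            {"recharge": 362.59, "city": "廣州"},
        ],
    }
]

def flatten_topn(response, dimension, metric):
    """Return (rank, dimension value, metric value) rows across all buckets."""
    rows = []
    for bucket in response:
        for rank, entry in enumerate(bucket["result"], start=1):
            rows.append((rank, entry[dimension], entry[metric]))
    return rows

def to_local(timestamp, offset_hours=8):
    """Re-express a 'Z'-suffixed ISO-8601 timestamp at a fixed UTC offset."""
    utc = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    return utc.astimezone(timezone(timedelta(hours=offset_hours))).isoformat()

rows = flatten_topn(response, "city", "recharge")
for rank, city, recharge in rows:
    print(rank, city, recharge)
print(to_local(response[0]["timestamp"]))  # 2020-03-01T00:00:00+08:00
```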
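4. Response time:
Time: about 60 ms
V. GroupBy example:
1. Requirements:
- User name: "Jom"
- Dimensions: "City", "UserName", "Level"
- Date range: 2020-01-01 to 2020-03-31
- Sort by "recharge" and take the top three records
- SQL equivalent: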
SELECT
user_name, city, level, SUM(recharge) AS recharge
FROM
user_recharge_record
WHERE
dt>='2020-01-01' AND dt<='2020-03-31' AND user_name='Jom' AND level in(...)
GROUP BY
user_name, city, level
ORDER BY
recharge DESC
LIMIT 3
2. Query request:
{
"dataSource": "user_recharge_record", // data source
"queryType": "groupBy", // query type
"granularity": "all", // one bucket for the whole interval
"intervals": "2020-01-01/2020-04-01", // query interval
"filter": { // like SQL: SELECT ... WHERE user_name='Jom' AND level IN ("...")
"type": "and",
"fields": [
{
"type": "selector",
"dimension": "user_name",
"value": "Jom"
},
{
"type": "in",
"dimension": "level",
"values": [
"工程師", "項目經理", "研發主管"
]
}
]
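},
"limitSpec": {
"type": "default",
"limit": 3,
"columns": [{ "dimension": "recharge", "direction": "descending" }] // like ORDER BY recharge DESC
},
"context": {
"timeout": 28000
},
"dimensions": ["user_name", "city", "level"], // like SELECT user_name, city, level ... GROUP BY user_name, city, level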
"aggregations": [
{
"name": "recharge",
"type": "doubleSum",
"fieldName": "recharge"
}
]
}
3. Results:
[
{
"version": "v1",
"timestamp": "2020-01-01T00:00:00.000+08:00",
"event": {
"city": "jingdong",
"level": "天津",
"user_name": "Jom",
"recharge": 14563.0
}
},
{
"version": "v1",
"timestamp": "2020-01-01T00:00:00.000+08:00",
"event": {
"city": "上海",
"level": "工程師",
"user_name": "Jom",
"recharge": 6799.0
}
},
{
"version": "v1",
"timestamp": "2020-01-01T00:00:00.000+08:00",
"event": {
"city": "北京",
"level": "研發主管",
"user_name": "Jom",
"recharge": 165.0
}
}
]
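Each GroupBy row nests its dimension and metric values under `event`. A minimal Python sketch for flattening such a response and ordering it by the metric on the client side is shown here; `flatten_groupby` is a hypothetical helper, and the sample assumes every event carries the aggregation's output name `recharge`. The two sample rows are copied from the result above and given in reverse order to show the sort.

```python
response = [
    {
        "version": "v1",
        "timestamp": "2020-01-01T00:00:00.000+08:00",
        "event": {"city": "北京", "level": "研發主管", "user_name": "Jom", "recharge": 165.0},
    },
    {
        "version": "v1",
        "timestamp": "2020-01-01T00:00:00.000+08:00",
        "event": {"city": "上海", "level": "工程師", "user_name": "Jom", "recharge": 6799.0},
    },
]

def flatten_groupby(response, metric):
    """Lift each event to a flat dict and sort by the metric, largest first."""
    rows = [dict(item["event"], timestamp=item["timestamp"]) for item in response]
    return sorted(rows, key=lambda row: row[metric], reverse=True)

rows = flatten_groupby(response, "recharge")
for row in rows:
    print(row["user_name"], row["city"], row["level"], row["recharge"])
```

Sorting on the client is only a safety net for small result sets; with a limit in place, the ordering has to be applied by the query itself, otherwise the limit may cut off the wrong rows.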
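4. Response time:
Time: about 800 ms
Conclusion:
Both examples above returned data within roughly 50 to 800 ms.
Timeseries and TopN queries are the fastest; multi-dimension GroupBy analysis is slower.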