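MySQL 8.0 Window Functions
1. Introduction to Window Functions
1.1 What Is a Window Function
MySQL has supported window functions since 8.0. Most other databases have long offered this feature, sometimes under the name analytic functions. So what is a window?
The concept of a window is central: think of it as a set of records. A window function is a special function executed over the set of records that satisfies some condition, and it is evaluated once for every record within that window. For some functions the window size stays fixed no matter which record is being processed; this is a static window. For others, different records correspond to different windows; such a dynamically changing window is called a sliding window. Put simply, a window function computes a value for each row of the query using the rows related to that row.
Window functions are easily confused with ordinary aggregate functions. The differences are:
- An aggregate function collapses multiple records into one, whereas a window function is evaluated for every record: the query returns as many rows as it started with.
- Aggregate functions can also be used as window functions.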
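To make the difference concrete, here is a minimal sketch that runs the same SUM() once as an ordinary aggregate and once as a window function. It uses Python's built-in sqlite3 module as a stand-in for MySQL 8.0 (an assumption made only so the example is self-contained; it needs SQLite 3.25 or later), and the four-row table is illustrative data, not the article's t1.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL 8.0 (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, name TEXT, amount INTEGER)")
conn.executemany("INSERT INTO t1 VALUES (?, ?, ?)", [
    (1, "a1", 100), (2, "a1", 200), (3, "a2", 500), (4, "a2", 600)])

# Aggregate function with GROUP BY: one output row per group.
grouped = conn.execute(
    "SELECT name, SUM(amount) FROM t1 GROUP BY name ORDER BY name").fetchall()

# The same SUM() as a window function: every input row is kept,
# and each row carries the total of its own partition.
windowed = conn.execute(
    "SELECT id, name, amount, SUM(amount) OVER (PARTITION BY name) AS total "
    "FROM t1 ORDER BY id").fetchall()

print(grouped)   # [('a1', 300), ('a2', 1100)]
print(windowed)  # four rows, e.g. (1, 'a1', 100, 300)
```

The two SELECT statements use only standard window syntax, so they should run unchanged on MySQL 8.0.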
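1.2 Available Window Functions
Name | Description |
---|---|
CUME_DIST() | Computes the cumulative distribution of a value within a group of values |
DENSE_RANK() | Assigns a rank to each row in its partition according to the ORDER BY clause. Rows with equal values receive the same rank, and when two or more rows tie, the sequence of rank values has no gaps |
FIRST_VALUE() | Returns the value of the specified expression from the first row of the window frame |
LAG() | Returns the value from the row N rows before the current row in the partition, or NULL if no such row exists |
LAST_VALUE() | Returns the value of the specified expression from the last row of the window frame |
LEAD() | Returns the value from the row N rows after the current row in the partition, or NULL if no such row exists |
NTH_VALUE() | Returns the value of the argument from the Nth row of the window frame |
NTILE() | Distributes the rows of each window partition into the specified number of ranked groups |
PERCENT_RANK() | Computes the percentage rank of a row within its partition or result set |
RANK() | Similar to DENSE_RANK(), except that when two or more rows tie, the sequence of rank values has gaps |
ROW_NUMBER() | Assigns a sequential integer to each row in its partition |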
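Grouped by purpose, the window functions that MySQL supports fall into the following categories:
- Numbering functions: ROW_NUMBER(), RANK(), DENSE_RANK()
- Distribution functions: PERCENT_RANK(), CUME_DIST()
- Preceding/following functions: LAG(), LEAD()
- First/last functions: FIRST_VALUE(), LAST_VALUE()
- Other functions: NTH_VALUE(), NTILE()
2. Window Function Syntax
The syntax for window functions is: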
[WINDOW window_name AS (window_spec)
[, window_name AS (window_spec)] ...]
window_function_name(expression)
OVER (
[partition_definition]
[order_definition]
[frame_definition]
)
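First comes the name of the function used as the window function, followed by an expression, and then OVER(…). Even if OVER contains nothing, the parentheses must be kept.
A key concept for window functions is the current row. The current row belongs to a window, and the window is determined by "[partition_definition]", "[order_definition]", and "[frame_definition]".
window_name: an alias for a window, defined in the WINDOW clause and referenced as OVER window_name. When a query involves many windows, aliases make it clearer and easier to read.
partition_definition: partitions the rows by the specified columns. Two partitions are separated by a partition boundary; the window function is evaluated within a partition and is reinitialized when it crosses a partition boundary.
order_definition: sorts the rows by the specified columns; the window function processes the records in that sorted order. It can be combined with the partition clause or used on its own.
frame_definition: a frame is a subset of the current partition that narrows the window further within the partition. The clause defines the rule for choosing that subset and is typically used to build sliding windows.
The exact syntax is: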
frame_unit {<frame_start>|<frame_between>}
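There are two kinds of frame_unit, ROWS and RANGE. A frame defined with ROWS is bounded by the rows at its start and end positions; a frame defined with RANGE consists of the rows whose values fall within a given interval.
- Row-based:
A row range is usually written as BETWEEN frame_start AND frame_end. frame_start and frame_end accept the following keywords, which select different sets of rows relative to the current row:
CURRENT ROW: the boundary is the current row; usually combined with one of the other keywords
UNBOUNDED PRECEDING: the boundary is the first row of the partition
UNBOUNDED FOLLOWING: the boundary is the last row of the partition
expr PRECEDING: expr (a number or expression) rows before the current row
expr FOLLOWING: expr (a number or expression) rows after the current row
For example, the following are all valid frames:
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING: the frame is the current row, the row before it, and the row after it, three rows in total.
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING: the frame runs from the current row to the last row of the partition. (UNBOUNDED FOLLOWING cannot be used on its own as a frame start.)
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: the frame is every row in the current partition, which is also what you get when the window has neither ORDER BY nor a frame clause.
- Range-based:
This works like the row-based form, but some frames cannot be expressed directly as a number of rows. For instance, if the frame should start with orders from one week ago and end at the current row, ROWS cannot express it directly; a range can: INTERVAL 7 DAY PRECEDING. The 1-minute and 5-minute load averages familiar from Linux are a typical use case for this kind of "most recent N minutes" window.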
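If frame_definition is not specified in the OVER clause, the default frame depends on whether ORDER BY is present. With ORDER BY, MySQL uses the following frame, which ends at the current row and any peers that share its sort value:
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Without ORDER BY, all rows of the partition are peers, so the default frame is the whole partition.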
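The effect of the default frame is easiest to see on a running total. The sketch below is an illustration under assumptions, not output from the article's server: it uses Python's sqlite3 module in place of MySQL 8.0 (SQLite 3.25 or later applies the same default-frame rules), and it loads only the five a1 rows of t1 with the times shortened to dates.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL 8.0 (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, amount INTEGER, time TEXT)")
conn.executemany("INSERT INTO t1 VALUES (?, ?, ?)", [
    (1, 100, "2019-01-01"), (2, 200, "2019-01-02"), (3, 300, "2019-01-02"),
    (4, 300, "2019-01-03"), (5, 300, "2019-01-04")])

rows = conn.execute("""
    SELECT id,
           -- ORDER BY without a frame clause: default RANGE frame,
           -- so rows with the same time (ids 2 and 3) are summed together
           SUM(amount) OVER (ORDER BY time) AS range_sum,
           -- explicit ROWS frame: stops exactly at the current row
           -- (id is added to ORDER BY to make the order of ties deterministic)
           SUM(amount) OVER (ORDER BY time, id
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_sum,
           -- neither ORDER BY nor a frame clause: the whole partition
           SUM(amount) OVER () AS all_sum
    FROM t1
    ORDER BY id
""").fetchall()

for row in rows:
    print(row)
# (1, 100, 100, 1200)
# (2, 600, 300, 1200)
# (3, 600, 600, 1200)
# (4, 900, 900, 1200)
# (5, 1200, 1200, 1200)
```

Rows 2 and 3 share a time, so under the default RANGE frame both already include each other's amount, while the ROWS frame adds them one at a time.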
3. Window Function Examples
MySQL [test]> select * from t1;
+----+------+--------+---------------------+
| id | name | amount | time |
+----+------+--------+---------------------+
| 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 6 | a2 | 500 | 2019-01-05 00:00:00 |
| 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 10 | a2 | 600 | 2019-01-07 00:00:00 |
+----+------+--------+---------------------+
10 rows in set (0.00 sec)
MySQL [test]> select id,name,amount,time,avg(amount) over w as avg_sum from t1 window w as (partition by name order by time desc rows BETWEEN 1 PRECEDING AND 1 FOLLOWING);
+----+------+--------+---------------------+----------+
| id | name | amount | time | avg_sum |
+----+------+--------+---------------------+----------+
| 5 | a1 | 300 | 2019-01-04 00:00:00 | 300.0000 |
| 4 | a1 | 300 | 2019-01-03 00:00:00 | 266.6667 |
| 2 | a1 | 200 | 2019-01-02 00:00:00 | 266.6667 |
| 3 | a1 | 300 | 2019-01-02 00:00:00 | 200.0000 |
| 1 | a1 | 100 | 2019-01-01 00:00:00 | 200.0000 |
| 8 | a2 | 600 | 2019-01-07 00:00:00 | 600.0000 |
| 9 | a2 | 600 | 2019-01-07 00:00:00 | 600.0000 |
| 10 | a2 | 600 | 2019-01-07 00:00:00 | 600.0000 |
| 7 | a2 | 600 | 2019-01-06 00:00:00 | 566.6667 |
| 6 | a2 | 500 | 2019-01-05 00:00:00 | 550.0000 |
+----+------+--------+---------------------+----------+
10 rows in set (0.00 sec)
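#As the result shows, the row with id 5 sits at the partition boundary and has no preceding row, so avg_sum is (300+300)/2=300. The row with id 4 has a row on each side, so avg_sum is (300+300+200)/3=266.6667, and so on, which yields a moving average over a sliding window. This example uses the traditional aggregate function avg() as a window function to compute the moving average.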
MySQL [test]> select CUME_DIST() over (partition by name order by time desc ) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------+----+------+--------+---------------------+
| 0.2 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 0.4 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 0.8 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 0.8 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 1 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 0.6 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 0.6 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 0.6 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 0.8 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 1 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
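#Under the given ordering, CUME_DIST() is the number of rows that sort at or before the current row (its peers included) divided by the total number of rows in the partition. It shows how the data is distributed, as a cumulative fraction, along one dimension.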
MySQL [test]> select DENSE_RANK() over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------+----+------+--------+---------------------+
| 1 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 3 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 3 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 4 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 1 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 1 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 1 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 2 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 3 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.01 sec)
#dense_rank() exists to address the gaps in rank() numbering: with rank(), if two rows tie for 1st place, the next rank is 3 and no row gets rank 2. If you do not want gaps, use dense_rank() instead.
MySQL [test]> select FIRST_VALUE(time) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------------------+----+------+--------+---------------------+
| 2019-01-04 00:00:00 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2019-01-04 00:00:00 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 2019-01-04 00:00:00 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 2019-01-04 00:00:00 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 2019-01-04 00:00:00 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 2019-01-07 00:00:00 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 2019-01-07 00:00:00 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
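#first_value returns the value of the expression from the first row of the frame under the given ordering. The ordering here is time desc, so that is the latest time in each partition.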
MySQL [test]> select LAG(time,1) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------------------+----+------+--------+---------------------+
| NULL | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2019-01-04 00:00:00 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 2019-01-03 00:00:00 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 2019-01-02 00:00:00 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 2019-01-02 00:00:00 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| NULL | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 2019-01-06 00:00:00 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#lag(column,n) returns the given column from the row n rows before the current row under the given ordering.
MySQL [test]> select LAST_VALUE(time) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------------------+----+------+--------+---------------------+
| 2019-01-04 00:00:00 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2019-01-03 00:00:00 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 2019-01-02 00:00:00 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 2019-01-02 00:00:00 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 2019-01-01 00:00:00 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 2019-01-07 00:00:00 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-06 00:00:00 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 2019-01-05 00:00:00 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
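#last_value returns the value of the expression from the last row of the frame. With ORDER BY and no frame clause, the frame ends at the current row and its peers, so here every row gets its own time back rather than the last time in the partition.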
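To have LAST_VALUE() look at the whole partition, the frame has to be widened explicitly. The following is a minimal sketch of that, assuming Python's sqlite3 module as a stand-in for MySQL 8.0 (SQLite 3.25 or later) and a copy of t1 with the amount column dropped and the times shortened to dates.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL 8.0 (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, name TEXT, time TEXT)")
conn.executemany("INSERT INTO t1 VALUES (?, ?, ?)", [
    (1, "a1", "2019-01-01"), (2, "a1", "2019-01-02"), (3, "a1", "2019-01-02"),
    (4, "a1", "2019-01-03"), (5, "a1", "2019-01-04"),
    (6, "a2", "2019-01-05"), (7, "a2", "2019-01-06"), (8, "a2", "2019-01-07"),
    (9, "a2", "2019-01-07"), (10, "a2", "2019-01-07")])

rows = conn.execute("""
    SELECT id, time,
           -- default frame: ends at the current row and its peers
           LAST_VALUE(time) OVER (PARTITION BY name ORDER BY time DESC)
               AS default_frame,
           -- explicit frame covering the whole partition
           LAST_VALUE(time) OVER (PARTITION BY name ORDER BY time DESC
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
               AS full_frame
    FROM t1
    ORDER BY id
""").fetchall()

for row in rows:
    print(row)
# default_frame repeats each row's own time;
# full_frame is 2019-01-01 for every a1 row and 2019-01-05 for every a2 row.
```

Because the ordering is time desc, the last row of the full frame is the earliest time in each partition.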
MySQL [test]> select LEAD(time,1) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------------------+----+------+--------+---------------------+
| 2019-01-03 00:00:00 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2019-01-02 00:00:00 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 2019-01-02 00:00:00 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 2019-01-01 00:00:00 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| NULL | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 2019-01-07 00:00:00 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-06 00:00:00 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-05 00:00:00 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| NULL | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#lead(column,n) returns the given column from the row n rows after the current row under the given ordering.
MySQL [test]> select NTH_VALUE(time,2) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------------------+----+------+--------+---------------------+
| NULL | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2019-01-03 00:00:00 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 2019-01-03 00:00:00 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 2019-01-03 00:00:00 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 2019-01-03 00:00:00 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 2019-01-07 00:00:00 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 2019-01-07 00:00:00 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
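#nth_value(column,n) returns the column's value from the nth row of the frame under the given ordering, or NULL if the frame has fewer than n rows (as for id 5, whose frame contains only itself).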
MySQL [test]> select NTILE(2) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------+----+------+--------+---------------------+
| 1 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 1 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 1 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 2 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 2 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 1 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 1 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 1 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 2 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 2 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
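#ntile(n) splits each partition, in the given order (time desc here), into n groups as evenly as possible and returns the number of the group each row falls into.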
MySQL [test]> select PERCENT_RANK() over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------+----+------+--------+---------------------+
| 0 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 0.25 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 0.5 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 0.5 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 1 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 0 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 0 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 0 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 0.75 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 1 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#Computed as (RANK of the current row - 1) / (number of rows in the partition - 1).
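Both distribution functions can be checked against their formulas with other window functions. The sketch below recomputes PERCENT_RANK() from RANK() and COUNT(*), and CUME_DIST() from a running COUNT(*) that relies on the default frame including peers. It assumes Python's sqlite3 module as a stand-in for MySQL 8.0 (SQLite 3.25 or later); the window names w and p are illustrative, and the table is t1 without amount and with the times shortened to dates.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL 8.0 (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, name TEXT, time TEXT)")
conn.executemany("INSERT INTO t1 VALUES (?, ?, ?)", [
    (1, "a1", "2019-01-01"), (2, "a1", "2019-01-02"), (3, "a1", "2019-01-02"),
    (4, "a1", "2019-01-03"), (5, "a1", "2019-01-04"),
    (6, "a2", "2019-01-05"), (7, "a2", "2019-01-06"), (8, "a2", "2019-01-07"),
    (9, "a2", "2019-01-07"), (10, "a2", "2019-01-07")])

rows = conn.execute("""
    SELECT id,
           PERCENT_RANK() OVER w AS pr,
           -- (rank - 1) / (rows in partition - 1);
           -- a one-row partition would divide by zero, which t1 never hits
           ((RANK() OVER w) - 1) * 1.0 / ((COUNT(*) OVER p) - 1) AS pr_manual,
           CUME_DIST() OVER w AS cd,
           -- rows at or before the current row, peers included / rows in partition
           (COUNT(*) OVER w) * 1.0 / (COUNT(*) OVER p) AS cd_manual
    FROM t1
    WINDOW w AS (PARTITION BY name ORDER BY time DESC),
           p AS (PARTITION BY name)
    ORDER BY name, time DESC, id
""").fetchall()

for row in rows:
    print(row)
```

For the a1 partition the pr column comes out as 0, 0.25, 0.5, 0.5, 1 and the cd column as 0.2, 0.4, 0.8, 0.8, 1, the same values as in the outputs above.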
MySQL [test]> select RANK() over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------+----+------+--------+---------------------+
| 1 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 3 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 3 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 5 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 1 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 1 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 1 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 4 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 5 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#Rows with equal sort values get the same rank, and the ranks that follow skip ahead (3, 3, 5 in partition a1).
MySQL [test]> select ROW_NUMBER() over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------+----+------+--------+---------------------+
| 1 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 3 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 4 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 5 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 1 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 2 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 3 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 4 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 5 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
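#Numbers the rows of each partition sequentially in sort order. Tied rows still get distinct numbers, in no guaranteed order among themselves.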
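The three numbering functions differ only in how they treat ties, which is easiest to see side by side. This is a minimal sketch that assumes Python's sqlite3 module as a stand-in for MySQL 8.0 (SQLite 3.25 or later) and loads only the a1 partition of t1, with the times shortened to dates. The extra id in the ROW_NUMBER() ordering is an addition made here so that the numbering of tied rows is deterministic.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL 8.0 (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, name TEXT, time TEXT)")
conn.executemany("INSERT INTO t1 VALUES (?, ?, ?)", [
    (1, "a1", "2019-01-01"), (2, "a1", "2019-01-02"), (3, "a1", "2019-01-02"),
    (4, "a1", "2019-01-03"), (5, "a1", "2019-01-04")])

rows = conn.execute("""
    SELECT id,
           -- id breaks the tie between ids 2 and 3
           ROW_NUMBER() OVER (PARTITION BY name ORDER BY time DESC, id) AS row_num,
           RANK()       OVER w AS rnk,
           DENSE_RANK() OVER w AS dense_rnk
    FROM t1
    WINDOW w AS (PARTITION BY name ORDER BY time DESC)
    ORDER BY row_num
""").fetchall()

for row in rows:
    print(row)
# (5, 1, 1, 1)
# (4, 2, 2, 2)
# (2, 3, 3, 3)
# (3, 4, 3, 3)
# (1, 5, 5, 4)
```

After the tie on 2019-01-02, ROW_NUMBER() keeps counting, RANK() jumps from 3 to 5, and DENSE_RANK() continues with 4.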
4. Summary
MySQL 8.0 adds window functions, which make many SQL queries easier to write and are one of the highlights of the release.