MySQL8.0之窗口函數

一、窗口函數簡介

1.1 什麼是窗口函數

  MySQL從8.0開始支持窗口函數,這個功能在大多數據庫中早已支持,有的也叫分析函數。那麼什麼是窗口呢?
  窗口的概念非常重要,它可以理解爲記錄集合,窗口函數也就是在滿足某種條件的記錄集合上執行的特殊函數。對於每條記錄都要在此窗口內執行函數,有的函數隨着記錄不同,窗口大小都是固定的,這種屬於靜態窗口;有的函數則相反,不同的記錄對應着不同的窗口,這種動態變化的窗口叫滑動窗口。簡單的說窗口函數就是對於查詢的每一行,都使用與該行相關的行進行計算。
  窗口函數和普通聚合函數很容易混淆,二者區別如下:

  • 聚合函數是將多條記錄聚合爲一條;而窗口函數是每條記錄都會執行,有幾條記錄執行完還是幾條。
  • 聚合函數也可以用於窗口函數中。

1.2 窗口函數功能

名稱 描述
CUME_DIST() 計算一組值中一個值的累積分佈
DENSE_RANK() 根據該ORDER BY子句爲分區中的每一行分配一個等級。它將相同的等級分配給具有相等值的行。如果兩行或更多行具有相同的排名,則排名值序列中將沒有間隙
FIRST_VALUE() 返回相對於窗口框架第一行的指定表達式的值
LAG() 返回分區中當前行之前的第N行的值。如果不存在前一行,則返回NULL
LAST_VALUE() 返回相對於窗口框架中最後一行的指定表達式的值
LEAD() 返回分區中當前行之後的第N行的值。如果不存在後續行,則返回NULL
NTH_VALUE() 從窗口框架的第N行返回參數的值
NTILE() 將每個窗口分區的行分配到指定數量的排名組中
PERCENT_RANK() 計算分區或結果集中行的百分數等級
RANK() 與DENSE_RANK()函數相似,不同之處在於當兩行或更多行具有相同的等級時,等級值序列中存在間隙
ROW_NUMBER() 爲分區中的每一行分配一個順序整數

  將上述函數按照功能劃分,可以把MySQL支持的窗口函數分爲如下幾類:

  • 序號函數:ROW_NUMBER()、RANK()、DENSE_RANK()
  • 分佈函數:PERCENT_RANK()、CUME_DIST()
  • 前後函數:LAG()、LEAD()
  • 頭尾函數:FIRST_VALUE()、LAST_VALUE()
  • 其他函數:NTH_VALUE()、NTILE()

二、窗口函數語法

  窗口函數的相關語法是:

[WINDOW window_name AS (window_spec)
    [, window_name AS (window_spec)] ...]

window_function_name(window_name/expression) 
    OVER (
        [partition_defintion]
        [order_definition]
        [frame_definition]
    )

  先指定作爲窗口函數的函數名,後面跟一個表達式,然後是OVER(…),就算OVER裏面沒有內容,括號也需要保留。
  窗口函數的一個概念是當前行,當前行屬於某個窗口,窗口由“[partition_defintion]”,“[order_definition]”,“[frame_definition]“確定。
  window_name:給窗口指定一個別名,如果SQL中涉及的窗口較多,採用別名可以看起來更清晰易讀。
  partition_defintion:窗口按照指定字段進行分區,兩個分區由分區邊界分隔,窗口功能在分區內執行,並在跨越分區邊界時重新初始化。
  order_definition:按照指定字段進行排序,窗口函數將按照排序後的記錄順序進行編號。可以和partition子句配合使用,也可以單獨使用。
  frame子句:frame是當前分區的一個子集,在分區裏面再進一步細分窗口,子句用來定義子集的規則,通常用來作爲滑動窗口使用。
  具體語法如下:

frame_unit {<frame_start>|<frame_between>}

  frame_unit有兩種,分別是ROWS和RANGE,由ROWS定義的frame是由開始和結束位置的行確定的,由RANGE定義的frame由在某個值區間的行確定。
_

  • 基於行:
      通常使用BETWEEN frame_start AND frame_end語法來表示行範圍,frame_start和frame_end可以支持如下關鍵字,來確定不同的動態行記錄:
CURRENT ROW 邊界是當前行,一般和其他範圍關鍵字一起使用
UNBOUNDED PRECEDING 邊界是分區中的第一行
UNBOUNDED FOLLOWING 邊界是分區中的最後一行
expr PRECEDING  當前行之前的expr(數字或表達式)行
expr FOLLOWING  當前行之後的expr(數字或表達式)行
比如,下面都是合法的範圍:
rows BETWEEN 1 PRECEDING AND 1 FOLLOWING 窗口範圍是當前行、前一行、後一行一共三行記錄。
rows  UNBOUNDED FOLLOWING 窗口範圍是當前行到分區中的最後一行
rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING 窗口範圍是當前分區中所有行,等同於不寫。
  • 基於範圍:
      和基於行類似,但有些範圍不是直接可以用行數來表示的,比如希望窗口範圍是一週前的訂單開始,截止到當前行,則無法使用rows來直接表示,此時就可以使用範圍來表示窗口:INTERVAL 7 DAY PRECEDING。Linux中常見的最近1分鐘、5分鐘負載是一個典型的應用場景。
      如果未frame_definition在OVER子句中指定,則MySQL默認使用以下框架:
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

三、窗口函數示例

MySQL [test]> select * from t1;
+----+------+--------+---------------------+
| id | name | amount | time                |
+----+------+--------+---------------------+
|  1 | a1   |    100 | 2019-01-01 00:00:00 |
|  2 | a1   |    200 | 2019-01-02 00:00:00 |
|  3 | a1   |    300 | 2019-01-02 00:00:00 |
|  4 | a1   |    300 | 2019-01-03 00:00:00 |
|  5 | a1   |    300 | 2019-01-04 00:00:00 |
|  6 | a2   |    500 | 2019-01-05 00:00:00 |
|  7 | a2   |    600 | 2019-01-06 00:00:00 |
|  8 | a2   |    600 | 2019-01-07 00:00:00 |
|  9 | a2   |    600 | 2019-01-07 00:00:00 |
| 10 | a2   |    600 | 2019-01-07 00:00:00 |
+----+------+--------+---------------------+
10 rows in set (0.00 sec)

MySQL [test]> select id,name,amount,time,avg(amount) over w as avg_sum from t1 window w as (partition by name order by time desc rows BETWEEN 1 PRECEDING AND 1 FOLLOWING);
+----+------+--------+---------------------+----------+
| id | name | amount | time                | avg_sum  |
+----+------+--------+---------------------+----------+
|  5 | a1   |    300 | 2019-01-04 00:00:00 | 300.0000 |
|  4 | a1   |    300 | 2019-01-03 00:00:00 | 266.6667 |
|  2 | a1   |    200 | 2019-01-02 00:00:00 | 266.6667 |
|  3 | a1   |    300 | 2019-01-02 00:00:00 | 200.0000 |
|  1 | a1   |    100 | 2019-01-01 00:00:00 | 200.0000 |
|  8 | a2   |    600 | 2019-01-07 00:00:00 | 600.0000 |
|  9 | a2   |    600 | 2019-01-07 00:00:00 | 600.0000 |
| 10 | a2   |    600 | 2019-01-07 00:00:00 | 600.0000 |
|  7 | a2   |    600 | 2019-01-06 00:00:00 | 566.6667 |
|  6 | a2   |    500 | 2019-01-05 00:00:00 | 550.0000 |
+----+------+--------+---------------------+----------+
10 rows in set (0.00 sec)
#從結果可以看出,id爲5的記錄屬於邊界值,沒有前一行,因此avg_sum爲(300+300)/2=300;id爲4的記錄前後都有記錄,所以avg_sum爲(300+300+200)/3=266.6667,以此類推就可以得到一個基於滑動窗口的動態平均值。此例中,窗口函數用到了傳統的聚合函數avg(),用來計算動態的平均值。
MySQL [test]> select CUME_DIST() over (partition by name order by time desc ) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time                |
+---------+----+------+--------+---------------------+
|     0.2 |  5 | a1   |    300 | 2019-01-04 00:00:00 |
|     0.4 |  4 | a1   |    300 | 2019-01-03 00:00:00 |
|     0.8 |  2 | a1   |    200 | 2019-01-02 00:00:00 |
|     0.8 |  3 | a1   |    300 | 2019-01-02 00:00:00 |
|       1 |  1 | a1   |    100 | 2019-01-01 00:00:00 |
|     0.6 |  8 | a2   |    600 | 2019-01-07 00:00:00 |
|     0.6 |  9 | a2   |    600 | 2019-01-07 00:00:00 |
|     0.6 | 10 | a2   |    600 | 2019-01-07 00:00:00 |
|     0.8 |  7 | a2   |    600 | 2019-01-06 00:00:00 |
|       1 |  6 | a2   |    500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#在某種排序條件下,小於等於當前行值的行數/總行數,得到的是數據在某一個緯度的分佈百分比情況
MySQL [test]> select DENSE_RANK() over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time                |
+---------+----+------+--------+---------------------+
|       1 |  5 | a1   |    300 | 2019-01-04 00:00:00 |
|       2 |  4 | a1   |    300 | 2019-01-03 00:00:00 |
|       3 |  2 | a1   |    200 | 2019-01-02 00:00:00 |
|       3 |  3 | a1   |    300 | 2019-01-02 00:00:00 |
|       4 |  1 | a1   |    100 | 2019-01-01 00:00:00 |
|       1 |  8 | a2   |    600 | 2019-01-07 00:00:00 |
|       1 |  9 | a2   |    600 | 2019-01-07 00:00:00 |
|       1 | 10 | a2   |    600 | 2019-01-07 00:00:00 |
|       2 |  7 | a2   |    600 | 2019-01-06 00:00:00 |
|       3 |  6 | a2   |    500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.01 sec)
#dense_rank()的出現是爲了解決rank()編號存在的問題的,rank()編號的時候存在跳號的問題,如果有兩個並列第1,那麼下一個名次的編號就是3,結果就是沒有編號爲2的數據。如果不想跳號,可以使用dense_rank()替代
MySQL [test]> select FIRST_VALUE(time) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun             | id | name | amount | time                |
+---------------------+----+------+--------+---------------------+
| 2019-01-04 00:00:00 |  5 | a1   |    300 | 2019-01-04 00:00:00 |
| 2019-01-04 00:00:00 |  4 | a1   |    300 | 2019-01-03 00:00:00 |
| 2019-01-04 00:00:00 |  2 | a1   |    200 | 2019-01-02 00:00:00 |
| 2019-01-04 00:00:00 |  3 | a1   |    300 | 2019-01-02 00:00:00 |
| 2019-01-04 00:00:00 |  1 | a1   |    100 | 2019-01-01 00:00:00 |
| 2019-01-07 00:00:00 |  8 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 |  9 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 10 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 |  7 | a2   |    600 | 2019-01-06 00:00:00 |
| 2019-01-07 00:00:00 |  6 | a2   |    500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#first_value就是取某一組數據,按照某種方式排序的,最早的一個字段的值。
MySQL [test]> select LAG(time,1) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun             | id | name | amount | time                |
+---------------------+----+------+--------+---------------------+
| NULL                |  5 | a1   |    300 | 2019-01-04 00:00:00 |
| 2019-01-04 00:00:00 |  4 | a1   |    300 | 2019-01-03 00:00:00 |
| 2019-01-03 00:00:00 |  2 | a1   |    200 | 2019-01-02 00:00:00 |
| 2019-01-02 00:00:00 |  3 | a1   |    300 | 2019-01-02 00:00:00 |
| 2019-01-02 00:00:00 |  1 | a1   |    100 | 2019-01-01 00:00:00 |
| NULL                |  8 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 |  9 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 10 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 |  7 | a2   |    600 | 2019-01-06 00:00:00 |
| 2019-01-06 00:00:00 |  6 | a2   |    500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#lag(column,n)獲取當前數據行按照某種排序規則的上n行數據的某個字段
MySQL [test]> select LAST_VALUE(time) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun             | id | name | amount | time                |
+---------------------+----+------+--------+---------------------+
| 2019-01-04 00:00:00 |  5 | a1   |    300 | 2019-01-04 00:00:00 |
| 2019-01-03 00:00:00 |  4 | a1   |    300 | 2019-01-03 00:00:00 |
| 2019-01-02 00:00:00 |  2 | a1   |    200 | 2019-01-02 00:00:00 |
| 2019-01-02 00:00:00 |  3 | a1   |    300 | 2019-01-02 00:00:00 |
| 2019-01-01 00:00:00 |  1 | a1   |    100 | 2019-01-01 00:00:00 |
| 2019-01-07 00:00:00 |  8 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 |  9 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 10 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-06 00:00:00 |  7 | a2   |    600 | 2019-01-06 00:00:00 |
| 2019-01-05 00:00:00 |  6 | a2   |    500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#last_value就是取某一組數據,按照某種方式排序的,最新的一個字段的值。
MySQL [test]> select LEAD(time,1) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun             | id | name | amount | time                |
+---------------------+----+------+--------+---------------------+
| 2019-01-03 00:00:00 |  5 | a1   |    300 | 2019-01-04 00:00:00 |
| 2019-01-02 00:00:00 |  4 | a1   |    300 | 2019-01-03 00:00:00 |
| 2019-01-02 00:00:00 |  2 | a1   |    200 | 2019-01-02 00:00:00 |
| 2019-01-01 00:00:00 |  3 | a1   |    300 | 2019-01-02 00:00:00 |
| NULL                |  1 | a1   |    100 | 2019-01-01 00:00:00 |
| 2019-01-07 00:00:00 |  8 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 |  9 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-06 00:00:00 | 10 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-05 00:00:00 |  7 | a2   |    600 | 2019-01-06 00:00:00 |
| NULL                |  6 | a2   |    500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#lead(column,n)獲取當前數據行按照某種排序規則的下n行數據的某個字段
MySQL [test]> select NTH_VALUE(time,2) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun             | id | name | amount | time                |
+---------------------+----+------+--------+---------------------+
| NULL                |  5 | a1   |    300 | 2019-01-04 00:00:00 |
| 2019-01-03 00:00:00 |  4 | a1   |    300 | 2019-01-03 00:00:00 |
| 2019-01-03 00:00:00 |  2 | a1   |    200 | 2019-01-02 00:00:00 |
| 2019-01-03 00:00:00 |  3 | a1   |    300 | 2019-01-02 00:00:00 |
| 2019-01-03 00:00:00 |  1 | a1   |    100 | 2019-01-01 00:00:00 |
| 2019-01-07 00:00:00 |  8 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 |  9 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 10 | a2   |    600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 |  7 | a2   |    600 | 2019-01-06 00:00:00 |
| 2019-01-07 00:00:00 |  6 | a2   |    500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#從排序的第n行還是返回nth_value字段中的值
MySQL [test]> select NTILE(2) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time                |
+---------+----+------+--------+---------------------+
|       1 |  5 | a1   |    300 | 2019-01-04 00:00:00 |
|       1 |  4 | a1   |    300 | 2019-01-03 00:00:00 |
|       1 |  2 | a1   |    200 | 2019-01-02 00:00:00 |
|       2 |  3 | a1   |    300 | 2019-01-02 00:00:00 |
|       2 |  1 | a1   |    100 | 2019-01-01 00:00:00 |
|       1 |  8 | a2   |    600 | 2019-01-07 00:00:00 |
|       1 |  9 | a2   |    600 | 2019-01-07 00:00:00 |
|       1 | 10 | a2   |    600 | 2019-01-07 00:00:00 |
|       2 |  7 | a2   |    600 | 2019-01-06 00:00:00 |
|       2 |  6 | a2   |    500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#按照某列的倒序排列,將字段分成N組,可以得到哪個數據在N組中哪一部分
MySQL [test]> select PERCENT_RANK() over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time                |
+---------+----+------+--------+---------------------+
|       0 |  5 | a1   |    300 | 2019-01-04 00:00:00 |
|    0.25 |  4 | a1   |    300 | 2019-01-03 00:00:00 |
|     0.5 |  2 | a1   |    200 | 2019-01-02 00:00:00 |
|     0.5 |  3 | a1   |    300 | 2019-01-02 00:00:00 |
|       1 |  1 | a1   |    100 | 2019-01-01 00:00:00 |
|       0 |  8 | a2   |    600 | 2019-01-07 00:00:00 |
|       0 |  9 | a2   |    600 | 2019-01-07 00:00:00 |
|       0 | 10 | a2   |    600 | 2019-01-07 00:00:00 |
|    0.75 |  7 | a2   |    600 | 2019-01-06 00:00:00 |
|       1 |  6 | a2   |    500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#數據分佈的計算方式:當前RANK值-1/總行數-1 
MySQL [test]> select RANK() over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time                |
+---------+----+------+--------+---------------------+
|       1 |  5 | a1   |    300 | 2019-01-04 00:00:00 |
|       2 |  4 | a1   |    300 | 2019-01-03 00:00:00 |
|       3 |  2 | a1   |    200 | 2019-01-02 00:00:00 |
|       3 |  3 | a1   |    300 | 2019-01-02 00:00:00 |
|       5 |  1 | a1   |    100 | 2019-01-01 00:00:00 |
|       1 |  8 | a2   |    600 | 2019-01-07 00:00:00 |
|       1 |  9 | a2   |    600 | 2019-01-07 00:00:00 |
|       1 | 10 | a2   |    600 | 2019-01-07 00:00:00 |
|       4 |  7 | a2   |    600 | 2019-01-06 00:00:00 |
|       5 |  6 | a2   |    500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#排序條件一樣的情況下,其編號也一樣。
MySQL [test]> select ROW_NUMBER() over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time                |
+---------+----+------+--------+---------------------+
|       1 |  5 | a1   |    300 | 2019-01-04 00:00:00 |
|       2 |  4 | a1   |    300 | 2019-01-03 00:00:00 |
|       3 |  2 | a1   |    200 | 2019-01-02 00:00:00 |
|       4 |  3 | a1   |    300 | 2019-01-02 00:00:00 |
|       5 |  1 | a1   |    100 | 2019-01-01 00:00:00 |
|       1 |  8 | a2   |    600 | 2019-01-07 00:00:00 |
|       2 |  9 | a2   |    600 | 2019-01-07 00:00:00 |
|       3 | 10 | a2   |    600 | 2019-01-07 00:00:00 |
|       4 |  7 | a2   |    600 | 2019-01-06 00:00:00 |
|       5 |  6 | a2   |    500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#對排序結果編號

四、總結

  MySQL8.0中加入了窗口函數的功能,這一點方便了SQL的編寫,可以說是MySQL8.0的亮點之一。

發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章