A Brief Look at Common Hive Window Functions


Contents

A Brief Look at Common Hive Window Functions

Introduction

Common window functions

OVER

SUM, AVG, MIN, MAX

NTILE

ROW_NUMBER

RANK & DENSE_RANK

CUME_DIST & PERCENT_RANK

LAG

LEAD

FIRST_VALUE & LAST_VALUE


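Introduction

Window functions, also known as windowing or analytic functions, are a powerful tool for complex reporting and statistics requirements. A window function computes an aggregate value over a group of rows, but unlike an ordinary aggregate function it returns a value for every row in the group, whereas an aggregate function returns only one row per group.

The OVER clause specifies the data window that the analytic function operates on, and that window may change from one row to the next.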

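Common window functions

OVER

  • OVER() is usually combined with aggregate functions such as count(), sum(), min(), max() and avg().
  • OVER() carries window (frame) semantics, for example OVER(ROWS BETWEEN <start> AND <end>), where <start> is UNBOUNDED PRECEDING, n PRECEDING or CURRENT ROW, and <end> is CURRENT ROW, n FOLLOWING or UNBOUNDED FOLLOWING. These frame definitions are usually combined with aggregate functions (sum, min, max); sequence functions such as row_number and rank cannot be used with them.
  • Used on its own, OVER() treats the full data set as a single window. To split the window by the distinct values of a column, add a PARTITION BY clause inside OVER().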
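To make the difference between OVER() and OVER(PARTITION BY ...) concrete, here is a minimal pure-Python sketch of the semantics. It is not how Hive executes the query, and the `rows` list is hypothetical sample data rather than the contents of any table in this post.

```python
from collections import Counter

# Hypothetical rows (orderid, status); not taken from t_dw_orders_his.
rows = [(1, "paid"), (2, "created"), (2, "done"), (3, "paid"), (3, "paid")]

# COUNT(*) OVER (): a single window covering every row,
# so the same total is attached to each detail row.
total = len(rows)
count_over_all = [(oid, status, total) for oid, status in rows]

# COUNT(*) OVER (PARTITION BY orderid): one window per distinct orderid,
# so each row gets the size of its own partition.
per_order = Counter(oid for oid, _ in rows)
count_over_partition = [(oid, status, per_order[oid]) for oid, status in rows]

print(count_over_all)
print(count_over_partition)
```

In both cases the number of output rows equals the number of input rows, which is exactly what separates a window function from GROUP BY.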

Selecting detail columns together with a plain count(*) fails with an error, but the same query runs once a window is added:

select *,count(*)  from t_dw_orders_his
-------------------------------------------------------------------------------------------
Error while compiling statement: FAILED: SemanticException [Error 10025]: Expression not in GROUP BY key orderid

select *,count(*) over() from t_dw_orders_his  where p_event_date="2015-08-22"
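-- runs successfully; the last column is the windowed count (status values: 支付 = paid, 創建 = created, 完成 = completed)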
10	2015-08-22	2015-08-22	支付	2015-08-22	9999-12-31	2015-08-22	17
9	2015-08-22	2015-08-22	創建	2015-08-22	9999-12-31	2015-08-22	17
8	2015-08-21	2015-08-22	支付	2015-08-22	9999-12-31	2015-08-22	17
8	2015-08-21	2015-08-21	創建	2015-08-21	2015-08-21	2015-08-22	17
7	2015-08-20	2015-08-21	支付	2015-08-21	9999-12-31	2015-08-22	17
7	2015-08-20	2015-08-21	支付	2015-08-20	2015-08-20	2015-08-22	17
6	2015-08-20	2015-08-22	支付	2015-08-22	9999-12-31	2015-08-22	17
6	2015-08-20	2015-08-20	創建	2015-08-20	2015-08-21	2015-08-22	17
5	2015-08-19	2015-08-20	支付	2015-08-19	9999-12-31	2015-08-22	17
4	2015-08-19	2015-08-21	完成	2015-08-21	9999-12-31	2015-08-22	17
4	2015-08-19	2015-08-21	完成	2015-08-19	2015-08-20	2015-08-22	17
3	2015-08-19	2015-08-21	支付	2015-08-21	9999-12-31	2015-08-22	17
3	2015-08-19	2015-08-21	支付	2015-08-19	2015-08-20	2015-08-22	17
2	2015-08-18	2015-08-22	完成	2015-08-22	9999-12-31	2015-08-22	17
2	2015-08-18	2015-08-18	創建	2015-08-18	2015-08-21	2015-08-22	17
1	2015-08-18	2015-08-22	支付	2015-08-22	9999-12-31	2015-08-22	17

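SUM, AVG, MIN, MAX

These aggregate functions are all used in the same way, so we take SUM as the example to summarise how it combines with the OVER window clause.

Prepare the data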

CREATE TABLE orders1(
  `orderid` int, 
  `createtime` string, 
  `money` int)
-----------------------
SELECT * FROM orders1
-----------------------
1	2015-08-18	72
1	2015-08-19	19
1	2015-08-20	67
1	2015-08-21	78
1	2015-08-22	62
1	2015-08-23	62

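The effect of the various OVER arguments is shown below.

SELECT orderid,
       createtime,
       money,
       -- whole result set as a single window
       SUM(money) OVER() AS money1,
       -- no explicit frame: partition start up to the current row
       SUM(money) OVER(PARTITION BY orderid ORDER BY createtime ASC) AS money2,
       -- explicit running total by physical rows
       SUM(money) OVER(PARTITION BY orderid ORDER BY createtime ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS money3,
       -- current row plus the 3 rows before it
       SUM(money) OVER(PARTITION BY orderid ORDER BY createtime ASC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS money4,
       -- 3 rows before through 1 row after the current row
       SUM(money) OVER(PARTITION BY orderid ORDER BY createtime ASC ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS money5,
       -- current row through the end of the partition
       SUM(money) OVER(PARTITION BY orderid ORDER BY createtime ASC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS money6
FROM orders1;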

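The result is shown in the figure.

Summary

PRECEDING: counts rows backwards from the current row (n PRECEDING = n rows before)
FOLLOWING: counts rows forwards from the current row (n FOLLOWING = n rows after)
CURRENT ROW: the current row
UNBOUNDED: the partition boundary. UNBOUNDED PRECEDING means starting from the first row of the partition; UNBOUNDED FOLLOWING means running through to the last row of the partition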
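The frame keywords above can be checked by hand with a small pure-Python sketch. It only mimics the ROWS frame arithmetic for a single, already ordered partition; the helper name `rows_frame_sum` and its offset convention are assumptions made for this illustration, not part of Hive.

```python
from typing import List, Optional


def rows_frame_sum(values: List[int], start: Optional[int], end: Optional[int]) -> List[int]:
    """SUM over a ROWS frame for each row of one ordered partition.

    start/end are offsets relative to the current row:
    -3 means 3 PRECEDING, 0 means CURRENT ROW, 1 means 1 FOLLOWING,
    and None means UNBOUNDED in that direction.
    """
    n = len(values)
    out = []
    for i in range(n):
        lo = 0 if start is None else max(0, i + start)
        hi = n - 1 if end is None else min(n - 1, i + end)
        out.append(sum(values[lo:hi + 1]))
    return out


# orders1.money for orderid = 1, ordered by createtime
money = [72, 19, 67, 78, 62, 62]

money3 = rows_frame_sum(money, None, 0)   # UNBOUNDED PRECEDING .. CURRENT ROW
money4 = rows_frame_sum(money, -3, 0)     # 3 PRECEDING .. CURRENT ROW
money5 = rows_frame_sum(money, -3, 1)     # 3 PRECEDING .. 1 FOLLOWING
money6 = rows_frame_sum(money, 0, None)   # CURRENT ROW .. UNBOUNDED FOLLOWING

for name, col in [("money3", money3), ("money4", money4),
                  ("money5", money5), ("money6", money6)]:
    print(name, col)
```

For this six-row sample, money3 is a plain running total ending at 360, and money6 is the same total counted down from the first row.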

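Pay special attention: if createtime contains duplicate values, the demo above can produce surprising results. When ORDER BY is given without an explicit frame, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows with the same createtime are peers and all receive the same money2, while the ROWS-based columns still advance one physical row at a time. Consider the following data: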

SELECT * FROM orders1
1	2015-08-18	72
1	2015-08-18	72
1	2015-08-19	78
1	2015-08-19	19
1	2015-08-19	72
1	2015-08-20	62
1	2015-08-20	62
1	2015-08-21	67
1	2015-08-22	78
1	2015-08-22	24
1	2015-08-23	67
1	2015-08-23	19
1	2015-08-23	19
Run the same SQL again:
SELECT orderid,
       createtime,
       money,
       SUM(money) OVER() AS money1,
       SUM(money) OVER(PARTITION BY orderid ORDER BY createtime ASC) AS money2,
       SUM(money) OVER(PARTITION BY orderid ORDER BY createtime ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS money3,
       SUM(money) OVER(PARTITION BY orderid ORDER BY createtime ASC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS money4,
       SUM(money) OVER(PARTITION BY orderid ORDER BY createtime ASC ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS money5,
       SUM(money) OVER(PARTITION BY orderid ORDER BY createtime ASC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS money6
FROM orders1;

The result is as follows:
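As a rough check of the duplicate-createtime case, the following pure-Python sketch contrasts a running total over a RANGE frame (money2) with one over a ROWS frame (money3). It assumes ties are processed in the order listed in the data; Hive does not guarantee any particular order among rows with equal createtime, so the per-row ROWS values may differ in a real run.

```python
from itertools import groupby

# (createtime, money) for orderid = 1, already ordered by createtime
rows = [
    ("2015-08-18", 72), ("2015-08-18", 72),
    ("2015-08-19", 78), ("2015-08-19", 19), ("2015-08-19", 72),
    ("2015-08-20", 62), ("2015-08-20", 62),
    ("2015-08-21", 67),
    ("2015-08-22", 78), ("2015-08-22", 24),
    ("2015-08-23", 67), ("2015-08-23", 19), ("2015-08-23", 19),
]

# ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:
# the total grows with every physical row.
rows_sum = []
acc = 0
for _, money in rows:
    acc += money
    rows_sum.append(acc)

# RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (the default with ORDER BY):
# all rows sharing a createtime are peers and see the same total.
range_sum = []
acc = 0
for _, group in groupby(rows, key=lambda r: r[0]):
    peers = list(group)
    acc += sum(money for _, money in peers)
    range_sum.extend([acc] * len(peers))

for (day, money), r, g in zip(rows, rows_sum, range_sum):
    print(day, money, "ROWS:", r, "RANGE:", g)
```

The two columns agree only on the last row of each createtime group, which is the effect that tends to surprise people.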

NTILE

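NTILE(n) is a bucketing function: it splits the ordered rows of a partition into n buckets and returns the bucket number of the current row. If the rows cannot be divided evenly, the extra rows go to the earliest buckets, one per bucket.

Prepare the data as follows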

CREATE TABLE orders1(
  `orderid` int, 
  `createtime` string, 
  `money` int)
-----------------------
SELECT * FROM orders1
-----------------------
1	2015-08-18	72
1	2015-08-19	19
1	2015-08-20	67
1	2015-08-21	78
1	2015-08-22	62
1	2015-08-23	62

Run the following SQL

SELECT 
orderid,
createtime,
money,
NTILE(2) OVER(PARTITION BY orderid ORDER BY createtime) AS rn1,
NTILE(3) OVER(PARTITION BY orderid ORDER BY createtime) AS rn2,
NTILE(4) OVER(ORDER BY createtime) AS rn3
FROM orders1 

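In the result, note the 4-bucket case: there are 6 rows in total, which is not enough for 2 rows per bucket when split 4 ways, so buckets 1 and 2 get 2 rows each while buckets 3 and 4 get 1 row each.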
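The bucket sizes can be worked out with a few lines of Python. This is only a sketch of the distribution rule described above; the `ntile` helper is a hypothetical name that takes the bucket count and the number of ordered rows.

```python
from typing import List


def ntile(n_buckets: int, n_rows: int) -> List[int]:
    """Bucket number for each of n_rows ordered rows, NTILE-style."""
    base, extra = divmod(n_rows, n_buckets)
    out = []
    for bucket in range(1, n_buckets + 1):
        # The first `extra` buckets each take one additional row.
        size = base + (1 if bucket <= extra else 0)
        out.extend([bucket] * size)
    return out


# Six rows, as in orders1
print(ntile(2, 6))
print(ntile(3, 6))
print(ntile(4, 6))
```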

ROW_NUMBER

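ROW_NUMBER() generates a sequence number for the rows within each partition, starting from 1 and following the ORDER BY. It is one of the most frequently used window functions and applies to a very wide range of scenarios, such as computing daily or monthly active users (often combined with a WHERE rn = 1 filter to keep one row per group).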

SELECT 
orderid,
createtime,
money,
row_number() OVER(PARTITION BY orderid ORDER BY createtime) AS rn
FROM orders1 
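The "rn = 1" pattern mentioned above can be sketched in pure Python as follows. The `events` list is hypothetical data invented for the illustration; the sketch numbers rows within each user ordered by date and then keeps only the first row per user.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical login events (user, login_date); not from the tables in this post.
events = [
    ("u1", "2015-08-19"),
    ("u2", "2015-08-18"),
    ("u1", "2015-08-18"),
    ("u2", "2015-08-20"),
]

# ROW_NUMBER() OVER (PARTITION BY user ORDER BY login_date)
ordered = sorted(events, key=itemgetter(0, 1))
numbered = []
for user, group in groupby(ordered, key=itemgetter(0)):
    for rn, (_, day) in enumerate(group, start=1):
        numbered.append((user, day, rn))

# WHERE rn = 1: keep the earliest login per user
first_login = [(user, day) for user, day, rn in numbered if rn == 1]

print(numbered)
print(first_login)
```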

RANK & DENSE_RANK

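-- RANK() returns the rank of each row within its partition; tied rows share a rank and leave a gap after it, so the numbers are not consecutive
-- DENSE_RANK() returns the rank of each row within its partition; tied rows share a rank but leave no gap, so the numbers stay consecutive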

SELECT 
orderid,
createtime,
money,
rank() OVER(PARTITION BY orderid ORDER BY money) AS rn1,
dense_rank() OVER(PARTITION BY orderid ORDER BY money) AS rn2
FROM orders1 
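To see the gap behaviour on the sample data, here is a small pure-Python sketch of the two ranking rules. The helper names `rank` and `dense_rank` are chosen for the illustration and operate on one partition that is already sorted by the ORDER BY column.

```python
from typing import List


def rank(sorted_values: List[int]) -> List[int]:
    """RANK(): ties share a rank, the next rank skips ahead."""
    out: List[int] = []
    for i, value in enumerate(sorted_values):
        if i > 0 and value == sorted_values[i - 1]:
            out.append(out[-1])
        else:
            out.append(i + 1)
    return out


def dense_rank(sorted_values: List[int]) -> List[int]:
    """DENSE_RANK(): ties share a rank, the next rank is previous + 1."""
    out: List[int] = []
    for i, value in enumerate(sorted_values):
        if i == 0:
            out.append(1)
        elif value == sorted_values[i - 1]:
            out.append(out[-1])
        else:
            out.append(out[-1] + 1)
    return out


# orders1.money ordered ascending (two rows tie at 62)
money = sorted([72, 19, 67, 78, 62, 62])
print(money)
print(rank(money))
print(dense_rank(money))
```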

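CUME_DIST & PERCENT_RANK

-- CUME_DIST: (number of rows with a value less than or equal to the current value) / (total rows in the partition)
-- For example, the share of all employees whose salary is less than or equal to the current salary

-- PERCENT_RANK: (RANK() of the current row within the partition - 1) / (total rows in the partition - 1)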

SELECT 
orderid,
createtime,
money,
cume_dist() OVER(PARTITION BY orderid ORDER BY money) AS rn1,
percent_rank() OVER(PARTITION BY orderid ORDER BY money) AS rn2,
rank() OVER(PARTITION BY orderid ORDER BY money) AS rn3
FROM orders1 
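Both formulas are easy to verify by hand. The sketch below applies them to the sample money values in pure Python; it assumes ascending order and uses `bisect` on the sorted list to count rows, which is a convenience of the illustration rather than anything Hive does.

```python
from bisect import bisect_left, bisect_right

# orders1.money ordered ascending
money = sorted([72, 19, 67, 78, 62, 62])
n = len(money)

# CUME_DIST: rows with value <= current value, divided by total rows
cume_dist = [bisect_right(money, v) / n for v in money]

# PERCENT_RANK: (rank - 1) / (n - 1); rank - 1 equals the number of
# rows strictly less than the current value. A single-row partition gives 0.
percent_rank = [bisect_left(money, v) / (n - 1) if n > 1 else 0.0 for v in money]

for v, cd, pr in zip(money, cume_dist, percent_rank):
    print(v, round(cd, 4), round(pr, 4))
```

The two tied rows (62) share both values, and the last row always has CUME_DIST equal to 1.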

LAG

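LAG(col, n, DEFAULT) returns the value from the row n rows above the current row within the window.
The first argument is the column name, the second is the number of rows to look back (optional, defaults to 1), and the third is the default value, used when there is no row n rows above the current one (if omitted, NULL is returned).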

SELECT 
orderid,
createtime,
money,
lag(money,1,null) OVER(PARTITION BY orderid ORDER BY money) AS rn1,
lag(money,2,22) OVER(PARTITION BY orderid ORDER BY money) AS rn2,
lag(money,3,33) OVER(PARTITION BY orderid ORDER BY money) AS rn3
FROM orders1 
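The three LAG calls above behave like a simple list shift. A minimal pure-Python sketch, assuming one partition already sorted by money and using `None` in place of NULL (the `lag` helper is a hypothetical name):

```python
from typing import List, Optional


def lag(values: List[int], offset: int = 1, default: Optional[int] = None) -> List[Optional[int]]:
    """Value `offset` rows before the current row, or `default` if none exists."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


# orders1.money ordered ascending, as in ORDER BY money
money = sorted([72, 19, 67, 78, 62, 62])

rn1 = lag(money, 1, None)
rn2 = lag(money, 2, 22)
rn3 = lag(money, 3, 33)

print(rn1)
print(rn2)
print(rn3)
```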

LEAD

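The opposite of LAG.
LEAD(col, n, DEFAULT) returns the value from the row n rows below the current row within the window.
The first argument is the column name, the second is the number of rows to look ahead (optional, defaults to 1), and the third is the default value, used when there is no row n rows below the current one (if omitted, NULL is returned).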

SELECT 
orderid,
createtime,
money,
lead(money,1,null) OVER(PARTITION BY orderid ORDER BY money) AS rn1,
lead(money,2,22) OVER(PARTITION BY orderid ORDER BY money) AS rn2,
lead(money,3,33) OVER(PARTITION BY orderid ORDER BY money) AS rn3
FROM orders1 
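LEAD mirrors LAG in the other direction. Again a minimal pure-Python sketch under the same assumptions: one partition sorted by money, `None` standing in for NULL, and a hypothetical helper named `lead`.

```python
from typing import List, Optional


def lead(values: List[int], offset: int = 1, default: Optional[int] = None) -> List[Optional[int]]:
    """Value `offset` rows after the current row, or `default` if none exists."""
    n = len(values)
    return [values[i + offset] if i + offset < n else default
            for i in range(n)]


# orders1.money ordered ascending, as in ORDER BY money
money = sorted([72, 19, 67, 78, 62, 62])

rn1 = lead(money, 1, None)
rn2 = lead(money, 2, 22)
rn3 = lead(money, 3, 33)

print(rn1)
print(rn2)
print(rn3)
```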

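FIRST_VALUE & LAST_VALUE

-- FIRST_VALUE: the first value in the partition, after sorting, up to the current row

-- LAST_VALUE: the last value in the partition, after sorting, up to the current row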

SELECT 
orderid,
createtime,
money,
first_value(money) OVER(PARTITION BY orderid ORDER BY money) AS rn1,
last_value(money) OVER(PARTITION BY orderid ORDER BY money) AS rn2,
first_value(money) OVER(PARTITION BY orderid ORDER BY money desc) AS rn11,
last_value(money) OVER(PARTITION BY orderid ORDER BY money desc) AS rn22
FROM orders1 
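The "up to the current row" wording matters most for LAST_VALUE. The pure-Python sketch below shows that, with the default frame, LAST_VALUE ordered by money just returns the value of the current row (or of its last peer), and contrasts it with a frame that extends to the end of the partition, i.e. ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. The helper names are hypothetical and the code handles a single, already sorted partition.

```python
from typing import List


def first_value(values: List[int]) -> List[int]:
    """FIRST_VALUE: the frame always starts at the first row of the partition."""
    return values[:1] * len(values)


def last_value_default_frame(keys: List[int], values: List[int]) -> List[int]:
    """LAST_VALUE with the default frame: it ends at the current row's last peer."""
    out = []
    for i in range(len(values)):
        j = i
        while j + 1 < len(keys) and keys[j + 1] == keys[i]:
            j += 1
        out.append(values[j])
    return out


def last_value_full_frame(values: List[int]) -> List[int]:
    """LAST_VALUE when the frame runs to UNBOUNDED FOLLOWING."""
    return values[-1:] * len(values)


# orders1.money ordered ascending, as in ORDER BY money
money = sorted([72, 19, 67, 78, 62, 62])

rn1 = first_value(money)
rn2 = last_value_default_frame(money, money)
rn2_full = last_value_full_frame(money)

print(rn1)
print(rn2)
print(rn2_full)
```

If the intent is "the largest money in the partition", the full frame (or FIRST_VALUE with ORDER BY money DESC, as in rn11 above) gives that, while rn2 does not.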
