Implementing a Cumulative Sum in Hive

I. Requirements

  1. Given the following data (columns: domain, time, traffic)
gifshow.com		2019/01/01		5
yy.com			2019/01/01		4
huya.com		2019/01/01		1
gifshow.com		2019/01/20		6
gifshow.com		2019/02/01		8
yy.com			2019/01/20		5
gifshow.com		2019/02/02		7
  2. The required result (columns: domain, month, traffic for that month, cumulative traffic up to that month)
gifshow.com		2019-01		11.0		11.0
gifshow.com		2019-02		15.0		26.0
huya.com		2019-01		1.0			1.0
yy.com			2019-01		9.0			9.0

II. Implementation

(1) Handling the date

  1. Implement a UDF that converts a yyyy/MM/dd date into a yyyy-MM month
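package www.immoc.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

// Converts "yyyy/MM/dd" to "yyyy-MM", e.g. "2019/01/20" -> "2019-01".
public class DateFormat extends UDF {
    public String evaluate(String date) {
        String[] splits = date.split("/");
        return splits[0] + "-" + splits[1];
    }
}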
  2. Package the class into a jar and upload it to HDFS
  3. Register the function in Hive
-- Temporary (current session only)
add jar /home/hadoop/lib/hdfs-train-1.0.jar;
CREATE TEMPORARY FUNCTION date_format_new AS "www.immoc.hive.udf.DateFormat";
-- Permanent
CREATE FUNCTION date_format_new AS "www.immoc.hive.udf.DateFormat" USING JAR 'hdfs://bigdata:9000/lib/hdfs-train-1.0.jar';
  4. Test the UDF by selecting time alongside date_format_new(time) from visits; the output:
time	_c1
2019/01/01	2019-01
2019/01/01	2019-01
2019/01/01	2019-01
2019/01/20	2019-01
2019/02/01	2019-02
2019/01/20	2019-01
2019/02/02	2019-02
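
The conversion can also be checked locally before the jar is built. The following is a minimal sketch with no Hive dependency; the class name MonthKey and the method toMonth are hypothetical and not part of the original UDF. It assumes every input is in yyyy/MM/dd form, and unlike the split-based UDF it rejects malformed dates and passes null through.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for local checks; not the UDF registered in Hive.
public class MonthKey {
    private static final DateTimeFormatter IN = DateTimeFormatter.ofPattern("yyyy/MM/dd");

    // Returns "yyyy-MM" for a "yyyy/MM/dd" input, or null for a null input.
    // Throws DateTimeParseException if the input is not a valid date.
    public static String toMonth(String date) {
        if (date == null) {
            return null;
        }
        LocalDate parsed = LocalDate.parse(date, IN);
        return YearMonth.from(parsed).toString();
    }

    public static void main(String[] args) {
        System.out.println(toMonth("2019/01/20")); // 2019-01
        System.out.println(toMonth("2019/02/02")); // 2019-02
    }
}
```

Zero-padded yyyy-MM strings sort in chronological order, which the string comparison on the month column later in this article relies on.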

(2) Computing the cumulative sum

  1. First, total the traffic of each domain for each month
select domain,date_format_new(time) as date,sum(traffic) as traffic from visits group by domain,date_format_new(time);

# Result
gifshow.com		2019-01		11.0
gifshow.com		2019-02		15.0
huya.com		2019-01		1.0
yy.com			2019-01		9.0
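
The monthly totals above can be reproduced by hand. The sketch below is only an in-memory illustration of what the GROUP BY computes on the sample rows, not how Hive executes it; the class name MonthlyTraffic and the method monthly are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory version of step 1 on the sample data.
public class MonthlyTraffic {
    // Sample rows from the requirement: domain, time (yyyy/MM/dd), traffic.
    static final String[][] VISITS = {
        {"gifshow.com", "2019/01/01", "5"},
        {"yy.com", "2019/01/01", "4"},
        {"huya.com", "2019/01/01", "1"},
        {"gifshow.com", "2019/01/20", "6"},
        {"gifshow.com", "2019/02/01", "8"},
        {"yy.com", "2019/01/20", "5"},
        {"gifshow.com", "2019/02/02", "7"},
    };

    // Groups by (domain, yyyy-MM) and sums traffic; TreeMap keeps the keys sorted.
    static Map<String, Double> monthly() {
        Map<String, Double> sums = new TreeMap<>();
        for (String[] visit : VISITS) {
            String[] parts = visit[1].split("/");
            String key = visit[0] + "\t" + parts[0] + "-" + parts[1];
            sums.merge(key, Double.parseDouble(visit[2]), Double::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        monthly().forEach((key, total) -> System.out.println(key + "\t" + total));
    }
}
```

For gifshow.com, January is 5 + 6 = 11 and February is 8 + 7 = 15, matching the query result.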

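  2. Self-join the result of step 1. An unrestricted self-join is a Cartesian product that also pairs rows from different domains, which is wrong here, so add the join condition t1.domain = t2.domain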
select t1.*,t2.* from
(select domain,date_format_new(time) as date,sum(traffic) as traffic from visits group by domain,date_format_new(time)) as t1
join
(select domain,date_format_new(time) as date,sum(traffic) as traffic from visits group by domain,date_format_new(time)) as t2
on
t1.domain = t2.domain;

# Result
gifshow.com		2019-01		11.0		gifshow.com		2019-01		11.0
gifshow.com		2019-01		11.0		gifshow.com		2019-02		15.0
gifshow.com		2019-02		15.0		gifshow.com		2019-01		11.0
gifshow.com		2019-02		15.0		gifshow.com		2019-02		15.0
huya.com		2019-01		1.0			huya.com		2019-01		1.0
yy.com			2019-01		9.0			yy.com			2019-01		9.0
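  3. Now compute the cumulative sum. From the requirement:
  • Group by t1.domain and t1.date, then sum t2.traffic. With the join as it stands, every group of a given domain is matched with the same set of t2 rows, so all groups of that domain get the same total, which is clearly wrong. The attempt looks like this: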
select t1.*,sum(t2.traffic) as total from
(select domain,date_format_new(time) as date,sum(traffic) as traffic from visits group by domain,date_format_new(time)) as t1
join
(select domain,date_format_new(time) as date,sum(traffic) as traffic from visits group by domain,date_format_new(time)) as t2
on
t1.domain = t2.domain
group by t1.domain,t1.date;
-- Fails with: FAILED: SemanticException [Error 10025]: Expression not in GROUP BY key traffic
-- t1.* also selects t1.traffic, so traffic must be added to the GROUP BY clause:

select t1.*,sum(t2.traffic) as total from
(select domain,date_format_new(time) as date,sum(traffic) as traffic from visits group by domain,date_format_new(time)) as t1
join
(select domain,date_format_new(time) as date,sum(traffic) as traffic from visits group by domain,date_format_new(time)) as t2
on
t1.domain = t2.domain
group by t1.domain,t1.date,t1.traffic;

# Result: both gifshow.com groups get the same total, 26.0

gifshow.com	2019-01		11.0	26.0
gifshow.com	2019-02		15.0	26.0
huya.com	2019-01		1.0		1.0
yy.com		2019-01		9.0		9.0
  • Looking at the joined rows behind this result, the groups are:
# Group (gifshow.com, 2019-01)
gifshow.com	2019-01	11.0	gifshow.com	2019-01	11.0
gifshow.com	2019-01	11.0	gifshow.com	2019-02	15.0

# Group (gifshow.com, 2019-02)
gifshow.com	2019-02	15.0	gifshow.com	2019-01	11.0
gifshow.com	2019-02	15.0	gifshow.com	2019-02	15.0

# Group (huya.com, 2019-01)
huya.com	2019-01	1.0	huya.com	2019-01	1.0

# Group (yy.com, 2019-01)
yy.com	2019-01	9.0	yy.com	2019-01	9.0

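The January group for gifshow.com clearly contains an extra February row. In SQL, WHERE is evaluated before GROUP BY, so a WHERE condition can drop that row before the groups are formed. Because yyyy-MM strings sort in chronological order, the condition where t1.date >= t2.date keeps only the months up to and including the group's own month.
With that condition added, the groups become: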

gifshow.com	2019-01	11.0	gifshow.com	2019-01	11.0

gifshow.com	2019-02	15.0	gifshow.com	2019-01	11.0
gifshow.com	2019-02	15.0	gifshow.com	2019-02	15.0

huya.com	2019-01	1.0	huya.com	2019-01	1.0

yy.com	2019-01	9.0	yy.com	2019-01	9.0
  • Final query
select t1.*,sum(t2.traffic) as total from
(select domain,date_format_new(time) as date,sum(traffic) as traffic from visits group by domain,date_format_new(time)) as t1
join
(select domain,date_format_new(time) as date,sum(traffic) as traffic from visits group by domain,date_format_new(time)) as t2
on
t1.domain = t2.domain
where t1.date >= t2.date
group by t1.domain,t1.date,t1.traffic;

# Result

gifshow.com		2019-01		11.0	11.0
gifshow.com		2019-02		15.0	26.0
huya.com		2019-01		1.0		1.0
yy.com			2019-01		9.0		9.0
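
To double-check the arithmetic, the join condition and the WHERE filter can be replayed in memory on the four monthly rows. This is a minimal sketch for verification only, not how Hive runs the query; the class RunningTotal and its methods are hypothetical. It relies on each (domain, month) pair appearing once in the monthly rows, so one pass over t1 corresponds to one group.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory replay of the final self-join query.
public class RunningTotal {
    // One row of the per-month aggregate from step 1.
    static final class Row {
        final String domain;
        final String month;
        final double traffic;

        Row(String domain, String month, double traffic) {
            this.domain = domain;
            this.month = month;
            this.traffic = traffic;
        }
    }

    // The step 1 result for the sample data.
    static List<Row> sample() {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("gifshow.com", "2019-01", 11.0));
        rows.add(new Row("gifshow.com", "2019-02", 15.0));
        rows.add(new Row("huya.com", "2019-01", 1.0));
        rows.add(new Row("yy.com", "2019-01", 9.0));
        return rows;
    }

    // For each t1 row, sums t2.traffic over rows with the same domain
    // (the join condition) and t1.month >= t2.month (the WHERE filter).
    static List<String> cumulative(List<Row> monthly) {
        List<String> out = new ArrayList<>();
        for (Row t1 : monthly) {
            double total = 0;
            for (Row t2 : monthly) {
                if (t1.domain.equals(t2.domain) && t1.month.compareTo(t2.month) >= 0) {
                    total += t2.traffic;
                }
            }
            out.add(t1.domain + "\t" + t1.month + "\t" + t1.traffic + "\t" + total);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : cumulative(sample())) {
            System.out.println(line);
        }
    }
}
```

Removing the month comparison from the if condition reproduces the earlier wrong result, where both gifshow.com rows total 26.0.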