1. Requirements
- The visits table contains the following data (columns: domain, time, traffic)
gifshow.com 2019/01/01 5
yy.com 2019/01/01 4
huya.com 2019/01/01 1
gifshow.com 2019/01/20 6
gifshow.com 2019/02/01 8
yy.com 2019/01/20 5
gifshow.com 2019/02/02 7
- Expected result (domain, month, monthly traffic, cumulative traffic)
gifshow.com 2019-01 11.0 11.0
gifshow.com 2019-02 15.0 26.0
huya.com 2019-01 1.0 1.0
yy.com 2019-01 9.0 9.0
2. Implementation
(1) Date handling
- Implement the UDF
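package www.immoc.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

// Converts a date such as "2019/01/20" into its month key "2019-01".
public class DateFormat extends UDF {
    public String evaluate(String date) {
        if (date == null) return null; // pass NULL through instead of throwing
        String[] splits = date.split("/");
        return splits[0] + "-" + splits[1];
    }
}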
- Package the class into a jar and upload it to HDFS
- Register the function in Hive
# Temporary (current session only)
add jar /home/hadoop/lib/hdfs-train-1.0.jar;
CREATE TEMPORARY FUNCTION date_format_new AS "www.immoc.hive.udf.DateFormat";
# Permanent
CREATE FUNCTION date_format_new AS "www.immoc.hive.udf.DateFormat" USING JAR 'hdfs://bigdata:9000/lib/hdfs-train-1.0.jar';
- Test the UDF by selecting time together with date_format_new(time)
time _c1
2019/01/01 2019-01
2019/01/01 2019-01
2019/01/01 2019-01
2019/01/20 2019-01
2019/02/01 2019-02
2019/01/20 2019-01
2019/02/02 2019-02
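The conversion performed by the UDF can also be checked without a Hive cluster. The following is a minimal sketch in plain Java, not the packaged UDF itself: the class name MonthKey and the method toMonth are hypothetical, and returning null for input that lacks a year and a month is an assumed policy rather than something the original UDF does.

```java
// Hypothetical stand-alone helper mirroring the split-and-join logic of the UDF.
public class MonthKey {
    // Turns "yyyy/MM/dd" into "yyyy-MM".
    // Returns null for null or malformed input (an assumed policy).
    public static String toMonth(String date) {
        if (date == null) {
            return null;
        }
        String[] parts = date.split("/");
        if (parts.length < 2) {
            return null;
        }
        return parts[0] + "-" + parts[1];
    }

    public static void main(String[] args) {
        System.out.println(toMonth("2019/01/20"));
        System.out.println(toMonth("2019/02/02"));
    }
}
```

Keeping the string logic in a plain static method like this makes it easy to unit test before the jar is built and registered.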
(2) Computing the running total
- First, total the traffic of each domain for each month
select domain,date_format_new(time) as date,sum(traffic) as traffic from visits group by domain,date_format_new(time);
# Result
gifshow.com 2019-01 11.0
gifshow.com 2019-02 15.0
huya.com 2019-01 1.0
yy.com 2019-01 9.0
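- Self-join the result of step 1. Without a join condition this would be a Cartesian product that also pairs rows of different domains, which is wrong, so add the condition
t1.domain = t2.domain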
select t1.*,t2.* from
(select domain,date_format_new(time) as date,sum(traffic) as traffic from visits group by domain,date_format_new(time)) as t1
join
(select domain,date_format_new(time) as date,sum(traffic) as traffic from visits group by domain,date_format_new(time)) as t2
on
t1.domain = t2.domain;
# Result
gifshow.com 2019-01 11.0 gifshow.com 2019-01 11.0
gifshow.com 2019-01 11.0 gifshow.com 2019-02 15.0
gifshow.com 2019-02 15.0 gifshow.com 2019-01 11.0
gifshow.com 2019-02 15.0 gifshow.com 2019-02 15.0
huya.com 2019-01 1.0 huya.com 2019-01 1.0
yy.com 2019-01 9.0 yy.com 2019-01 9.0
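- Now compute the running total. From the requirement, the first idea is to
- group by t1.domain and t1.date and then sum t2.traffic. However, every group of the same domain is paired with the same set of t2 rows, so every month of that domain ends up with the same total, which is clearly wrong. The attempt looks like this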
select t1.*,sum(t2.traffic) as total from
(select domain,date_format_new(time) as date,sum(traffic) as traffic from visits group by domain,date_format_new(time)) as t1
join
(select domain,date_format_new(time) as date,sum(traffic) as traffic from visits group by domain,date_format_new(time)) as t2
on
t1.domain = t2.domain
group by t1.domain,t1.date;
# Error: FAILED: SemanticException [Error 10025]: Expression not in GROUP BY key traffic
# t1.* also selects t1.traffic, so the traffic column must be added to the group by clause as well
select t1.*,sum(t2.traffic) as total from
(select domain,date_format_new(time) as date,sum(traffic) as traffic from visits group by domain,date_format_new(time)) as t1
join
(select domain,date_format_new(time) as date,sum(traffic) as traffic from visits group by domain,date_format_new(time)) as t2
on
t1.domain = t2.domain
group by t1.domain,t1.date,t1.traffic;
# Result
gifshow.com 2019-01 11.0 26.0
gifshow.com 2019-02 15.0 26.0
huya.com 2019-01 1.0 1.0
yy.com 2019-01 9.0 9.0
- Looking at the joined rows behind this result, they fall into the following groups
# One group
gifshow.com 2019-01 11.0 gifshow.com 2019-01 11.0
gifshow.com 2019-01 11.0 gifshow.com 2019-02 15.0
# One group
gifshow.com 2019-02 15.0 gifshow.com 2019-01 11.0
gifshow.com 2019-02 15.0 gifshow.com 2019-02 15.0
# One group
huya.com 2019-01 1.0 huya.com 2019-01 1.0
# One group
yy.com 2019-01 9.0 yy.com 2019-01 9.0
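Clearly, the January group picks up an extra February row. A where condition can remove it: in SQL's logical execution order, where is applied before group by,
so add the condition where t1.date >= t2.date
After that, the groups look like this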
gifshow.com 2019-01 11.0 gifshow.com 2019-01 11.0
gifshow.com 2019-02 15.0 gifshow.com 2019-01 11.0
gifshow.com 2019-02 15.0 gifshow.com 2019-02 15.0
huya.com 2019-01 1.0 huya.com 2019-01 1.0
yy.com 2019-01 9.0 yy.com 2019-01 9.0
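The condition t1.date >= t2.date compares the month keys as strings. That works here because a zero-padded yyyy-MM key sorts lexicographically in chronological order. A small Java sketch of the pairing rule follows; the names JoinFilter and keeps are hypothetical, and it assumes both keys have the yyyy-MM shape produced by date_format_new.

```java
// Hypothetical illustration of how the where condition filters joined pairs.
public class JoinFilter {
    // Mirrors "where t1.date >= t2.date" using plain string comparison.
    public static boolean keeps(String t1Date, String t2Date) {
        return t1Date.compareTo(t2Date) >= 0;
    }

    public static void main(String[] args) {
        // The four gifshow.com pairs produced by the self-join.
        String[][] pairs = {
            {"2019-01", "2019-01"},
            {"2019-01", "2019-02"},
            {"2019-02", "2019-01"},
            {"2019-02", "2019-02"},
        };
        for (String[] p : pairs) {
            System.out.println(p[0] + " vs " + p[1] + " kept: " + keeps(p[0], p[1]));
        }
    }
}
```

Only the pair (2019-01, 2019-02) is dropped. The comparison depends on zero padding: an unpadded key such as 2019-9 would sort after 2019-10 and break the filter.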
- Final result
select t1.*,sum(t2.traffic) as total from
(select domain,date_format_new(time) as date,sum(traffic) as traffic from visits group by domain,date_format_new(time)) as t1
join
(select domain,date_format_new(time) as date,sum(traffic) as traffic from visits group by domain,date_format_new(time)) as t2
on
t1.domain = t2.domain
where t1.date >= t2.date
group by t1.domain,t1.date,t1.traffic;
# Result
gifshow.com 2019-01 11.0 11.0
gifshow.com 2019-02 15.0 26.0
huya.com 2019-01 1.0 1.0
yy.com 2019-01 9.0 9.0
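The whole pipeline can be cross-checked outside Hive. The sketch below replays the three steps in plain Java on the sample data: monthly aggregation, self-join on domain with the month filter, and summing per t1 row. It is an illustration under assumptions, not how Hive executes the query; the class CumulativeTraffic, its methods, and the tab-separated map key are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory replay of the monthly aggregation and self-join.
public class CumulativeTraffic {
    // Sample rows from the requirements: domain, time, traffic.
    public static final String[][] SAMPLE = {
        {"gifshow.com", "2019/01/01", "5"},
        {"yy.com", "2019/01/01", "4"},
        {"huya.com", "2019/01/01", "1"},
        {"gifshow.com", "2019/01/20", "6"},
        {"gifshow.com", "2019/02/01", "8"},
        {"yy.com", "2019/01/20", "5"},
        {"gifshow.com", "2019/02/02", "7"},
    };

    // Step 1: sum traffic per "domain<TAB>yyyy-MM" key, like the group by subquery.
    public static TreeMap<String, Double> monthly(String[][] visits) {
        TreeMap<String, Double> result = new TreeMap<>();
        for (String[] v : visits) {
            String[] d = v[1].split("/");
            String key = v[0] + "\t" + d[0] + "-" + d[1];
            result.merge(key, Double.parseDouble(v[2]), Double::sum);
        }
        return result;
    }

    // Steps 2 and 3: for each t1 row, add up the t2 rows that share its domain
    // and whose month is not later than the t1 month.
    public static List<String> cumulative(String[][] visits) {
        TreeMap<String, Double> m = monthly(visits);
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Double> t1 : m.entrySet()) {
            String[] k1 = t1.getKey().split("\t");
            double total = 0;
            for (Map.Entry<String, Double> t2 : m.entrySet()) {
                String[] k2 = t2.getKey().split("\t");
                if (k1[0].equals(k2[0]) && k1[1].compareTo(k2[1]) >= 0) {
                    total += t2.getValue();
                }
            }
            rows.add(k1[0] + " " + k1[1] + " " + t1.getValue() + " " + total);
        }
        return rows;
    }

    public static void main(String[] args) {
        cumulative(SAMPLE).forEach(System.out::println);
    }
}
```

Run on the sample data, this prints the same four rows as the final Hive result above, which gives a quick way to confirm the join and filter logic before running it on a cluster.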