While running statistics with Hive at work, I ran into a problem with a GROUP BY query. The fields involved are
gid,sid,user,roleid,time,status,map_id,num
The time field is a Unix timestamp, and the requirement is to sum num per hour for each combination of the other fields.
My first attempt at the Hive SQL was
select gid,sid,user,roleid,time,status,map_id,sum(num) from test group by gid,sid,user,roleid,substr(from_unixtime(time,'yyyyMMddHHmmss'),9,2),time,status,map_id;
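After running this in Hive the result was wrong: because time is still in the GROUP BY, Hive groups by every distinct timestamp rather than by hour. So I removed time from the GROUP BY: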
select gid,sid,user,roleid,time,status,map_id,sum(num) from test group by gid,sid,user,roleid,substr(from_unixtime(time,'yyyyMMddHHmmss'),9,2),status,map_id;
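Hive returned the error FAILED: Error in semantic analysis: Line 1:27 Expression not in GROUP BY key time
This is because time is still selected without being aggregated or listed in the GROUP BY. If the SQL is changed to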
select gid,sid,user,roleid,substr(from_unixtime(time,'yyyyMMddHHmmss'),9,2),status,map_id,sum(num) from test group by gid,sid,user,roleid,substr(from_unixtime(time,'yyyyMMddHHmmss'),9,2),status,map_id;
This aggregates by hour, but the output no longer shows the time field the way I wanted. In the end I found a method via Google:
select gid,sid,user,roleid,collect_set(time)[0],status,map_id,sum(num) from test group by gid,sid,user,roleid,substr(from_unixtime(time,'yyyyMMddHHmmss'),9,2),status,map_id;
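As far as I understand it, collect_set(time) gathers the distinct time values of each group into an array and [0] picks one element from it, so time is now wrapped in an aggregate and the GROUP BY check passes. I would not rely on which timestamp ends up at index 0. If a predictable value is preferred, min(time) is one option. A minimal sketch on the same test table (the aliases first_time and total_num are my own, not from the original query):

```sql
-- Sketch: same hour grouping as above, but show the earliest timestamp of each group.
-- first_time and total_num are illustrative alias names.
select gid,sid,user,roleid,min(time) as first_time,status,map_id,sum(num) as total_num
from test
group by gid,sid,user,roleid,substr(from_unixtime(time,'yyyyMMddHHmmss'),9,2),status,map_id;
```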
Reference: http://stackoverflow.com/questions/5746687/hive-expression-not-in-group-by-key
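One thing to watch: substr(...,9,2) keeps only the HH part of yyyyMMddHHmmss, so rows from the same hour of different days land in the same group. If the table spans more than one day and the totals should be per day and hour, grouping on the first ten characters (yyyyMMddHH) may be closer to what is wanted. A sketch under that assumption, with day_hour and total_num as names I made up:

```sql
-- Sketch: group by date plus hour (yyyyMMddHH) instead of hour of day only.
-- Assumes time holds seconds since the epoch, as from_unixtime expects.
select gid,sid,user,roleid,
       substr(from_unixtime(time,'yyyyMMddHHmmss'),1,10) as day_hour,
       status,map_id,sum(num) as total_num
from test
group by gid,sid,user,roleid,
         substr(from_unixtime(time,'yyyyMMddHHmmss'),1,10),
         status,map_id;
```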
It seems I still don't have a good enough grasp of Hive's UDFs and need to study them more.