1. Hive built-in functions

1.1. Numeric functions

1. Rounding function: round

Syntax: round(double a)
Return type: BIGINT
Description: Returns a rounded to the nearest integer (half-up rounding).

hive> select round(3.1415926) from tableName;
3
hive> select round(3.5) from tableName;
4
hive> create table round_result as select round(9542.158) as r from tableName;  -- round_result is a hypothetical target table name

2. Rounding to a given precision: round

Syntax: round(double a, int d)
Return type: DOUBLE
Description: Returns a rounded to d decimal places.

hive> select round(3.1415926,4) from tableName;
3.1416

3. Floor function: floor

Syntax: floor(double a)
Return type: BIGINT
Description: Returns the largest integer that is less than or equal to a.

hive> select floor(3.1415926) from tableName;
3
hive> select floor(25) from tableName;
25

4. Ceiling function: ceil

Syntax: ceil(double a)
Return type: BIGINT
Description: Returns the smallest integer that is greater than or equal to a.

hive> select ceil(3.1415926) from tableName;
4
hive> select ceil(46) from tableName;
46

5. Ceiling function: ceiling

Syntax: ceiling(double a)
Return type: BIGINT
Description: Same as ceil.

hive> select ceiling(3.1415926) from tableName;
4
hive> select ceiling(46) from tableName;
46

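6. Random number function: rand

Syntax: rand(), rand(int seed)
Return type: double
Description: Returns a random number between 0 and 1. If a seed is given, the function produces a repeatable sequence of random numbers.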

hive> select rand() from tableName;
0.5577432776034763
hive> select rand() from tableName;
0.6638336467363424
hive> select rand(100) from tableName;
0.7220096548596434
hive> select rand(100) from tableName;
0.7220096548596434
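
The repeatable output of rand(100) above is the usual behaviour of a seeded pseudo-random generator. A minimal Java sketch of the same idea follows, using java.util.Random as a stand-in. Whether Hive uses exactly this generator internally is an assumption here, and the class name SeededRandSketch is hypothetical.

```java
import java.util.Random;

class SeededRandSketch {
    // First value produced by a generator created with the given seed.
    static double firstValue(long seed) {
        return new Random(seed).nextDouble();
    }

    public static void main(String[] args) {
        // Same seed: both calls return the identical value.
        System.out.println(firstValue(100) == firstValue(100));
        // No seed: every generator is seeded differently, so values normally differ.
        System.out.println(new Random().nextDouble());
    }
}
```

This is why a seeded rand is handy for reproducible sampling, while the unseeded form changes on every run.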

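1.2. Date functions

1. Unix timestamp to date: from_unixtime

Syntax: from_unixtime(bigint unixtime[, string format])
Return type: string
Description: Converts a Unix timestamp (seconds since 1970-01-01 00:00:00 UTC) to a string in the given format, expressed in the current time zone.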

hive> select from_unixtime(1323308943,'yyyyMMdd') from tableName;
20111208

2. Current Unix timestamp: unix_timestamp

Syntax: unix_timestamp()
Return type: bigint
Description: Returns the current Unix timestamp in seconds.

hive> select unix_timestamp() from tableName;
1323309615

3. Date to Unix timestamp: unix_timestamp

Syntax: unix_timestamp(string date)
Return type: bigint
Description: Converts a date string in the format "yyyy-MM-dd HH:mm:ss" to a Unix timestamp. Returns 0 if the conversion fails.

hive> select unix_timestamp('2011-12-07 13:01:03') from tableName;
1323234063

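4. Date in a given format to Unix timestamp: unix_timestamp

Syntax: unix_timestamp(string date, string pattern)
Return type: bigint
Description: Converts a date string in the format given by pattern to a Unix timestamp. Returns 0 if the conversion fails.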

hive> select unix_timestamp('20111207 13:01:03','yyyyMMdd HH:mm:ss') from tableName;
1323234063
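
The two conversions above can be sketched in plain Java with java.time. This is only an illustration, not Hive's implementation: the class name UnixTimeSketch is hypothetical, and the time zone Asia/Shanghai (UTC+8) is an assumption inferred from the sample outputs in this section.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

class UnixTimeSketch {
    // Assumed session time zone (UTC+8), inferred from the sample outputs.
    static final ZoneId ZONE = ZoneId.of("Asia/Shanghai");

    // Rough analogue of from_unixtime(seconds, pattern).
    static String fromUnixtime(long seconds, String pattern) {
        return DateTimeFormatter.ofPattern(pattern).withZone(ZONE)
                .format(Instant.ofEpochSecond(seconds));
    }

    // Rough analogue of unix_timestamp(text, pattern).
    static long unixTimestamp(String text, String pattern) {
        return LocalDateTime.parse(text, DateTimeFormatter.ofPattern(pattern))
                .atZone(ZONE).toEpochSecond();
    }

    public static void main(String[] args) {
        System.out.println(fromUnixtime(1323308943L, "yyyyMMdd"));
        System.out.println(unixTimestamp("20111207 13:01:03", "yyyyMMdd HH:mm:ss"));
    }
}
```

The same local time string maps to a different timestamp in another time zone, so results depend on where the session runs.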

5. Datetime to date: to_date

Syntax: to_date(string timestamp)
Return type: string
Description: Returns the date part of a datetime value.

hive> select to_date('2011-12-08 10:03:01') from tableName;
2011-12-08

6. Year of a date: year

Syntax: year(string date)
Return type: int
Description: Returns the year of the date.

hive> select year('2011-12-08 10:03:01') from tableName;
2011
hive> select year('2012-12-08') from tableName;
2012

7. Month of a date: month

Syntax: month(string date)
Return type: int
Description: Returns the month of the date.

hive> select month('2011-12-08 10:03:01') from tableName;
12
hive> select month('2011-08-08') from tableName;
8

8. Day of a date: day

Syntax: day(string date)
Return type: int
Description: Returns the day of the month of the date.

hive> select day('2011-12-08 10:03:01') from tableName;
8
hive> select day('2011-12-24') from tableName;
24

9. Hour of a date: hour

Syntax: hour(string date)
Return type: int
Description: Returns the hour of the date.

hive> select hour('2011-12-08 10:03:01') from tableName;
10

10. Minute of a date: minute

Syntax: minute(string date)
Return type: int
Description: Returns the minute of the date.

hive> select minute('2011-12-08 10:03:01') from tableName;
3

11. Second of a date: second

Syntax: second(string date)
Return type: int
Description: Returns the second of the date.

hive> select second('2011-12-08 10:03:01') from tableName;
1

12. Week of a date: weekofyear

Syntax: weekofyear(string date)
Return type: int
Description: Returns the week of the year in which the date falls.

hive> select weekofyear('2011-12-08 10:03:01') from tableName;
49

13. Date comparison: datediff

Syntax: datediff(string enddate, string startdate)
Return type: int
Description: Returns the number of days from startdate to enddate (enddate minus startdate).

hive> select datediff('2012-12-08','2012-05-09') from tableName;
213

14. Date addition: date_add

Syntax: date_add(string startdate, int days)
Return type: string
Description: Returns the date that is days days after startdate.

hive> select date_add('2012-12-08',10) from tableName;
2012-12-18

15. Date subtraction: date_sub

Syntax: date_sub(string startdate, int days)
Return type: string
Description: Returns the date that is days days before startdate.

hive> select date_sub('2012-12-08',10) from tableName;
2012-11-28
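
The date arithmetic in items 12 to 15 can be cross-checked with java.time. The sketch below is an illustration under two assumptions: the class name DateArithmeticSketch is hypothetical, and weekofyear is taken to follow ISO week numbering.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.time.temporal.IsoFields;

class DateArithmeticSketch {
    // Analogue of datediff(enddate, startdate): whole days from start to end.
    static long datediff(String end, String start) {
        return ChronoUnit.DAYS.between(LocalDate.parse(start), LocalDate.parse(end));
    }

    // Analogue of date_add; a negative day count behaves like date_sub.
    static String dateAdd(String start, int days) {
        return LocalDate.parse(start).plusDays(days).toString();
    }

    // Analogue of weekofyear, assuming ISO week numbering.
    static int weekOfYear(String date) {
        return LocalDate.parse(date).get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
    }

    public static void main(String[] args) {
        System.out.println(datediff("2012-12-08", "2012-05-09"));
        System.out.println(dateAdd("2012-12-08", 10));
        System.out.println(dateAdd("2012-12-08", -10));
        System.out.println(weekOfYear("2011-12-08"));
    }
}
```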

1.3. Conditional functions

1. If function: if

Syntax: if(boolean testCondition, T valueTrue, T valueFalseOrNull)
Return type: T
Description: Returns valueTrue when testCondition is TRUE; otherwise returns valueFalseOrNull.

hive> select if(1=2,100,200) from tableName;
200
hive> select if(1=1,100,200) from tableName;
100

2. First non-null value: COALESCE

Syntax: COALESCE(T v1, T v2, …)
Return type: T
Description: Returns the first non-NULL argument; returns NULL if all arguments are NULL.

hive> select COALESCE(null,'100','50') from tableName;
100

3. Conditional expression: CASE (simple form)

Syntax: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
Return type: T
Description: Returns c if a equals b; returns e if a equals d; otherwise returns f.

hive> Select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end from tableName;
mary
hive> Select case 200 when 50 then 'tom' when 100 then 'mary' else 'tim' end from tableName;
tim

4. Conditional expression: CASE (searched form)

Syntax: CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
Return type: T
Description: Returns b if a is TRUE; returns d if c is TRUE; otherwise returns e.

hive> select case when 1=2 then 'tom' when 2=2 then 'mary' else 'tim' end from tableName;
mary
hive> select case when 1=1 then 'tom' when 2=2 then 'mary' else 'tim' end from tableName;
tom

1.4. String functions

1. String length: length

Syntax: length(string A)
Return type: int
Description: Returns the length of string A.

hive> select length('abcedfg') from tableName;
7

2. String reversal: reverse

Syntax: reverse(string A)
Return type: string
Description: Returns string A reversed.

hive> select reverse('abcedfg') from tableName;
gfdecba

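3. String concatenation: concat

Syntax: concat(string A, string B…)
Return type: string
Description: Returns the concatenation of the input strings; any number of input strings is accepted.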

hive> select concat('abc','def','gh') from tableName;
abcdefgh

4. String concatenation with a separator: concat_ws

Syntax: concat_ws(string SEP, string A, string B…)
Return type: string
Description: Returns the concatenation of the input strings, with SEP placed between them.

hive> select concat_ws(',','abc','def','gh')from tableName;
abc,def,gh

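5. Substring from a position: substr, substring

Syntax: substr(string A, int start), substring(string A, int start)
Return type: string
Description: Returns the part of string A from position start to the end. Positions start at 1; a negative start counts from the end of the string.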

hive> select substr('abcde',3) from tableName;
cde
hive> select substring('abcde',3) from tableName;
cde
hive> select substr('abcde',-1) from tableName;  (same behavior as Oracle)
e

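6. Substring with a length: substr, substring

Syntax: substr(string A, int start, int len), substring(string A, int start, int len)
Return type: string
Description: Returns the part of string A that starts at position start and is len characters long.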

hive> select substr('abcde',3,2) from tableName;
cd
hive> select substring('abcde',3,2) from tableName;
cd
hive> select substring('abcde',-2,2) from tableName;
de
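
The indexing rules used by substr in the examples above (positions start at 1, a negative start counts from the end) can be sketched in Java. This is an illustrative re-implementation, not Hive's code; the class name SubstrSketch is hypothetical, and treating start 0 like start 1 is an assumption.

```java
class SubstrSketch {
    // 1-based substring; a negative start counts back from the end of the string.
    static String substr(String s, int start, int len) {
        int n = s.length();
        int begin = start > 0 ? start - 1 : (start < 0 ? n + start : 0);
        if (begin < 0 || begin >= n || len <= 0) {
            return "";
        }
        // Compare against the remaining length to avoid integer overflow.
        int end = (len >= n - begin) ? n : begin + len;
        return s.substring(begin, end);
    }

    // Two-argument form: everything from start to the end of the string.
    static String substr(String s, int start) {
        return substr(s, start, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        System.out.println(substr("abcde", 3));
        System.out.println(substr("abcde", -1));
        System.out.println(substr("abcde", 3, 2));
        System.out.println(substr("abcde", -2, 2));
    }
}
```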

7. Upper case: upper, ucase

Syntax: upper(string A), ucase(string A)
Return type: string
Description: Returns string A in upper case.

hive> select upper('abSEd') from tableName;
ABSED
hive> select ucase('abSEd') from tableName;
ABSED

8. Lower case: lower, lcase

Syntax: lower(string A), lcase(string A)
Return type: string
Description: Returns string A in lower case.

hive> select lower('abSEd') from tableName;
absed
hive> select lcase('abSEd') from tableName;
absed

9. Trimming spaces: trim

Syntax: trim(string A)
Return type: string
Description: Removes the spaces at both ends of the string.

hive> select trim(' abc ') from tableName;
abc

10. URL parsing: parse_url

Syntax: parse_url(string urlString, string partToExtract [, string keyToExtract])
Return type: string
Description: Returns the specified part of the URL. Valid values for partToExtract are HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO.

hive> select parse_url
('https://www.tableName.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') 
from tableName;
www.tableName.com 
hive> select parse_url
('https://www.tableName.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1')
 from tableName;
v1

11. JSON parsing: get_json_object

Syntax: get_json_object(string json_string, string path)
Return type: string
Description: Parses the JSON string json_string and returns the content selected by path. Returns NULL if the input JSON string is invalid.

hive> select get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red"} },"email":"amy@only_for_json_udf_test.net","owner":"amy"}','$.owner') from tableName;
amy

12. Repeating a string: repeat

Syntax: repeat(string str, int n)
Return type: string
Description: Returns str repeated n times.

hive> select repeat('abc',5) from tableName;
abcabcabcabcabc

13. Splitting a string: split

Syntax: split(string str, string pat)
Return type: array
Description: Splits str around matches of pat and returns the resulting array of strings.

hive> select split('abtcdtef','t') from tableName;
["ab","cd","ef"]

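1.5. Aggregate functions

1. Count: count

Syntax: count(*), count(expr), count(DISTINCT expr[, expr...])
Return type: bigint

Description: count(*) counts the retrieved rows, including rows that contain NULL values; count(expr) returns the number of non-NULL values of the given expression; count(DISTINCT expr[, expr...]) returns the number of distinct non-NULL values of the given expressions.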

hive> select count(*) from tableName;
20
hive> select count(distinct t) from tableName;
10

2. Sum: sum

Syntax: sum(col), sum(DISTINCT col)
Return type: double
Description: sum(col) returns the sum of col over the result set; sum(DISTINCT col) returns the sum of the distinct values of col.

hive> select sum(t) from tableName;
100
hive> select sum(distinct t) from tableName;
70

3. Average: avg

Syntax: avg(col), avg(DISTINCT col)
Return type: double
Description: avg(col) returns the average of col over the result set; avg(DISTINCT col) returns the average of the distinct values of col.

hive> select avg(t) from tableName;
50
hive> select avg (distinct t) from tableName;
30

4. Minimum: min

Syntax: min(col)
Return type: double
Description: Returns the minimum value of col over the result set.

hive> select min(t) from tableName;
20

5. Maximum: max

Syntax: max(col)
Return type: double
Description: Returns the maximum value of col over the result set.

hive> select max(t) from tableName;
120

1.6. Complex type constructors

1. Map construction: map

Syntax: map (key1, value1, key2, value2, …)
Description: Builds a map from the given key/value pairs.

create table score_map(name string, score map<string,int>)
row format delimited fields terminated by '\t' 
collection items terminated by ',' map keys terminated by ':';

Create the following data file and load it:
cd /kkb/install/hivedatas/
vim score_map.txt

zhangsan    數學:80,語文:89,英語:95
lisi    語文:60,數學:80,英語:99

Load the data into the Hive table:
load data local inpath '/kkb/install/hivedatas/score_map.txt' overwrite into table score_map;

Accessing map data:
Get all values:
select name,map_values(score) from score_map;

Get all keys:
select name,map_keys(score) from score_map;

Get a value by key:
select name,score["數學"]  from score_map;

Get the number of map entries:
select name,size(score) from score_map;

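2. Struct construction: struct

Syntax: struct(val1, val2, val3, …)
Description: Builds a struct from the given values. It is similar to a struct in C, and its fields are accessed with dot notation (X.X). Suppose the data describes a movie ABC that was rated by 1254 people with a score of 7.4.

Create the struct table: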
hive> create table movie_score( name string,  info struct<number:int,score:float> )row format delimited fields terminated by "\t"  collection items terminated by ":"; 

Prepare the data:
cd /kkb/install/hivedatas/
vim struct.txt

ABC 1254:7.4
DEF 256:4.9
XYZ 456:5.4

Load the data:
load data local inpath '/kkb/install/hivedatas/struct.txt' overwrite into table movie_score;

Query the data in Hive:
hive> select * from movie_score;  
hive> select info.number,info.score from movie_score;  
OK  
1254    7.4  
256     4.9  
456     5.4

3. Array construction: array

Syntax: array(val1, val2, …)
Description: Builds an array from the given values.

hive> create table  person(name string,work_locations array<string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';

Prepare the data for the person table:
cd /kkb/install/hivedatas/
vim person.txt

The data has the following format:
biansutao   beijing,shanghai,tianjin,hangzhou
linan   changchu,chengdu,wuhan

Load the data:
hive> load data local inpath '/kkb/install/hivedatas/person.txt' overwrite into table person;

Query all the data:
hive> select * from person;

Query by array index:
hive> select work_locations[0] from person;

Query the whole array:
hive> select work_locations from person;

Query the number of elements:
hive> select size(work_locations) from person;

1.7. Size functions for complex types

1. Map size: size(Map<K,V>)

Syntax: size(Map<K,V>)
Return type: int
Description: Returns the number of entries in the map.

hive> select size(t) from map_table2;
2

2. Array size: size(Array<T>)

Syntax: size(Array<T>)
Return type: int
Description: Returns the number of elements in the array.

hive> select size(t) from arr_table2;
4

3. Type conversion: cast

Syntax: cast(expr as <type>)
Return type: <type>
Description: Returns expr converted to the given data type.

hive> select cast('1' as bigint) from tableName;
1

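1.8. The explode function

1. Using explode to split Map and Array columns of a Hive table

LATERAL VIEW is used together with UDTFs such as explode (often applied to the array returned by split) to turn one row into multiple rows, which can then be aggregated. LATERAL VIEW first calls the UDTF for each row of the source table, the UDTF turns that row into one or more rows, and LATERAL VIEW then combines the results into a virtual table that supports a table alias.
explode can also be used on its own to split a complex array or map column into multiple rows.

Requirement: the data has the following format
zhangsan    child1,child2,child3,child4 k1:v1,k2:v2
lisi    child5,child6,child7,child8  k3:v3,k4:v4

Fields are separated by \t. The goal is to split all the child values into a single column: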
+----------+--+
| mychild  |
+----------+--+
| child1   |
| child2   |
| child3   |
| child4   |
| child5   |
| child6   |
| child7   |
| child8   |
+----------+--+

The keys and values of the map should also be split, giving the following result:

+-----------+-------------+--+
| mymapkey  | mymapvalue  |
+-----------+-------------+--+
| k1        | v1          |
| k2        | v2          |
| k3        | v3          |
| k4        | v4          |
+-----------+-------------+--+

Step 1: Create the Hive database

hive (default)> create database hive_explode;
hive (default)> use hive_explode;

Step 2: Create the Hive table that explode will later split (array and map columns)

create  table hive_explode.t3(name string,
children array<string>,
address Map<string,string>)
row format delimited fields terminated by '\t'  
collection items terminated by ','
map keys terminated by ':' 
stored as textFile;

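Step 3: Load the data

On node03, run the following commands to create the data file:

cd  /kkb/install/hivedatas/

vim maparray
The data has the following format:

zhangsan    child1,child2,child3,child4 k1:v1,k2:v2
lisi    child5,child6,child7,child8 k3:v3,k4:v4

Load the data into the Hive table: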

hive (hive_explode)> load data local inpath '/kkb/install/hivedatas/maparray' into table hive_explode.t3;

Step 4: Use explode to split the data

Split the array column:

hive (hive_explode)> SELECT explode(children) AS myChild FROM hive_explode.t3;

Split the map column:

hive (hive_explode)> SELECT explode(address) AS (myMapKey, myMapValue) FROM hive_explode.t3;

2. Using explode to split a JSON string

Requirement: the data has the following format:

a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]

The fields are separated by |

We want to extract every monthSales value into a single column, one value per row (a row/column conversion):

4900
2090
6987

Step 1: Create the Hive table

hive (hive_explode)> 
create table hive_explode.explode_lateral_view (
area string, 
goods_id string,
sale_info string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
STORED AS textfile;

Step 2: Prepare and load the data

Prepare the data as follows:

cd /kkb/install/hivedatas
vim explode_json

a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]

Load the data into the Hive table:

hive (hive_explode)> load data local inpath '/kkb/install/hivedatas/explode_json' overwrite into table hive_explode.explode_lateral_view;

Step 3: Use explode to split the array

hive (hive_explode)> select explode(split(goods_id,',')) as goods_id from hive_explode.explode_lateral_view;

Step 4: Use explode to split the map-like area field

hive (hive_explode)> select explode(split(area,',')) as area from hive_explode.explode_lateral_view;

Step 5: Split the JSON field

hive (hive_explode)> select explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')) as  sale_info from hive_explode.explode_lateral_view;

Next, we try to use get_json_object to read the value of the monthSales key:

hive (hive_explode)> select get_json_object(explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')),'$.monthSales') as  sale_info from hive_explode.explode_lateral_view;
This fails with: FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions
A UDTF such as explode cannot be nested inside another function.
Selecting a second column next to the UDTF, as in select explode(split(area,',')) as area,good_id from explode_lateral_view;
also fails: FAILED: SemanticException 1:40 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'good_id'
A SELECT that calls a UDTF directly supports only that single expression, and this is where LATERAL VIEW comes in.

3. Using explode with LATERAL VIEW

Querying several columns with LATERAL VIEW:

hive (hive_explode)> select goods_id2,sale_info from explode_lateral_view LATERAL VIEW explode(split(goods_id,','))goods as goods_id2;

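Here LATERAL VIEW explode(split(goods_id,','))goods acts as a virtual table that is joined to the source table explode_lateral_view like a Cartesian product, row by row.

LATERAL VIEW can also be used more than once: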

hive (hive_explode)> select goods_id2,sale_info,area2 from explode_lateral_view  LATERAL VIEW explode(split(goods_id,','))goods as goods_id2 LATERAL VIEW explode(split(area,','))area as area2;

The result is again the Cartesian product of the three tables.
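
What the chained LATERAL VIEW clauses do to a single input row can be sketched as two nested loops in Java. This is a conceptual illustration only, not how Hive executes the query; the class name LateralViewSketch and the shortened sample values are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class LateralViewSketch {
    // One output row per (goods_id element, area element) pair of a single input row.
    static List<String> lateralView(String goodsId, String area) {
        List<String> rows = new ArrayList<>();
        for (String g : goodsId.split(",")) {      // first LATERAL VIEW explode(split(goods_id, ','))
            for (String a : area.split(",")) {     // second LATERAL VIEW explode(split(area, ','))
                rows.add(g + "\t" + a);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // 3 goods ids x 2 areas = 6 output rows for this one input row.
        lateralView("1,2,3", "a:shandong,b:beijing").forEach(System.out::println);
    }
}
```

Each input row is expanded independently, so the row count multiplies with every additional LATERAL VIEW.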

Finally, the following statement turns the single row of JSON data into a regular two-dimensional table:

hive (hive_explode)> select get_json_object(concat('{',sale_info_1,'}'),'$.source') as source, get_json_object(concat('{',sale_info_1,'}'),'$.monthSales') as monthSales, get_json_object(concat('{',sale_info_1,'}'),'$.userCount') as userCount,  get_json_object(concat('{',sale_info_1,'}'),'$.score') as score from explode_lateral_view   LATERAL VIEW explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{'))sale_info as sale_info_1;
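
To see what the nested regexp_replace and split calls leave behind, the same regular expressions can be applied in plain Java. The sketch below is an illustration under assumptions: the class name SaleInfoSplitSketch is hypothetical, the sample string is a shortened version of the data above, and monthSales is read with a regular expression instead of a JSON parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SaleInfoSplitSketch {
    // Strip the leading "[{" and trailing "}]", then split between objects on "},{".
    static String[] splitObjects(String saleInfo) {
        return saleInfo.replaceAll("\\[\\{", "").replaceAll("}]", "").split("},\\{");
    }

    // Pull the monthSales number out of every piece.
    static List<String> monthSales(String saleInfo) {
        Pattern p = Pattern.compile("\"monthSales\":(\\d+)");
        List<String> out = new ArrayList<>();
        for (String piece : splitObjects(saleInfo)) {
            Matcher m = p.matcher(piece);
            if (m.find()) {
                out.add(m.group(1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "[{\"source\":\"7fresh\",\"monthSales\":4900},"
                + "{\"source\":\"jd\",\"monthSales\":2090},"
                + "{\"source\":\"jdmart\",\"monthSales\":6987}]";
        System.out.println(monthSales(sample));
    }
}
```

Each piece has lost its surrounding braces, which is why the Hive query wraps it again with concat('{', ..., '}') before calling get_json_object.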

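Summary:

LATERAL VIEW usually appears together with a UDTF; it works around the restriction that a UDTF cannot be combined with other expressions in the SELECT list.
Multiple LATERAL VIEW clauses produce a result similar to a Cartesian product.
The OUTER keyword emits a row with NULL for input rows where the UDTF produces no output, so those rows are not lost.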

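1.9. Converting between rows and columns

1.9.1. Columns to rows (aggregating with collect_set)

1. Related functions

CONCAT(string A/col, string B/col…): returns the concatenation of the input strings; any number of input strings is accepted.

CONCAT_WS(separator, str1, str2,...): a special form of CONCAT(). The first argument is the separator placed between the remaining arguments. The separator can be a string, like the remaining arguments. If the separator is NULL, the result is NULL. The function skips NULL values after the separator argument (empty strings are not skipped). The separator is inserted between the strings being joined.

COLLECT_SET(col): accepts only primitive data types. It collects the distinct values of a column into an array.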

2. Data preparation

Table 6-6: Sample data

name	constellation	blood_type
孫悟空	白羊座	A
老王	射手座	A
宋宋	白羊座	B
豬八戒	白羊座	A
冰冰	射手座	A

3. Requirement

Group together the people who share the same constellation and blood type. The expected result is:

射手座,A            老王|冰冰
白羊座,A            孫悟空|豬八戒
白羊座,B            宋宋

4. Create a local file constellation.txt and import the data

On node03, run the following commands to create the file. Note that the fields are separated by \t.

cd /kkb/install/hivedatas
vim constellation.txt

孫悟空 白羊座 A
老王  射手座 A
宋宋  白羊座 B       
豬八戒 白羊座 A
冰冰  射手座 A

5. Create the Hive table and import the data

Create the Hive table:

hive (hive_explode)> create table person_info(  name string,  constellation string,  blood_type string)  row format delimited fields terminated by "\t";

Load the data:

hive (hive_explode)> load data local inpath '/kkb/install/hivedatas/constellation.txt' into table person_info;

6. Query the data as required

hive (hive_explode)> select t1.base, concat_ws('|', collect_set(t1.name)) name from    (select name, concat(constellation, "," , blood_type) base from person_info) t1 group by  t1.base;
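
The grouping performed by the query above (concat for the key, collect_set for the names, concat_ws to join them) can be sketched in Java. This is an illustration only: the class name CollectSetSketch is hypothetical, and the ASCII names are stand-ins for the sample data. Unlike collect_set, which does not promise any element order, this sketch keeps insertion order.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class CollectSetSketch {
    // rows: {name, constellation, blood_type}; returns "constellation,blood_type" -> "name|name".
    static Map<String, String> groupNames(String[][] rows) {
        Map<String, Set<String>> groups = new TreeMap<>();
        for (String[] r : rows) {
            String base = r[1] + "," + r[2];                                    // concat(constellation, ",", blood_type)
            groups.computeIfAbsent(base, k -> new LinkedHashSet<>()).add(r[0]); // collect_set(name)
        }
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : groups.entrySet()) {
            result.put(e.getKey(), String.join("|", e.getValue()));             // concat_ws('|', ...)
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"sunwukong", "Aries", "A"}, {"laowang", "Sagittarius", "A"},
            {"songsong", "Aries", "B"}, {"zhubajie", "Aries", "A"},
            {"bingbing", "Sagittarius", "A"}
        };
        groupNames(rows).forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```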

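1.9.2. Rows to columns (expanding with explode)

1. Function description

EXPLODE(col): splits a complex array or map column into multiple rows.

LATERAL VIEW

Usage: LATERAL VIEW udtf(expression) tableAlias AS columnAlias

Explanation: used together with UDTFs such as explode (often applied to the result of split). It turns one column value into multiple rows, which can then be aggregated.

2. Data preparation

The data is as follows; the fields are separated by \t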

cd /kkb/install/hivedatas

vim movie.txt
《疑犯追蹤》  懸疑,動作,科幻,劇情
《Lie to me》 懸疑,警匪,動作,心理,劇情
《戰狼2》   戰爭,動作,災難

3. Requirement

Expand the array of categories for each movie. The expected result is:

《疑犯追蹤》  懸疑
《疑犯追蹤》  動作
《疑犯追蹤》  科幻
《疑犯追蹤》  劇情
《Lie to me》 懸疑
《Lie to me》 警匪
《Lie to me》 動作
《Lie to me》 心理
《Lie to me》 劇情
《戰狼2》   戰爭
《戰狼2》   動作
《戰狼2》   災難

4. Create the Hive table and import the data

Create the Hive table:

hive (hive_explode)> create table movie_info(
movie string, 
category array<string>
) 
row format delimited fields terminated by "\t" 
collection items terminated by ",";

Load the data:

load data local inpath "/kkb/install/hivedatas/movie.txt" into table movie_info;

5. Query the data as required

hive (hive_explode)>  
select movie, category_name 
from 
movie_info lateral view explode(category) table_tmp as category_name;

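1.10. The reflect function

The reflect function makes it possible to call methods of built-in Java classes from SQL.

Using max from java.lang.Math to find the larger of two columns

Create the Hive table: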

hive (hive_explode)>  
create table test_udf(col1 int,col2 int)
row format delimited fields terminated by ',';

Prepare and load the data:

cd /kkb/install/hivedatas

vim test_udf

1,2
4,3
6,4
7,5
5,6

Load the data:

hive (hive_explode)> load data local inpath '/kkb/install/hivedatas/test_udf' overwrite into table test_udf;

Use max from java.lang.Math to find the larger of the two columns:

hive (hive_explode)> select reflect("java.lang.Math","max",col1,col2) from test_udf;

Calling a different Java method for each record

Create the Hive table:

hive (hive_explode)> create table test_udf2(class_name string,method_name string,col1 int , col2 int) row format delimited fields terminated by ',';

Prepare the data:

cd /kkb/install/hivedatas

vim test_udf2

java.lang.Math,min,1,2
java.lang.Math,max,2,3

Load the data:

hive (hive_explode)> load data local inpath '/kkb/install/hivedatas/test_udf2' overwrite into table test_udf2;

Run the query:

hive (hive_explode)> select reflect(class_name,method_name,col1,col2) from test_udf2;
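
Conceptually, reflect looks up a static method by class name and method name at run time and invokes it. A minimal Java sketch of that idea with the standard reflection API is shown below. It is not Hive's implementation; the class name ReflectSketch is hypothetical and only two int arguments are handled.

```java
import java.lang.reflect.Method;

class ReflectSketch {
    // Look up a static method taking two ints by name and invoke it.
    static Object call(String className, String methodName, int a, int b) {
        try {
            Method m = Class.forName(className).getMethod(methodName, int.class, int.class);
            return m.invoke(null, a, b);   // null receiver because the method is static
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot call " + className + "." + methodName, e);
        }
    }

    public static void main(String[] args) {
        // Mirrors the two rows loaded into test_udf2.
        System.out.println(call("java.lang.Math", "min", 1, 2));
        System.out.println(call("java.lang.Math", "max", 2, 3));
    }
}
```

Because the class and method are resolved per call, each row can name a different method, as in the test_udf2 query.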

Checking whether a value is a number

This uses a function from Apache Commons. The Commons jars are already on the Hadoop classpath, so they can be used directly.

Usage:

hive (hive_explode)> select reflect("org.apache.commons.lang.math.NumberUtils","isNumber","123");

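1.11. Analytic functions

1. What analytic functions are for

More complex calculations often call for analytic (window) functions. They are mainly used for per-group top-N queries, percentages, and splitting data into buckets.

2. Common analytic functions

1. ROW_NUMBER():

Generates a sequence number for each row within its group, starting at 1 and following the sort order. For example, ordering by pv descending gives each day's pv rank within the group. ROW_NUMBER() has many uses, such as fetching the top-ranked record in each group or the first refer in a session.

2. RANK():

Generates the rank of each row within its group. Tied rows share a rank and leave a gap in the ranks that follow.

3. DENSE_RANK():

Generates the rank of each row within its group. Tied rows share a rank and leave no gap in the ranks that follow.

4. CUME_DIST:

The number of rows with a value less than or equal to the current value, divided by the total number of rows in the group. For example, the share of employees whose salary is less than or equal to the current salary.

5. PERCENT_RANK:

(RANK of the current row within the group - 1) / (total number of rows in the group - 1)

6. NTILE(n):

Splits the ordered rows of a group into n buckets and returns the bucket number of the current row. If the rows do not divide evenly, the extra rows go to the first buckets. NTILE does not support ROWS BETWEEN, for example NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW).

3. Requirement

The data below has three fields: cookieid, createtime, and pv. Find the three records with the highest pv for each cookie. This is a per-group top-N query that takes the top three values in each group.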

cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
cookie2,2015-04-10,2
cookie2,2015-04-11,3
cookie2,2015-04-12,5
cookie2,2015-04-13,6
cookie2,2015-04-14,3
cookie2,2015-04-15,9
cookie2,2015-04-16,7

Step 1: Create the table

Create the table in Hive:

CREATE EXTERNAL TABLE cookie_pv (
cookieid string,
createtime string, 
pv INT
) ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' ;

Step 2: Prepare and load the data

On node03, run the following commands to create the data file and load it into the Hive table:

cd /kkb/install/hivedatas
vim cookiepv.txt

cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
cookie2,2015-04-10,2
cookie2,2015-04-11,3
cookie2,2015-04-12,5
cookie2,2015-04-13,6
cookie2,2015-04-14,3
cookie2,2015-04-15,9
cookie2,2015-04-16,7

Load the data into the Hive table:

load  data  local inpath '/kkb/install/hivedatas/cookiepv.txt'  overwrite into table  cookie_pv;

Step 3: Use analytic functions to get the top three PV records for each cookie

SELECT cookieid, createtime, pv, rn1, rn2, rn3
FROM (
SELECT 
cookieid,
createtime,
pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3 
FROM cookie_pv 
) t
WHERE rn1 <= 3;
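
The difference between the three ranking functions is easiest to see on one partition. The Java sketch below computes them for the pv values of cookie1 sorted in descending order. It is an illustration only, with a hypothetical class name RankingSketch, and it assumes the input array is already sorted.

```java
import java.util.Arrays;

class RankingSketch {
    // RANK(): ties share a rank, and the next rank skips ahead (gaps).
    static int[] rank(int[] pvDesc) {
        int[] out = new int[pvDesc.length];
        for (int i = 0; i < pvDesc.length; i++) {
            out[i] = (i > 0 && pvDesc[i] == pvDesc[i - 1]) ? out[i - 1] : i + 1;
        }
        return out;
    }

    // DENSE_RANK(): ties share a rank, and the next rank follows directly (no gaps).
    static int[] denseRank(int[] pvDesc) {
        int[] out = new int[pvDesc.length];
        for (int i = 0; i < pvDesc.length; i++) {
            if (i == 0) {
                out[i] = 1;
            } else {
                out[i] = (pvDesc[i] == pvDesc[i - 1]) ? out[i - 1] : out[i - 1] + 1;
            }
        }
        return out;
    }

    // ROW_NUMBER(): a plain 1-based sequence, ties are numbered separately.
    static int[] rowNumber(int[] pvDesc) {
        int[] out = new int[pvDesc.length];
        for (int i = 0; i < pvDesc.length; i++) {
            out[i] = i + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] cookie1 = {7, 5, 4, 4, 3, 2, 1};   // pv of cookie1, sorted descending
        System.out.println(Arrays.toString(rank(cookie1)));
        System.out.println(Arrays.toString(denseRank(cookie1)));
        System.out.println(Arrays.toString(rowNumber(cookie1)));
    }
}
```

With a tie at pv = 4, filtering on RANK() <= 3 keeps four rows for cookie1, while filtering on ROW_NUMBER() <= 3 keeps exactly three.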

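2. Hive user-defined functions

2.1. Overview of user-defined functions

1) Hive ships with built-in functions such as max/min, but the set is limited. User-defined functions (UDFs) make it easy to extend.

2) When the built-in functions cannot meet your business needs, consider writing a user-defined function (UDF).

3) User-defined functions fall into three categories:

(1) UDF (User-Defined Function)

One row in, one row out

(2) UDAF (User-Defined Aggregation Function)

Aggregate function: many rows in, one row out

Similar to count/max/min

(3) UDTF (User-Defined Table-Generating Function)

One row in, many rows out

For example, lateral view explode()

4) Official documentation

https://cwiki.apache.org/confluence/display/Hive/HivePlugins

5) Programming steps:

(1) Extend org.apache.hadoop.hive.ql.exec.UDF

(2) Implement the evaluate method; evaluate can be overloaded.

6) Notes

(1) A UDF must have a return type. It may return null, but the return type cannot be void.

(2) UDFs commonly use Hadoop types such as Text/LongWritable; plain Java types are not recommended.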

2.2. Developing a user-defined function

Step 1: Create a Maven Java project and add the dependencies

<repositories>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.6.0-cdh5.14.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>1.1.0-cdh5.14.2</version>
    </dependency>
</dependencies>
<build>
<plugins>
    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.0</version>
        <configuration>
            <source>1.8</source>
            <target>1.8</target>
            <encoding>UTF-8</encoding>
        </configuration>
    </plugin>
     <plugin>
         <groupId>org.apache.maven.plugins</groupId>
         <artifactId>maven-shade-plugin</artifactId>
         <version>2.2</version>
         <executions>
             <execution>
                 <phase>package</phase>
                 <goals>
                     <goal>shade</goal>
                 </goals>
                 <configuration>
                     <filters>
                         <filter>
                             <artifact>*:*</artifact>
                             <excludes>
                                 <exclude>META-INF/*.SF</exclude>
                                 <exclude>META-INF/*.DSA</exclude>
                                 <exclude>META-INF/*.RSA</exclude>
                             </excludes>
                         </filter>
                     </filters>
                 </configuration>
             </execution>
         </executions>
     </plugin>
</plugins>
</build>

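Step 2: Write a Java class that extends UDF and implements the evaluate method

package com.kkb.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class MyUDF extends UDF {
     // Convert the input text to upper case; a NULL input yields NULL.
     public Text evaluate(final Text s) {
         if (null == s) {
             return null;
         }
         return new Text(s.toString().toUpperCase());
     }
 }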

Step 3: Package the project and upload it to Hive's lib directory

Build the jar with Maven's package goal and upload it to /kkb/install/hive-1.1.0-cdh5.14.2/lib on node03.

Step 4: Add the jar

Rename the jar:

cd /kkb/install/hive-1.1.0-cdh5.14.2/lib
mv original-day_hive_udf-1.0-SNAPSHOT.jar udf.jar

Add the jar in the Hive client:

0: jdbc:hive2://node03:10000> add jar /kkb/install/hive-1.1.0-cdh5.14.2/lib/udf.jar;

Step 5: Register a function name for the custom class

0: jdbc:hive2://node03:10000> create temporary function touppercase as 'com.kkb.udf.MyUDF';

Step 6: Use the custom function

0: jdbc:hive2://node03:10000> select touppercase('abc');

How to create a permanent function in Hive

A temporary function has to be added again every time you start the Hive client, and it disappears when the client exits. A permanent function avoids this.

Creating a permanent function

1. Select the database in which the function should be created
0: jdbc:hive2://node03:10000>use myhive;

2. Use add jar to add the jar to Hive
0: jdbc:hive2://node03:10000>add jar /kkb/install/hive-1.1.0-cdh5.14.2/lib/udf.jar;

3. List all the jars that have been added
0: jdbc:hive2://node03:10000>list  jars;

4. Create the permanent function and bind it to the class
0: jdbc:hive2://node03:10000>create  function myuppercase as 'com.kkb.udf.MyUDF';

5. Show the permanent function
0: jdbc:hive2://node03:10000>show functions like 'my*';

6. Use the permanent function
0: jdbc:hive2://node03:10000>select myhive.myuppercase('helloworld');

7. Drop the permanent function
0: jdbc:hive2://node03:10000>drop function myhive.myuppercase;

8. Show the functions again
 show functions like 'my*';
