Using CASE and IF in Hive: a region statistics job (an introduction to the syntax of Hive's conditional functions CASE, IF, and COALESCE: CONDITIONAL FUNCTIONS IN HIVE)

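Preface: my own summary of HiveQL design
1. When a query gets complicated, handle it in steps: split one complex piece of logic into several simple sub-steps.
2. But whatever can be merged should be merged. For example, several CONCAT calls at the same level belong in a single SELECT.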

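In other words, when fields are processed in parallel at the same level, put them in one HiveQL statement. When fields depend on one another through sequential logic (checks, filling in missing values, calculations), run the work in steps: prepare each field separately first, then apply the simple step-by-step logic on top.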

I. Scenario: processing region data in the logs (country, province, city)
select city_id,province_id,country_id
from wizad_mdm_cleaned_hdfs
where city_id = '' or country_id = '' or province_id = ''
group by city_id,province_id,country_id
II. The logs turn out to contain empty values:
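city_id   province_id   country_id
38        (empty)       1
(empty)   73            1
(empty)   75            1
64        81            (empty)
(empty)   76            1
(empty)   (empty)       (empty)      (all empty)
(empty)   77            (empty)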
III. Define the filtering logic
if country_id = ''
         if province_id != '' then
                   if city_id = '' then CONCAT('region_','1','_',province_id)
                   else CONCAT('region_','1','_',province_id,'_',city_id)
         else
                   if city_id != '' then CONCAT('region_','1','_',parent_region_id,'_',city_id)
else
         if province_id = ''
                   if city_id != '' then CONCAT('region_',country_id,'_',parent_region_id,'_',city_id)
IV. HiveQL implementation
SET mapred.queue.names=queue3;
SET mapred.reduce.tasks=14;
DROP TABLE IF EXISTS test_lmj_mdm_tmp1;
CREATE TABLE test_lmj_mdm_tmp1 AS
SELECT
guid,
(CASE country_id
  WHEN '' THEN
    (CASE WHEN province_id = ''
          THEN IF(city_id = '', '', CONCAT('region_','1','_',parent_region_id,'_',city_id))
          ELSE IF(city_id = '', CONCAT('region_','1','_',province_id), CONCAT('region_','1','_',province_id,'_',city_id))
     END)
  ELSE
    (CASE WHEN province_id = ''
          THEN IF(city_id = '', CONCAT('region_',country_id), CONCAT('region_',country_id,'_',parent_region_id,'_',city_id))
          ELSE IF(city_id = '', CONCAT('region_',country_id,'_',province_id), CONCAT('region_',country_id,'_',province_id,'_',city_id))
     END)
END) AS region,
(CASE connection_type WHEN '2' THEN CONCAT('carrier_','wifi') ELSE CONCAT('carrier_',c.element_id) END) AS carrier,
SUM(CASE WHEN logtype = '1' THEN 1 ELSE 0 END) AS imp_pv,
SUM(CASE WHEN logtype = '2' THEN 1 ELSE 0 END) AS clk_pv
FROM wizad_mdm_cleaned_hdfs a
left outer join wizad_mdm_dev_lmj_ad_campaign_industry_brand b
ON (a.wizad_ad_id = b.ad_id)
left outer join (SELECT * FROM wizad_mdm_dev_lmj_mapping_table_analytics WHERE TYPE = '7') c
ON (a.adn_id = c.ad_network_id AND a.carrier_id = c.mapping_id)
left outer join wizad_mdm_dev_lmj_app_category_analytics d
ON (a.app_category_id = d.adn_category)
left outer join (select region_template_id,parent_region_id from wizad_mdm_dev_lmj_region_template) e
ON (a.city_id = e.region_template_id)
WHERE a.day = '2015-01-01'
GROUP BY guid,
(CASE country_id
  WHEN '' THEN
    (CASE WHEN province_id = ''
          THEN IF(city_id = '', '', CONCAT('region_','1','_',parent_region_id,'_',city_id))
          ELSE IF(city_id = '', CONCAT('region_','1','_',province_id), CONCAT('region_','1','_',province_id,'_',city_id))
     END)
  ELSE
    (CASE WHEN province_id = ''
          THEN IF(city_id = '', CONCAT('region_',country_id), CONCAT('region_',country_id,'_',parent_region_id,'_',city_id))
          ELSE IF(city_id = '', CONCAT('region_',country_id,'_',province_id), CONCAT('region_',country_id,'_',province_id,'_',city_id))
     END)
END),
(CASE connection_type WHEN '2' THEN CONCAT('carrier_','wifi') ELSE CONCAT('carrier_',c.element_id) END);
V. Analysis of the HiveQL statement

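The example above uses CASE and IF; for the syntax, see the final section, CONDITIONAL FUNCTIONS IN HIVE.
Notes:
1. A special form of CASE: CASE may be written with no expression after it, with a condition placed after each WHEN instead, e.g. case when a=1 then true else false end;
2. Any transformed expression in the SELECT list must also appear in GROUP BY, like the CASE that builds carrier (otherwise it cannot be selected). A column alias (AS) can be given in SELECT but not in GROUP BY, so GROUP BY has to repeat the whole expression. The same holds when time is processed with substring: the expression appears in both SELECT and GROUP BY.
3. Using a subquery in the left outer join filters the joined table first, which reduces the memory needed during the join.
4. Whether IF can be used depends on the Hive version.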
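
The repetition described in note 2 can be sidestepped by computing the expression once in a subquery and grouping by its alias in the outer query. The following is only a minimal sketch, not the statement used in this job: the inline rows built with UNION ALL are made-up sample data standing in for the log table, and it assumes a Hive version that accepts SELECT without FROM (0.13 or later).

```sql
SELECT carrier,
       SUM(CASE WHEN logtype = '1' THEN 1 ELSE 0 END) AS imp_pv,
       SUM(CASE WHEN logtype = '2' THEN 1 ELSE 0 END) AS clk_pv
FROM (
    -- The CASE expression is written once here and given an alias.
    SELECT (CASE connection_type
                WHEN '2' THEN CONCAT('carrier_', 'wifi')
                ELSE CONCAT('carrier_', element_id)
            END) AS carrier,
           logtype
    FROM (
        -- Hypothetical sample rows; replace with the real joined tables.
        SELECT '2' AS connection_type, '7' AS element_id, '1' AS logtype
        UNION ALL
        SELECT '2' AS connection_type, '7' AS element_id, '2' AS logtype
        UNION ALL
        SELECT '1' AS connection_type, '3' AS element_id, '1' AS logtype
    ) src
) t
-- The outer query can group by carrier because it is now a real column of t.
GROUP BY carrier;
```

The trade-off is an extra level of nesting in exchange for writing each derived expression only once.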

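VI. Refactoring the HiveQL design
Beginners like me tend to design convoluted logic and end up with monstrous statements.
In practice, an experienced person facing overly complex logic works in steps. A colleague who is very good at SQL refactored the example above into two steps:
 - 1) First fill in a reasonable value for each field (fill in what can be filled in; set the rest to empty)
 - 2) Then, when building region, simply filter out the records with invalid values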
6.1 Step 1 statement
DROP TABLE IF EXISTS test_lmj_mdm_tmp;
CREATE TABLE test_lmj_mdm_tmp AS
SELECT
guid,
CONCAT('adn_',adn_id) AS adn,
CONCAT('time_',substr(createtime,12,2)) AS hour,
CONCAT('os_',os_id) AS os,
case when (country_id = '' or country_id = 'NULL' or country_id is null)
            and (province_id = '' or province_id = 'NULL' or province_id is null)
            and (city_id = '' or city_id = 'NULL' or city_id is null)
        then ''
     when (country_id = '' or country_id = 'NULL' or country_id is null)
            and (province_id <> '' or province_id <> 'NULL' or province_id is not null or city_id <> '' or city_id <> 'NULL' or city_id is not null)
        then '1'
     else country_id end as country_id,
case when (province_id = '' or province_id = 'NULL' or province_id is null)
            and e.parent_region_id <> '' and e.parent_region_id <> 'NULL' and e.parent_region_id is not null
        then e.parent_region_id
     else province_id end as province_id,
city_id,
CONCAT('campaign_',b.campaign_id) AS campaign,
CONCAT('interest_',b.industry_id) AS interest,
CONCAT('brand_',b.brand_id) AS brand,
(CASE connection_type WHEN '2' THEN CONCAT('carrier_','wifi') ELSE CONCAT('carrier_',c.element_id) END) AS carrier,
CONCAT('appcategory_',d.wizad_category) AS appcategory,
uid,
SUM(CASE WHEN logtype = '1' THEN 1 ELSE 0 END) AS imp_pv,
SUM(CASE WHEN logtype = '2' THEN 1 ELSE 0 END) AS clk_pv
FROM ${clean_log_table} a
left outer join wizad_mdm_dev_lmj_ad_campaign_industry_brand b
ON (a.wizad_ad_id = b.ad_id)
left outer join (SELECT * FROM wizad_mdm_dev_lmj_mapping_table_analytics WHERE TYPE = '7') c
ON (a.adn_id = c.ad_network_id AND a.carrier_id = c.mapping_id)
left outer join wizad_mdm_dev_lmj_app_category_analytics d
ON (a.app_category_id = d.adn_category)
left outer join (select region_template_id, parent_region_id from wizad_mdm_dev_lmj_region_template) e
ON (a.city_id = e.region_template_id)
WHERE a.day < '${pt}' and a.day >= '${time_span}'
GROUP BY guid,
CONCAT('adn_',adn_id),
CONCAT('time_',substr(createtime,12,2)),
CONCAT('os_',os_id),
case when (country_id = '' or country_id = 'NULL' or country_id is null)
          and (province_id = '' or province_id = 'NULL' or province_id is null)
          and (city_id = '' or city_id = 'NULL' or city_id is null)
          then ''
     when (country_id = '' or country_id = 'NULL' or country_id is null)
          and (province_id <> '' or province_id <> 'NULL' or province_id is not null or city_id <> '' or city_id <> 'NULL' or city_id is not null)
          then '1'
     else country_id end,
case when (province_id = '' or province_id = 'NULL' or province_id is null)
          and e.parent_region_id <> '' and e.parent_region_id <> 'NULL' and e.parent_region_id is not null
          then e.parent_region_id
     else province_id end,
city_id,
CONCAT('campaign_',b.campaign_id),
CONCAT('interest_',b.industry_id),
CONCAT('brand_',b.brand_id),
(CASE connection_type WHEN '2' THEN CONCAT('carrier_','wifi') ELSE CONCAT('carrier_',c.element_id) END),
CONCAT('appcategory_',d.wizad_category),
UID;
6.2 Step 2 statement
SELECT guid,
CONCAT('region_',country_id,'_',province_id,(case when city_id <> '' and city_id <> 'NULL' and city_id is not null then concat('_',city_id) else '' end)) AS fixeddim,
UID,
SUM(imp_pv) AS pv
FROM test_lmj_mdm_tmp
where imp_pv > 0
and country_id <> ''
and country_id <> 'NULL'
and country_id is not null
and province_id <> ''
and province_id <> 'NULL'
and province_id is not null
GROUP BY guid,
CONCAT('region_',country_id,'_',province_id,(case when city_id <> '' and city_id <> 'NULL' and city_id is not null then concat('_',city_id) else '' end)),
UID;
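
Both steps repeat the same three-way test for a missing value: empty string, the literal string 'NULL', or a real NULL. As a hedged sketch of a shorter equivalent (not part of the colleague's refactoring), COALESCE can turn a real NULL into '' so that a single IN test covers all three cases. The inline rows are made-up sample data, and SELECT without FROM assumes Hive 0.13 or later.

```sql
SELECT city_id,
       -- Original form: three separate comparisons.
       (city_id = '' OR city_id = 'NULL' OR city_id IS NULL) AS is_missing_long,
       -- Shorter form: map NULL to '' first, then test both markers at once.
       (COALESCE(city_id, '') IN ('', 'NULL'))                AS is_missing_short
FROM (
    -- Hypothetical sample rows covering each kind of value.
    SELECT '38' AS city_id
    UNION ALL
    SELECT '' AS city_id
    UNION ALL
    SELECT 'NULL' AS city_id
    UNION ALL
    SELECT CAST(NULL AS STRING) AS city_id
) sample_rows;
```

The two columns should agree on every row: false for '38' and true for the other three.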

The following is quoted from the web.

VII. CONDITIONAL FUNCTIONS IN HIVE

Hive supports three types of conditional functions. These functions
are listed below:

IF( Test Condition, True Value, False Value )

The IF condition evaluates the "Test Condition" and if the "Test
Condition" is true, then it returns the "True Value". Otherwise, it
returns the "False Value". Example: IF(1=1, 'working', 'not working')
returns 'working'.
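
A minimal sketch of IF in a query, first with the literal example from the text and then applied to a column. The inline rows and the labels 'no city' and 'city_' are made up for illustration, and SELECT without FROM assumes Hive 0.13 or later.

```sql
-- Literal example from the text.
SELECT IF(1 = 1, 'working', 'not working') AS status;

-- Applied to a column: label each row by whether city_id is empty.
SELECT city_id,
       IF(city_id = '', 'no city', CONCAT('city_', city_id)) AS city_label
FROM (
    -- Hypothetical sample rows.
    SELECT '38' AS city_id
    UNION ALL
    SELECT '' AS city_id
) sample_rows;
```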

COALESCE( value1,value2,… )

The COALESCE function returns the first not NULL value from the list of
values. If all the values in the list are NULL, then it returns NULL.
Example: COALESCE(NULL,NULL,5,NULL,4) returns 5
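
As a hedged illustration, COALESCE can express a fallback chain such as "use province_id, otherwise parent_region_id, otherwise an empty string". The inline rows are made-up sample data, and SELECT without FROM assumes Hive 0.13 or later.

```sql
-- Literal example from the text: the first non-NULL argument is 5.
SELECT COALESCE(NULL, NULL, 5, NULL, 4) AS first_non_null;

-- Fallback chain over two columns, ending in a default of ''.
SELECT COALESCE(province_id, parent_region_id, '') AS province_filled
FROM (
    -- Hypothetical sample rows: each one has a different column missing.
    SELECT CAST(NULL AS STRING) AS province_id, '81' AS parent_region_id
    UNION ALL
    SELECT '73' AS province_id, CAST(NULL AS STRING) AS parent_region_id
) sample_rows;
```

COALESCE only skips real NULLs. An empty string or the literal text 'NULL' counts as a value, which is why the statements in section 6.1 test for those explicitly.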

CASE Statement

The syntax for the case statement is:

    CASE [ expression ]
        WHEN condition1 THEN result1
        WHEN condition2 THEN result2
        ...
        WHEN conditionn THEN resultn
        ELSE result
    END

Here expression is optional. It is the value that you are comparing to
the list of conditions. (ie: condition1, condition2, … conditionn).

All the conditions must be of same datatype. Conditions are evaluated
in the order listed. Once a condition is found to be true, the case
statement will return the result and not evaluate the conditions any
further.

All the results must be of same datatype. This is the value returned
once a condition is found to be true.

Reposted from: http://www.folkstalk.com/2011/11/conditional-functions-in-hive.html

If no condition is found to be true, then the case statement will
return the value in the ELSE clause. If the ELSE clause is omitted and
no condition is found to be true, then the case statement will return
NULL.

Example:

    CASE   Fruit
        WHEN 'APPLE' THEN 'The owner is APPLE'
        WHEN 'ORANGE' THEN 'The owner is ORANGE'
        ELSE 'It is another Fruit'
    END

The other form of CASE is

    CASE 
         WHEN Fruit = 'APPLE' THEN 'The owner is APPLE'
         WHEN Fruit = 'ORANGE' THEN 'The owner is ORANGE'
         ELSE 'It is another Fruit'
    END
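
A small sketch of what happens when ELSE is omitted, using both CASE forms side by side. The inline rows are made-up sample data, and SELECT without FROM assumes Hive 0.13 or later.

```sql
SELECT fruit,
       -- Simple form with no ELSE: rows that match no WHEN yield NULL.
       CASE fruit WHEN 'APPLE' THEN 'The owner is APPLE' END AS simple_form,
       -- Searched form wrapped in COALESCE to replace that NULL with a default.
       COALESCE(CASE WHEN fruit = 'APPLE' THEN 'The owner is APPLE' END,
                'It is another Fruit') AS with_default
FROM (
    -- Hypothetical sample rows.
    SELECT 'APPLE' AS fruit
    UNION ALL
    SELECT 'PEAR' AS fruit
) sample_rows;
```

For the 'PEAR' row, simple_form should be NULL while with_default falls back to 'It is another Fruit'. Writing the ELSE clause directly is usually the clearer choice.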