Hive Learning (5) --- Using Partitions (Including Dynamic Partitions)

The following article gives a good explanation of how to use Partitions:

http://www.aahyhaa.com/archives/316

Other reference articles:

http://p-x1984.iteye.com/blog/1156408

http://www.cnblogs.com/tangtianfly/archive/2012/03/13/2393449.html

http://www.2cto.com/kf/201210/160777.html

http://blog.csdn.net/acmilanvanbasten/article/details/17252673



However, one more point needs to be added here; it was also a misconception I had while learning:

For a table with partition columns, imported data can only be loaded into an explicitly specified partition. I used to think that, during import, the data would automatically be split into partitions according to the column values. What is the difference?

For example, my table is partitioned by city, and I have a weather data file covering several cities, roughly like this:

2014-05-23|07:33:58 China shenzhen rain -28 -21 199
2014-05-23|07:33:58 China hangzhou fine -26 -19 200
2014-05-23|07:33:58 China hangzhou fine 6 14 200

Then I load this file into the table: load data inpath '/tmp/wetherdata4.txt' into table weatherpartion partition(city='hangzhou');

My expectation was that two partition directories would be created based on the city field, one named hangzhou and one named shenzhen, with the shenzhen row placed in the shenzhen partition and the two hangzhou rows placed in the hangzhou partition.

In reality, only the hangzhou partition was created, and all three rows were loaded into it. Clearly the data and the partitions are now inconsistent.
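A quick way to see this (a minimal sketch; assuming, for simplicity, a table partitioned only by city, as in the load statement above) is to list the partitions that exist after the load:

    -- only the partition named in the load statement exists,
    -- no matter which cities appear inside the data file
    show partitions weatherpartion;
    -- city=hangzhou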

Thinking more carefully about the use cases for partitions, my understanding is:

1. When data files are generated, they are already split by some field. A common scenario is log files: one file is produced per day, and all records from the same day go into that file.

2. The different data files are then accumulated into one table for large-scale analysis.


With this in mind, it becomes easier to understand why, at load time, one file can only be imported into one explicitly specified partition (possibly a unique partition specified by several keys, e.g. city=hangzhou, country=china).

Looking again at the CREATE TABLE statement, the partition columns are not among the regular columns of the table (the city column, for example). In other words, a partition column should not be treated as a real data column; it is better thought of as an auxiliary column. For HQL syntax support, Hive adds the partition columns to the table schema when the table is created.

For example, suppose each city reports its weather data to some central agency, and the agency partitions its table by city. The data each city reports does not need to contain a city column, because within one city the value would be the same everywhere anyway. When loading the data, the agency simply specifies partition(city='XXX'), and each city's data lands in the corresponding directory (see the load sketch after the table definition below). When querying the weather of a given city, Hive only reads the raw data files under that city's directory and skips the files of every other city, which improves efficiency.

    create table weatherpartion
    (date string, weath string, 
    minTemperat int, maxTemperat int,
    pmvalue int) partitioned by (country string, city string)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ' '
    STORED AS TEXTFILE;
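With this design, each city's file is loaded into its own partition explicitly. A minimal sketch (the file paths /tmp/weather_hangzhou.txt and /tmp/weather_shenzhen.txt are hypothetical):

    -- each file holds one city's data and goes into that city's partition
    load data inpath '/tmp/weather_hangzhou.txt'
      into table weatherpartion partition(country='china', city='hangzhou');

    load data inpath '/tmp/weather_shenzhen.txt'
      into table weatherpartion partition(country='china', city='shenzhen');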


Hive's partition design actually matches its philosophy: never modify the data files. If Hive automatically partitioned data by some field, it would inevitably have to split a large file into many small ones, which would violate the principle of leaving data files untouched.


To verify the performance impact, we run an experiment: with the same data, one table is partitioned by pcity and the other is not partitioned, and we compare the execution time of the same HQL.

Partition layout:

hive> dfs -ls /user/hive/warehouse/weatherpartion;
Found 6 items
drwxr-xr-x   - hadoop supergroup          0 2014-05-24 13:33 /user/hive/warehouse/weatherpartion/pcity=beijin
drwxr-xr-x   - hadoop supergroup          0 2014-05-24 13:34 /user/hive/warehouse/weatherpartion/pcity=guangzhou
drwxr-xr-x   - hadoop supergroup          0 2014-05-24 13:34 /user/hive/warehouse/weatherpartion/pcity=hangzhou
drwxr-xr-x   - hadoop supergroup          0 2014-05-24 13:34 /user/hive/warehouse/weatherpartion/pcity=nanjing
drwxr-xr-x   - hadoop supergroup          0 2014-05-24 13:34 /user/hive/warehouse/weatherpartion/pcity=shanghai
drwxr-xr-x   - hadoop supergroup          0 2014-05-24 13:34 /user/hive/warehouse/weatherpartion/pcity=shenzhen


Under the same conditions, the same HQL statement is executed several times and the elapsed times are averaged:

1. Unpartitioned weather table, three-table join with ordering:

select cy.number,wh.*,pm.pmlevel
from cityinfo cy join weather wh on (cy.name=wh.city) 
join pminfo pm on (pm.pmvalue=wh.pmvalue) 
where wh.city='hangzhou' and wh.weath='fine' and wh.minTemperat in 
( -18,25,43) order by maxTemperat DESC limit 20;

Executed 5 times; the timings are as follows:

Job 0: Map: 5  Reduce: 2   Cumulative CPU: 41.52 sec   HDFS Read: 1061735802 HDFS Write: 6918992 SUCCESS
Job 1: Map: 2  Reduce: 1   Cumulative CPU: 5.14 sec   HDFS Read: 6923192 HDFS Write: 7232163 SUCCESS
Job 2: Map: 1  Reduce: 1   Cumulative CPU: 1.81 sec   HDFS Read: 7232532 HDFS Write: 1126 SUCCESS
Total MapReduce CPU Time Spent: 48 seconds 470 msec

Time taken: 72.781 seconds, Fetched: 20 row(s)


Job 0: Map: 5  Reduce: 2   Cumulative CPU: 45.71 sec   HDFS Read: 1061735802 HDFS Write: 6918992 SUCCESS
Job 1: Map: 2  Reduce: 1   Cumulative CPU: 5.25 sec   HDFS Read: 6923192 HDFS Write: 7232163 SUCCESS
Job 2: Map: 1  Reduce: 1   Cumulative CPU: 1.82 sec   HDFS Read: 7232532 HDFS Write: 1126 SUCCESS
Total MapReduce CPU Time Spent: 52 seconds 780 msec

Time taken: 66.584 seconds, Fetched: 20 row(s)


Job 0: Map: 5  Reduce: 2   Cumulative CPU: 43.55 sec   HDFS Read: 1061735802 HDFS Write: 6918992 SUCCESS
Job 1: Map: 2  Reduce: 1   Cumulative CPU: 5.29 sec   HDFS Read: 6923192 HDFS Write: 7232163 SUCCESS
Job 2: Map: 1  Reduce: 1   Cumulative CPU: 1.82 sec   HDFS Read: 7232532 HDFS Write: 1126 SUCCESS
Total MapReduce CPU Time Spent: 50 seconds 660 msec

Time taken: 62.12 seconds, Fetched: 20 row(s)


Job 0: Map: 5  Reduce: 2   Cumulative CPU: 41.09 sec   HDFS Read: 1061735802 HDFS Write: 6918992 SUCCESS
Job 1: Map: 2  Reduce: 1   Cumulative CPU: 5.12 sec   HDFS Read: 6923192 HDFS Write: 7232163 SUCCESS
Job 2: Map: 1  Reduce: 1   Cumulative CPU: 1.8 sec   HDFS Read: 7232532 HDFS Write: 1126 SUCCESS
Total MapReduce CPU Time Spent: 48 seconds 10 msec

Time taken: 58.33 seconds, Fetched: 20 row(s)


Job 0: Map: 5  Reduce: 2   Cumulative CPU: 42.68 sec   HDFS Read: 1061735802 HDFS Write: 6918992 SUCCESS
Job 1: Map: 2  Reduce: 1   Cumulative CPU: 5.0 sec   HDFS Read: 6923192 HDFS Write: 7232163 SUCCESS
Job 2: Map: 1  Reduce: 1   Cumulative CPU: 1.82 sec   HDFS Read: 7232532 HDFS Write: 1126 SUCCESS
Total MapReduce CPU Time Spent: 49 seconds 500 msec

Time taken: 62.355 seconds, Fetched: 20 row(s)


2. The weather table partitioned by city, same three-table join with ordering:

select cy.number,wh.*,pm.pmlevel
from cityinfo cy join weatherpartion wh on (cy.name=wh.city) 
join pminfo pm on (pm.pmvalue=wh.pmvalue) 
where wh.pcity='hangzhou' and wh.weath='fine' and wh.minTemperat in 
( -18,25,43) order by maxTemperat DESC limit 20;

Executed 5 times; three of the runs are shown below:

Job 0: Map: 2  Reduce: 1   Cumulative CPU: 10.68 sec   HDFS Read: 172140323 HDFS Write: 7793860 SUCCESS
Job 1: Map: 2  Reduce: 1   Cumulative CPU: 5.35 sec   HDFS Read: 7797836 HDFS Write: 7997910 SUCCESS
Job 2: Map: 1  Reduce: 1   Cumulative CPU: 1.82 sec   HDFS Read: 7998279 HDFS Write: 1306 SUCCESS
Total MapReduce CPU Time Spent: 17 seconds 850 msec

Time taken: 48.127 seconds, Fetched: 20 row(s)


MapReduce Jobs Launched: 
Job 0: Map: 2  Reduce: 1   Cumulative CPU: 10.4 sec   HDFS Read: 172140323 HDFS Write: 7793860 SUCCESS
Job 1: Map: 2  Reduce: 1   Cumulative CPU: 5.31 sec   HDFS Read: 7797836 HDFS Write: 7997910 SUCCESS
Job 2: Map: 1  Reduce: 1   Cumulative CPU: 1.84 sec   HDFS Read: 7998279 HDFS Write: 1306 SUCCESS
Total MapReduce CPU Time Spent: 17 seconds 550 msec

Time taken: 47.386 seconds, Fetched: 20 row(s)


MapReduce Jobs Launched: 
Job 0: Map: 2  Reduce: 1   Cumulative CPU: 10.8 sec   HDFS Read: 172140323 HDFS Write: 7793860 SUCCESS
Job 1: Map: 2  Reduce: 1   Cumulative CPU: 5.38 sec   HDFS Read: 7797835 HDFS Write: 7997910 SUCCESS
Job 2: Map: 1  Reduce: 1   Cumulative CPU: 1.85 sec   HDFS Read: 7998278 HDFS Write: 1306 SUCCESS
Total MapReduce CPU Time Spent: 18 seconds 30 msec

Time taken: 47.853 seconds, Fetched: 20 row(s)


3. Conclusion

CPU time drops significantly, but the total elapsed time does not improve nearly as much, because the overall time also includes scheduling, communication, shuffling, splitting, and so on.
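To see that the partitioned query really reads only the hangzhou directory, the query plan can be inspected. A sketch using the same query as above (the exact plan output depends on the Hive version):

    -- the plan should list only the pcity=hangzhou partition as input,
    -- which is where the smaller HDFS Read numbers above come from
    explain extended
    select cy.number, wh.*, pm.pmlevel
    from cityinfo cy join weatherpartion wh on (cy.name = wh.city)
    join pminfo pm on (pm.pmvalue = wh.pmvalue)
    where wh.pcity = 'hangzhou' and wh.weath = 'fine'
      and wh.minTemperat in (-18, 25, 43)
    order by maxTemperat DESC limit 20;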


4. Dynamic Partitions

Static partitioning requires the target partition to be specified at load time. If there is a lot of data spread over many different partitions, you would have to run the command N times by hand, which is quite tedious, so Hive provides dynamic partitioning.

That is, Hive can distribute rows into the corresponding partitions based on the configured partition columns. Dynamic partitioning has a strict mode and a nonstrict mode.

Dynamic partitioning is disabled by default and has to be enabled manually. Once enabled, it runs in strict mode by default.

Command to enable dynamic partitioning: set hive.exec.dynamic.partition=true;
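For reference, the related settings look like this (hive.exec.dynamic.partition.mode switches between strict and nonstrict; the partition-count limits are shown with their usual defaults, which may vary by version):

    set hive.exec.dynamic.partition=true;              -- enable dynamic partitioning
    set hive.exec.dynamic.partition.mode=strict;       -- strict: at least one static partition column required
    -- set hive.exec.dynamic.partition.mode=nonstrict; -- nonstrict: all partition columns may be dynamic
    set hive.exec.max.dynamic.partitions=1000;         -- cap on dynamic partitions per statement
    set hive.exec.max.dynamic.partitions.pernode=100;  -- cap per mapper/reducer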


The following example illustrates this.

Requirement: partition the weather data by city and by weather condition (fine, rain) as a two-level partition. The example uses strict mode, so city is a static partition and the weather condition weath is a dynamic partition, which together form the two levels.

Step one: create the target table and specify the partition columns:

  create table weather_sub
    (date string, pmvalue int)
    partitioned by (city string, weath string)   -- two partition columns are specified here
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ' '
    STORED AS TEXTFILE;

Then run the dynamic-partition insert statement:
insert overwrite table weather_sub
    partition (city='hangzhou', weath)
    -- note: only three columns are selected here, although the target table
    -- effectively has four; the missing one is exactly weather_sub.city
    select w.date, w.pmvalue, w.weath
    from weather w
    where w.city='hangzhou';

Here, w.date and w.pmvalue map to the target table's date and pmvalue columns, the city column takes the statically specified value 'hangzhou', and w.weath feeds the dynamic weath partition.


Check the result:

hive> select * from weather_sub where city='hangzhou' and weath='fine' limit 5;
OK
2014-05-23|07:33:58     200     hangzhou        fine
2014-05-23|07:33:58     200     hangzhou        fine
2014-05-23|07:33:58     200     hangzhou        fine
2014-05-23|07:33:58     200     hangzhou        fine
2014-05-23|07:33:58     200     hangzhou        fine
Time taken: 0.101 seconds, Fetched: 5 row(s)

Querying by partition is very fast; no MapReduce job is launched.


Check the HDFS layout:

hive> dfs -ls /user/hive/warehouse/weather_sub/city=hangzhou;                  
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2014-06-06 22:42 /user/hive/warehouse/weather_sub/city=hangzhou/weath=cloudy
drwxr-xr-x   - hadoop supergroup          0 2014-06-06 22:42 /user/hive/warehouse/weather_sub/city=hangzhou/weath=fine
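For comparison, under nonstrict mode both partition columns could be taken from the query itself, so all cities would be handled in one statement. A sketch (not run in this experiment):

    set hive.exec.dynamic.partition.mode=nonstrict;

    -- both city and weath are dynamic; their values come from the last two
    -- columns of the select list, in partition-column order
    insert overwrite table weather_sub
    partition (city, weath)
    select w.date, w.pmvalue, w.city, w.weath
    from weather w;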


Finally, the command to drop the table:

drop table weather_sub;
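If only one partition needs to be removed rather than the whole table, a single partition can be dropped instead (a sketch using the partitions created above):

    -- removes just the hangzhou/fine partition and its directory
    alter table weather_sub drop partition (city='hangzhou', weath='fine');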
