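I. Theoretical Background
1. Background of Hive partitions
A Hive SELECT query normally scans the entire table, which wastes a lot of time on unnecessary work. Often only a subset of the data is of interest, so Hive lets you define partitions when creating a table.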
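2. What a Hive partition really is
Hive is essentially an abstraction over data stored in HDFS, so each Hive partition name corresponds to an HDFS directory name; the partition column is not an actual field stored in the data files.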
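To make the partition-to-directory mapping concrete, here is a minimal Python sketch. It is only a model of the naming scheme, not Hive's own code: the helper `partition_path` and the `WAREHOUSE` location are assumptions chosen for illustration, and the real warehouse directory depends on the cluster configuration.

```python
from posixpath import join

# Assumed warehouse location; the real one depends on the cluster configuration.
WAREHOUSE = "/user/hive/warehouse"

def partition_path(table_dir, partition_spec):
    """Build the directory path for one partition.

    partition_spec is an ordered list of (column, value) pairs,
    outermost partition column first.
    """
    levels = [f"{column}={value}" for column, value in partition_spec]
    return join(table_dir, *levels)

# Hypothetical table directory for a table named student.
student_dir = join(WAREHOUSE, "student")
single = partition_path(student_dir, [("classroom", "002")])
print(single)
```

The partition value appears only in the directory name, which is why the column is not stored in the data files themselves.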
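3. Why Hive partitions matter
Partitions assist queries by narrowing the range of data scanned, which speeds up retrieval, and they let data be managed according to defined rules and conditions.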
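The effect of narrowing the scan can be illustrated with a toy Python model. This is a sketch under stated assumptions, not how Hive is implemented: the dictionary `table` stands in for partition directories, the function `scan` is hypothetical, and the sample rows mirror the student examples used in this article.

```python
# Toy in-memory "table": each key is a partition directory name,
# each value is the list of rows stored in that directory.
table = {
    "classroom=002": [("001", "xiaohong"), ("002", "xiaolan")],
    "classroom=003": [("001", "xiaohong"), ("002", "xiaolan")],
}

def scan(table, classroom=None):
    """Return (rows, partitions_read).

    With a classroom filter only the matching partition is read;
    without one, every partition is read.
    """
    wanted = None if classroom is None else f"classroom={classroom}"
    rows, partitions_read = [], 0
    for partition, part_rows in table.items():
        if wanted is not None and partition != wanted:
            continue  # pruned: this directory is never opened
        partitions_read += 1
        value = partition.split("=", 1)[1]
        # The partition value is appended from the directory name.
        rows.extend((rid, name, value) for rid, name in part_rows)
    return rows, partitions_read

all_rows, read_all = scan(table)
some_rows, read_some = scan(table, classroom="002")
print(read_all, len(all_rows))
print(read_some, len(some_rows))
```

With the filter, half of the partitions are skipped entirely; on a table with many partitions the saving grows accordingly.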
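4. Common partitioning schemes
Data in Hive tables is usually partitioned along dimensions such as time, region, or category.
II. Partition Operations
(I) Static partitions
1. Single-level partition
(1) Create the table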
hive> create table student(id string,name string) partitioned by(classRoom string) row format delimited fields terminated by ',';
OK
Time taken: 0.259 seconds
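Note: partitioned by() must come before row format...; the partition columns inside partitioned by() must not duplicate any of the table's regular columns, otherwise an error is raised.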
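The duplicate-column rule in the note can be sketched as a small check. This is an illustrative model only: `check_partition_columns` is a hypothetical helper, the error text is not Hive's actual message, and the case-insensitive comparison is an assumption based on classRoom being reported as classroom in the output.

```python
def check_partition_columns(table_columns, partition_columns):
    """Raise ValueError if a partition column repeats a regular column.

    Names are compared case-insensitively (an assumption for this sketch).
    """
    regular = {name.lower() for name in table_columns}
    clashes = [name for name in partition_columns if name.lower() in regular]
    if clashes:
        raise ValueError(f"partition column(s) repeat table columns: {clashes}")

# Same shape as the student table: accepted.
check_partition_columns(["id", "name"], ["classRoom"])

# A partition column that repeats a regular column: rejected.
try:
    check_partition_columns(["id", "name"], ["Name"])
    rejected = False
except ValueError as err:
    rejected = True
    print(err)
```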
(2) Load data
hive> load data local inpath '/home/test/stu.txt' into table student partition(classroom='002');
Loading data to table default.student partition (classroom=002)
OK
Time taken: 1.102 seconds
(3) View the partitions
hive> show partitions student;
OK
classroom=002
Time taken: 0.071 seconds, Fetched: 1 row(s)
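(4) How the partition appears in HDFS
The partition shows up as a subdirectory named classroom=002 under the table's directory.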
(5) Load another set of data into a new partition
hive> load data local inpath '/home/test/stu.txt' into table student partition(classroom='003');
Loading data to table default.student partition (classroom=003)
OK
Time taken: 0.722 seconds
hive> select * from student;
OK
001 xiaohong 002
002 xiaolan 002
001 xiaohong 003
002 xiaolan 003
Time taken: 0.097 seconds, Fetched: 4 row(s)
hive> show partitions student;
OK
classroom=002
classroom=003
Time taken: 0.071 seconds, Fetched: 2 row(s)
2. Multi-level partitions
(1) Create the table
hive> create table stu(id string,name string) partitioned by(school string,classRoom string) row format delimited fields terminated by ',';
OK
Time taken: 0.074 seconds
hive> desc stu;
OK
id string
name string
school string
classroom string
# Partition Information
# col_name data_type comment
school string
classroom string
Time taken: 0.03 seconds, Fetched: 10 row(s)
(2) Load data
hive> load data local inpath '/home/test/stu.txt' into table stu partition(school='AA',classroom='005');
Loading data to table default.stu partition (school=AA, classroom=005)
OK
Time taken: 0.779 seconds
hive> select * from stu;
OK
001 xiaohong AA 005
002 xiaolan AA 005
Time taken: 0.087 seconds, Fetched: 2 row(s)
(3) View the partitions
hive> show partitions stu;
OK
school=AA/classroom=005
Time taken: 0.048 seconds, Fetched: 1 row(s)
Note: this is a nested directory, with classroom=005 inside school=AA.
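The nested layout can be read straight off a line of `show partitions` output. The following Python sketch splits such a line into its levels; `parse_partition` is a hypothetical helper, and it ignores any escaping Hive may apply to special characters in partition values.

```python
def parse_partition(line):
    """Split 'school=AA/classroom=005' into a dict of column -> value.

    Levels are kept in order, outermost directory first.
    """
    spec = {}
    for level in line.strip().split("/"):
        column, value = level.split("=", 1)
        spec[column] = value
    return spec

parsed = parse_partition("school=AA/classroom=005")
depth = len(parsed)  # number of nested directory levels
print(parsed)
print(depth)
```

Each `/` in the partition name corresponds to one more directory level, so the order of the columns in partitioned by() fixes the order of nesting.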
(4) How the partitions appear in HDFS
(5) Result of adding more data
hive> load data local inpath '/home/test/stu.txt' into table stu partition(school='BB',classroom='001');
Loading data to table default.stu partition (school=BB, classroom=001)
OK
Time taken: 0.272 seconds
hive> load data local inpath '/home/test/stu.txt' into table stu partition(school='AA',classroom='001');
Loading data to table default.stu partition (school=AA, classroom=001)
OK
Time taken: 0.268 seconds
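(II) Dynamic partitions
The main difference between static and dynamic partitions is that static partitions are specified by hand, whereas dynamic partitions are inferred from the data. More precisely, the values of static partition columns are fixed at compile time from what the user supplies, while dynamic partition values can only be determined when the SQL executes.
1. Enable Hive dynamic partitioning
Set two parameters in the Hive session: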
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
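The second parameter matters because strict mode requires at least one static partition column, while nonstrict mode allows every partition column to be dynamic. A rough Python model of that rule follows; `check_partition_spec` is a hypothetical helper, and using `None` to mark a dynamic column is a convention of this sketch only.

```python
def check_partition_spec(spec, mode="strict"):
    """spec maps partition column -> value; None marks a dynamic column.

    Raises ValueError in strict mode when every column is dynamic.
    """
    if mode not in ("strict", "nonstrict"):
        raise ValueError(f"unknown mode: {mode}")
    has_static = any(value is not None for value in spec.values())
    if mode == "strict" and not has_static:
        raise ValueError("strict mode needs at least one static partition column")
    return True

all_dynamic = {"school": None, "classroom": None}
mixed = {"school": "AA", "classroom": None}

ok_nonstrict = check_partition_spec(all_dynamic, mode="nonstrict")
ok_mixed = check_partition_spec(mixed, mode="strict")
try:
    check_partition_spec(all_dynamic, mode="strict")
    strict_rejected = False
except ValueError:
    strict_rejected = True
print(ok_nonstrict, ok_mixed, strict_rejected)
```

A fully dynamic insert such as partition(school,classroom) therefore needs nonstrict mode.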
2. Create the tables
(1) First, prepare a table that has static partitions
hive> select * from stu;
OK
001 xiaohong AA 001
002 xiaolan AA 001
001 xiaohong AA 005
002 xiaolan AA 005
001 xiaohong BB 001
002 xiaolan BB 001
Time taken: 0.105 seconds, Fetched: 6 row(s)
(2) Copy a table with the same structure
hive> create table stu01 like stu;
OK
Time taken: 0.068 seconds
hive> desc stu01;
OK
id string
name string
school string
classroom string
# Partition Information
# col_name data_type comment
school string
classroom string
Time taken: 0.022 seconds, Fetched: 10 row(s)
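(3) Load the data; partitioning succeeds
Do not specify a particular school or classroom; let Hive assign the partitions automatically.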
hive> insert overwrite table stu01 partition(school,classroom)
> select * from stu;
hive> select * from stu;
OK
001 xiaohong AA 001
002 xiaolan AA 001
001 xiaohong AA 005
002 xiaolan AA 005
001 xiaohong BB 001
002 xiaolan BB 001
Time taken: 0.091 seconds, Fetched: 6 row(s)
hive> select * from stu01;
OK
001 xiaohong AA 001
002 xiaolan AA 001
001 xiaohong AA 005
002 xiaolan AA 005
001 xiaohong BB 001
002 xiaolan BB 001
Time taken: 0.081 seconds, Fetched: 6 row(s)
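What the dynamic-partition insert does with each row can be modeled in a few lines of Python. This is a sketch, not Hive's implementation: `route_rows` is a hypothetical helper that treats the trailing values of each selected row as the partition values, matched by position rather than by column name.

```python
from collections import defaultdict

def route_rows(rows, partition_columns):
    """Group rows by their trailing values, one group per partition."""
    n = len(partition_columns)
    if n == 0:
        raise ValueError("at least one partition column is required")
    partitions = defaultdict(list)
    for row in rows:
        data, values = row[:-n], row[-n:]
        key = "/".join(f"{c}={v}" for c, v in zip(partition_columns, values))
        partitions[key].append(data)
    return dict(partitions)

# The six rows returned by select * from stu.
rows = [
    ("001", "xiaohong", "AA", "001"),
    ("002", "xiaolan", "AA", "001"),
    ("001", "xiaohong", "AA", "005"),
    ("002", "xiaolan", "AA", "005"),
    ("001", "xiaohong", "BB", "001"),
    ("002", "xiaolan", "BB", "001"),
]
routed = route_rows(rows, ["school", "classroom"])
for name in sorted(routed):
    print(name, routed[name])
```

Because matching is positional, select * works here only because stu lists school and classroom last, in the same order as the partition clause of stu01.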