22 Hive Partitioning and Bucketing (好程序)

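Why partition?
As a system keeps running, its tables keep growing. A Hive query with nothing to narrow it down scans the whole table, so it reads a large amount of data it does not need and query efficiency drops sharply.
Partitioning was introduced to avoid these full-table scans and improve query efficiency.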

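The partitioning technique
PARTITIONED BY (column_name data_type)
1. Partition values are case-sensitive (country='china' and country='CHINA' are two different partitions); the partition column name itself is not.
2. A Hive partition is essentially a subdirectory created under the table directory, for example age=19. The partition column is a pseudo-column: it does not physically exist in the data files.
3. A table can have one or more partitions, and a partition can in turn contain one or more sub-partitions.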

How to partition?
It depends on the business requirements; year, month, day, and region are the usual choices.

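What partitioning is for
It lets users narrow the range of data scanned when computing statistics, because a select can specify a partition condition.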

Create a table with one partition level

create table if not exists part1(
uid int,
uname string,
uage int
)
PARTITIONED BY (country string)
row format delimited 
fields terminated by ','
;

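How to load data into a partitioned table: name the table (part1), then name the target partition with partition(country='china'), which becomes the partition directory.

The statement says which file to load, which table to load it into, and which partition it goes to.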

load data local inpath '/hivedata/stu1.txt' into table part1 partition(country='china');

load data local inpath '/hivedata/stu_japan.txt' into table part1 partition(country='japan');

load data local inpath '/hivedata/stu1.txt' into table part1 partition(COUNTRY='CHINA');

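Always query a partitioned table with a partition condition (partition values are case-sensitive).

The condition has the form partition column = partition value.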
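As a minimal sketch of such a condition, assuming the three loads into part1 above succeeded, each query below reads a single partition directory instead of the whole table:

```sql
-- Reads only the country=china directory of part1.
select uid, uname, uage, country
from part1
where country = 'china';

-- 'CHINA' is a different partition value from 'china',
-- so this reads the country=CHINA directory instead.
select count(*)
from part1
where country = 'CHINA';
```

The partition column can be selected and filtered like an ordinary column even though it is not stored in the data files.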

Create a two-level partitioned table:
create table if not exists part2(
uid int,
uname string,
uage int
)
PARTITIONED BY (year string,month string)
row format delimited
fields terminated by ','
;

Load data:
load data local inpath '/hivedata/stu1.txt' into table part2 partition(year='2019',month='07');
load data local inpath '/hivedata/stu_japan.txt' into table part2 partition(year='2019',month='08');
load data local inpath '/hivedata/stu1.txt' into table part2 partition(year='2018',month='07');

load data local inpath '/hivedata/stu1.txt' into table part2 partition(year='2019',month='09');
load data local inpath '/hivedata/stu1.txt' into table part2 partition(year='2019',month='06');

Create a three-level partitioned table:
create table if not exists part3(
uid int,
uname string,
uage int
)
PARTITIONED BY (year string,month string,day string)
row format delimited
fields terminated by ','
;

Load data:
load data local inpath '/hivedata/stu1.txt' into table part3 partition(year='2019',month='07',day='19');
load data local inpath '/hivedata/stu_japan.txt' into table part3 partition(year='2019',month='08',day='01');
load data local inpath '/hivedata/stu1.txt' into table part3 partition(year='2018',month='07',day='18');

Show partitions:
show partitions part3;
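show partitions also accepts a partial partition spec, which helps once a table has many partitions. A small sketch against part3 as loaded above:

```sql
-- List only the partitions under year=2019.
show partitions part3 partition(year='2019');

-- Print the metadata of one partition, including its HDFS location.
describe formatted part3 partition(year='2019', month='07', day='19');
```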

Modify partitions:
A partition column's name cannot be changed with a statement (it can only be changed by hand).

Add partitions:
alter table part2 add partition(year='2018',month='08');

alter table part2 add partition(year='2018',month='09') partition(year='2018',month='10') ;

Add a partition and point it at existing data:
alter table part2 add partition(year='2018',month='06') LOCATION '/user/hive/warehouse/gp1923.db/part2/year=2019/month=08';

Change the HDFS storage path of a partition (the path must be an absolute HDFS path):
ALTER TABLE part2 PARTITION(year='2018',month='06') SET LOCATION 'hdfs://gp1923/user/hive/warehouse/gp1923.db/part2/year=2018/month=06';

Drop a partition:
alter table part2 drop partition(year='2018',month='06');

Drop several partitions:
alter table part2 drop partition(year='2018',month='09'),partition(year='2018',month='10');

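Partition types:
Static partitioning: the partition value is specified when the data is loaded.
Dynamic partitioning: the partition values are not known in advance; the partitions to create are determined from the values of the partition columns in the data.
Mixed partitioning: static + dynamic.

Dynamic partition properties: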

<!-- Whether dynamic partitioning is enabled -->
<property>
    <name>hive.exec.dynamic.partition</name>
    <value>true</value>
    <description>Whether or not to allow dynamic partitions in DML/DDL.</description>
  </property>
<!-- Dynamic partition mode: strict or nonstrict -->
<property>
    <name>hive.exec.dynamic.partition.mode</name>
    <value>strict</value>
    <description>
      In strict mode, the user must specify at least one static partition
      in case the user accidentally overwrites all partitions.
      In nonstrict mode all partitions are allowed to be dynamic.
    </description>
  </property>
<!-- Maximum number of dynamic partitions one SQL statement may create -->
<property>
    <name>hive.exec.max.dynamic.partitions</name>
    <value>1000</value>
    <description>Maximum number of dynamic partitions allowed to be created in total.</description>
  </property>
<!-- Maximum number of dynamic partitions per node -->
  <property>
    <name>hive.exec.max.dynamic.partitions.pernode</name>
    <value>100</value>
    <description>Maximum number of dynamic partitions allowed to be created in each mapper/reducer node.</description>
  </property>

Testing dynamic partitioning:

create table if not exists part_tmp 
as 
select * from part2
where year = '2019'
;

-- Then write that data into a partitioned table

create table if not exists dy_part(
uid int,
uname string,
uage int
)
partitioned by (year string,month string)
row format delimited
fields terminated by ','
;

Dynamic partitions cannot be loaded with load data, so the following statement fails (month has no value):
load data local inpath '/hivedata/stu1.txt' into table dy_part partition(year='2019',month);

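Dynamic partitions are populated with insert ... select (the partition values come from a table).

set hive.exec.dynamic.partition.mode=nonstrict; -- allow every partition column to be dynamic
-- the partition values come from the source table: the last two selected columns feed year and month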
insert overwrite table dy_part partition(year,month)
select * from part_tmp
;
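To see what the dynamic insert actually created, the partitions and their row counts can be listed. A small sketch, assuming the insert above ran successfully; row_cnt is just an illustrative alias:

```sql
-- Partitions created by the dynamic insert, one line per year=/month= pair.
show partitions dy_part;

-- Row counts per partition; the same grouping on part_tmp should match.
select year, month, count(*) as row_cnt
from dy_part
group by year, month;
```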

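Mixed (make sure the number of selected columns matches):

-- strict mode requires at least one static partition
-- year is fixed to 2019 in the partition spec, so it is not selected; every other column, including month, must be selected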
insert overwrite table dy_part partition(year='2019',month)
select uid,uname,uage,month from part_tmp
;

Setting Hive's strict execution mode:

<property>
    <name>hive.mapred.mode</name>
    <value>nonstrict</value>
    <description>
      The mode in which the Hive operations are being performed. 
      In strict mode, some risky queries are not allowed to run. They include:
        Cartesian Product.
        No partition being picked up for a query.
        Comparing bigints and strings.
        Comparing bigints and doubles.
        Orderby without limit.
    </description>
  </property>

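Notes:
1. Hive partitions use a column outside the table's data; the partition column is a pseudo-column, but it can still be used to filter queries.
2. Chinese characters are not recommended in partition columns.
3. Dynamic partitioning is not really recommended, because it runs MapReduce jobs to query the data. Too many partitions turn the NameNode and YARN into performance bottlenecks, so try to know the number of partitions in advance before using dynamic partitioning. (A partition is a subdirectory under the table directory. The directories live on HDFS and the NameNode manages the directory tree, so every partition created is one more write for the NameNode, which keeps writing for as long as directories are being created.)
4. Any partition property can also be changed by hand, by editing the metastore and the data on HDFS.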
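When partition directories are created by hand on HDFS, as note 4 allows, the metastore does not learn about them on its own. One way to resynchronize, sketched here on part2, is msck repair table, which registers directories that follow the year=.../month=... naming. The directory named in the comment is purely illustrative.

```sql
-- Suppose a directory such as .../part2/year=2020/month=01 was created
-- by hand on HDFS (hypothetical path, not part of the notes above).
-- Register every partition directory the metastore does not know yet.
msck repair table part2;

-- Confirm that the new partition is now listed.
show partitions part2;
```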

-- Bucketing ---------------------------

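Why bucket?
The data in a single partition or table keeps growing. When partitioning can no longer divide the data any finer, bucketing is used to divide and manage it at a finer granularity.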

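The bucketing technique:
[CLUSTERED BY (column_name)
[SORTED BY (column_name [ASC|DESC])] INTO 4 BUCKETS]
Bucketing keyword: bucket

By default, a row's bucket is the remainder of the hash of its bucketing column divided by the total number of buckets.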
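The rule can be sketched with Hive's built-in hash and pmod functions. This assumes the part_tmp table created earlier and 4 buckets. For an int column hash() returns the value itself, and newer Hive versions may use a different hash internally for bucketed tables, so treat the result as an illustration of the rule rather than a guarantee of the file layout.

```sql
-- bucket_index is 0-based: rows with the same index would share a bucket.
select uid,
       hash(uid)          as hash_value,
       pmod(hash(uid), 4) as bucket_index
from part_tmp;
```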

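What bucketing is for
1. It preserves the bucketed structure of the result of a bucketing query (the data has already been hash-distributed on the bucketing column).
2. Typical uses: data sampling, and joins, where it can make MapReduce run more efficiently.

## Create a bucketed table (it can be bucketed by id, or by name)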

create table if not exists buc1(
uid int,
uname string,
uage int
)
clustered by (uid) into 4 buckets -- the bucketing column; this table gets 4 buckets
row format delimited 
fields terminated by ','
;

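Load data:
Do not load with load data: the file is placed in the table directory as it is and the data ends up with no bucketing at all.
load data local inpath '/hivedata/stu1.txt' into table buc1;

1. Turn on enforced bucketing first; the default is false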

<property>
    <name>hive.enforce.bucketing</name>
    <value>false</value>
    <description>Whether bucketing is enforced. If true, while inserting into the table, bucketing is enforced.</description>
  </property>
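The XML above only shows the default. For the current session the property can be switched on with set, as sketched below. This assumes Hive 1.x or earlier; to the best of my knowledge, from Hive 2.0 on bucketing is always enforced and the property no longer exists.

```sql
-- Enforce bucketing on insert for this session (Hive 1.x and earlier).
set hive.enforce.bucketing=true;

-- Print the current value to confirm the change.
set hive.enforce.bucketing;
```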

select uid,uname,uage 
from part_tmp
where year = '2019' and month = '06'
cluster by (uid)
;

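2. Set the number of reducers (if the number of reducers differs from the number of buckets, set it by hand):
set mapreduce.job.reduces=4; -- same as the number of buckets

A job can run in local mode only when it has a single reducer; with more than one reducer it cannot.

Bucketed data has to be loaded with insert ... select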

insert overwrite table buc1
select * from t4
cluster by (id)
;

Create a bucketed table with a sort column and sort order

create table if not exists buc2(
uid int,
uname string,
uage int
)
clustered by (uid)
sorted by (uid asc) into 4 buckets -- 4 buckets, sort order uid asc; this declaration alone does not sort the data
row format delimited
fields terminated by ','
;

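Rows are ordered within each bucket; there is no ordering between buckets.

Load data (the final order of the data is decided by the insert statement):

-- ascending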
insert into buc2
select * from t4
distribute by (id) sort by (id asc) 
;
-- descending
insert overwrite table buc2
select * from t4
distribute by (id) sort by (addr desc) 
;
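-- whether a column ends up ascending or descending is decided by the statement

Querying buckets:

-- out of works like a loop; the number after it is the loop length
-- query everything:
select * from buc2;
select * from buc2 tablesample(bucket 1 out of 1);
-- bucket 1 is the bucket to start from; out of 1 regroups the whole data set into a single bucket

-- query one bucket: split the data into 4 buckets and take the 2nd; bucket number = remainder + 1
select * from buc2 tablesample(bucket 2 out of 4);
-- the 2nd of 4 buckets

-- query the rows whose remainder is 1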
select * from buc2 tablesample(bucket 2 out of 2);


select * from buc2 tablesample(bucket 2 out of 2 on uid);
 
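-- the 2nd of 8 buckets
select * from buc2 tablesample(bucket 2 out of 8);
 
-- name the sampling column explicitly; uid can be used as the bucketing column
select * from buc2 tablesample(bucket 2 out of 7 on uid);

tablesample(bucket x out of y on uid)
x: the bucket to start from
y: the total number of buckets to sample over; y can be a multiple or a factor of the table's bucket count, and x cannot be greater than y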

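The pattern is as follows

1. No compression, no stretching (bucket 1 out of 4):

1   2   3   4
1   1+4 -- if not everything has been taken yet, continue from 1 again
for i = 1 to 4 -- loop over the 4 buckets, starting from 1
i=1   1 -- i is 1, take bucket 1
i=2   x -- i is 2, skip
i=3   x -- i is 3, skip
i=4   x -- i is 4, skip
y -- y is reached, so the loop ends here
i=5   1 -- a 5th step would land on position 1 again

2. Compression (bucket 1 out of 2):
1   1+4/2   1+4/2+4/2
1   2   3   4
1 out of 2 -- 2 is the loop length and 1 means take the 1st bucket of each pass; the loop finishes before the buckets run out, so it repeats
for i = 1 to 2
i=1   1 -- take bucket 1
i=2   x -- skip
i=1   3 -- buckets remain, so loop again and take bucket 3
i=2   x -- skip
i=1   5 -- there is no bucket 5, so stop

3. Stretching (bucket 2 out of 8):
1   2   3   4
2 out of 8
for i = 1 to 8 -- there are only 4 buckets but the loop runs 8 times; at i=5 the buckets are used up while the loop is not, so it goes back to bucket 1
i=1   1
i=2   2
i=3   3
i=4   4
i=5   1 -- really the original bucket 1
i=6   2 -- really the original bucket 2
So bucket 2 out of 8 returns only the part of the original bucket 2 that falls on i=2, not the part that falls on i=6.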
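One way to check these three cases is to compute the sample bucket by hand next to each row. The sketch below assumes uid is a non-negative int, for which hash(uid) is the value itself, and that tablesample(bucket x out of y on uid) keeps the rows where pmod(hash(uid), y) + 1 = x. The of_4, of_2 and of_8 aliases are only illustrative.

```sql
-- For every row, the bucket it falls into when sampling over 4, 2 and 8.
select uid,
       pmod(hash(uid), 4) + 1 as of_4,  -- no compression, no stretching
       pmod(hash(uid), 2) + 1 as of_2,  -- compression
       pmod(hash(uid), 8) + 1 as of_8   -- stretching
from buc2
order by uid;

-- Under the assumption above, this should return the same rows as
-- select * from buc2 tablesample(bucket 2 out of 8 on uid);
select *
from buc2
where pmod(hash(uid), 8) + 1 = 2;
```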

Requirement: find all students whose id is odd

-- that is, fetch the rows whose remainder is 1
select * from buc2 tablesample(bucket 2 out of 2 on uid)
;

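Queries:

-- limit 3 returns 3 rows
select * from part_tmp limit 3; -- take 3 rows directly
select * from part_tmp tablesample(3 rows); -- sample 3 rows
select * from part_tmp tablesample(50 percent); -- sample 50% of the data
select * from part_tmp tablesample(50B); -- sample by size in bytes; the units k, M and G are also accepted

Query any 3 rows at random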
select * from t4 order by rand() limit 3;

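Bucketing summary:

1. Definition
clustered by (uid) -- the bucketing column
sorted by (uid) -- the sort rule for the data; it states that the data is expected to be sorted this way

2. When loading data
cluster by (uid) -- the column getPartition hashes on; the same column is also the sort column, and the order is ascending by default
distribute by (uid) -- the column getPartition hashes on
sort by (uage desc) -- the sort column and the sort order

cluster by (uid)
has the same effect as
distribute by (uid)
sort by (uid asc)

distribute by (uid)
sort by (uage desc)
is the more flexible form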

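Bucketing within partitions (partition the data first, then bucket within each partition):

Partition by sex (1 = male, 2 = female), and within each partition bucket by whether id is odd or even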

1   gyy1   1
2   gyy2   2
3   gyy3   2
4   gyy4   1
5   gyy5   1
6   gyy6   1
7   gyy7   2
8   gyy8   1
9   gyy9   2
10   gyy10   2
11   gyy11   1

Steps 1 and 2: write the SQL

create table if not exists buc3(
id int,
name string
)
partitioned by(sex int)   -- partition first, by sex
clustered by(id) into 2 buckets -- then bucket, by id
row format delimited 
fields terminated by '\t'  -- tab-separated
;

-- the data cannot be loaded into buc3 directly, so create a staging table
create table if not exists buc_tmp(
id int,
name string,
sex int
)
row format delimited 
fields terminated by '\t'
;

3. Put the data file in a directory, then load it into the staging table
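Step 3 could look like the sketch below. The file path /hivedata/buc_sex.txt is a hypothetical name for a tab-separated file holding the 11 rows listed above; substitute the real location.

```sql
-- Load the raw file into the staging table.
-- /hivedata/buc_sex.txt is a hypothetical path used for illustration.
load data local inpath '/hivedata/buc_sex.txt' into table buc_tmp;

-- Check that all three columns were parsed (a wrong delimiter shows up as NULLs).
select * from buc_tmp limit 5;
```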

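4. Import the data from the staging table into the partitioned, bucketed table (the partitioning was already declared above; sex is filled in dynamically here, so nonstrict dynamic partition mode must be in effect)

set mapreduce.job.reduces=2; -- same as the number of buckets
insert overwrite table buc3 partition(sex)
select * from buc_tmp 
cluster by (id)  -- distribute and sort by the bucketing column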
;

5. Find the students who are female and whose student id is odd (take the odd ids by sampling, then filter by sex):

select *
from buc3 tablesample(bucket 2 out of 2 on id)
where sex=2
;
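As a cross-check that does not rely on the bucket layout, the same question can be asked with an ordinary filter. With the 11 sample rows above, the female students (sex = 2) with odd ids are 3, 7 and 9.

```sql
-- Plain filter: partition pruning on sex, then an arithmetic test on id.
select id, name, sex
from buc3
where sex = 2
  and id % 2 = 1
order by id;
-- expected ids for the sample data above: 3, 7, 9
```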

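Note:
Partitioning uses a column outside the table's data (a column that is not in the data files); bucketing uses a column inside the table's data (a column that is in the data files).
Bucketing manages data at a finer granularity and is mostly used for data sampling and JOIN operations.

---------------------
Review:
1. Use cases
2. Keywords
3. Loading data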
