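Predicate Pushdown
1. The concept of predicate pushdown
Predicate pushdown is originally a relational-database term. It names a basic technique for optimizing relational SQL queries: predicates in the WHERE clause of an outer query block are moved into the lower-level query blocks it contains (for example, views), so that data is filtered earlier and indexes can potentially be used more effectively.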
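To make the idea concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a relational database. The `orders` table and `order_totals` view are hypothetical names invented for this illustration. It only shows that writing the predicate in the outer block and moving it by hand into the lower-level block return the same rows; it does not show what any particular optimizer actually does.

```python
import sqlite3

# In-memory database; table and view names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount INTEGER);
    INSERT INTO orders VALUES (1, 'east', 10), (2, 'west', 20), (3, 'east', 30);
    CREATE VIEW order_totals AS
        SELECT id, region, amount * 2 AS doubled FROM orders;
""")

# Predicate written in the outer query block, against the view.
outer_filter = conn.execute(
    "SELECT id, doubled FROM order_totals WHERE region = 'east' ORDER BY id"
).fetchall()

# The same predicate moved by hand into the lower-level query block,
# so rows are filtered before the projection is computed.
pushed_down = conn.execute(
    "SELECT id, doubled FROM "
    "(SELECT id, region, amount * 2 AS doubled FROM orders "
    " WHERE region = 'east') AS t ORDER BY id"
).fetchall()

print(outer_filter == pushed_down)
```

Because the predicate only touches a column that passes through the view unchanged, both forms are equivalent, which is what allows an optimizer to choose the cheaper one.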
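2. Hive predicate pushdown
In Hive, the term "predicate pushdown" is borrowed from relational databases, even though for Hive it is effectively predicate push-up.
The basic idea is to process expressions as early in the plan as possible. By default, the generated plan adds a filter at the point where the predicate appears; pushdown moves that filter closer to the operator that first sees the data.
2.1 Definitions from the Hive documentation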
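- Preserved Row table
In an outer join, the table that must return all of its rows is called the Preserved Row table. In a left outer join the left table returns all rows, so the left table is the Preserved Row table; in a right outer join it is the right table; in a full outer join both tables return all rows, so both are Preserved Row tables.
- Null Supplying table
In an outer join, the table whose columns are filled with null for rows that find no match is called the Null Supplying table. In a left outer join every left-table row is returned, and for left-table rows with no match in the right table the right-table columns are null, so the right table is the Null Supplying table. Likewise, in a right outer join the left table is the Null Supplying table. In a full outer join both tables are Null Supplying tables, because unmatched rows on either side are padded with null.
- During Join predicate
A predicate in the JOIN ... ON clause. For example, in R1 join R2 on R1.x = 5, the predicate R1.x = 5 is a During Join predicate.
- After Join predicate
A predicate in the WHERE clause.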
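The two table roles are easy to see on a tiny data set. The sketch below again uses Python's sqlite3 module rather than Hive, and the tables `r1` and `r2` with a single column `x` are assumptions made for the illustration. In the left outer join, `r1` is the Preserved Row table and `r2` is the Null Supplying table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE r1 (x INTEGER);
    CREATE TABLE r2 (x INTEGER);
    INSERT INTO r1 VALUES (1), (2), (3);
    INSERT INTO r2 VALUES (1), (3);
""")

# r1 is preserved: all three of its rows come back.
# r2 supplies nulls: the row x = 2 has no match, so r2.x is NULL (None).
rows = conn.execute(
    "SELECT r1.x, r2.x FROM r1 LEFT OUTER JOIN r2 ON r1.x = r2.x "
    "ORDER BY r1.x"
).fetchall()

for left_x, right_x in rows:
    print(left_x, right_x)
```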
3. Testing
3.1 Creating the tables
create table test1(id int,openid string) PARTITIONED BY ( day string ) STORED AS ORC;
create table test2(id int,openid string) PARTITIONED BY ( day string ) STORED AS ORC;
insert into table test1 partition (day='20190521') values(1,'張三');
insert into table test1 partition (day='20190521') values(2,'李四');
insert into table test1 partition (day='20190521') values(3,'王五');
insert into table test1 partition (day='20190521') values(1,'錢七');
insert into table test2 partition (day='20190521') values(1,'張三');
insert into table test2 partition (day='20190521') values(3,'趙六');
3.2 Test queries
select
count(distinct case when b.openid is null then a.openid end) as n1,
count(distinct case when a.openid is null then b.openid end) as n2
from test2 a full join test1 b on a.openid = b.openid
where a.day = '20190521' and b.day = '20190521';
select
count(distinct case when b.openid is null then a.openid end) as n1,
count(distinct case when a.openid is null then b.openid end) as n2
from (
select * from test1 where day = '20190521'
) a full join (
select * from test2 where day = '20190521'
) b
on a.openid = b.openid
;
3.3 Case 1: left outer join
select t1.*,t2.* from test1 t1 left join test2 t2 on t1.id=t2.id and t1.openid='錢七'
and t2.openid='張三';
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
$hdt$_1:t2
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
$hdt$_1:t2
TableScan
alias: t2
filterExpr: (openid = '張三') (type: boolean)
Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
Filter Operator
predicate: (openid = '張三') (type: boolean)
Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator
expressions: id (type: int), '張三' (type: string), day (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 270 Basic stats: COMPLETE Column stats: PARTIAL
HashTable Sink Operator
filter predicates:
0 {(_col1 = '錢七')}
1
keys:
0 _col0 (type: int)
1 _col0 (type: int)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: t1
Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator
expressions: id (type: int), openid (type: string), day (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 4 Data size: 736 Basic stats: COMPLETE Column stats: PARTIAL
Map Join Operator
condition map:
Left Outer Join0 to 1
filter predicates:
0 {(_col1 = '錢七')}
1
keys:
0 _col0 (type: int)
1 _col0 (type: int)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 4 Data size: 2248 Basic stats: COMPLETE Column stats: PARTIAL
File Output Operator
compressed: false
Statistics: Num rows: 4 Data size: 2248 Basic stats: COMPLETE Column stats: PARTIAL
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Execution mode: vectorized
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
select t1.*,t2.* from test1 t1 left join test2 t2 on t1.id=t2.id where t1.openid='錢七'
and t2.openid='張三';
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
t2
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
t2
TableScan
alias: t2
Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
HashTable Sink Operator
keys:
0 id (type: int)
1 id (type: int)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: t1
filterExpr: (openid = '錢七') (type: boolean)
Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
Filter Operator
predicate: (openid = '錢七') (type: boolean)
Statistics: Num rows: 2 Data size: 368 Basic stats: COMPLETE Column stats: PARTIAL
Map Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 id (type: int)
1 id (type: int)
outputColumnNames: _col0, _col2, _col6, _col7, _col8
Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: PARTIAL
Filter Operator
predicate: (_col7 = '張三') (type: boolean)
Statistics: Num rows: 1 Data size: 368 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator
expressions: _col0 (type: int), '錢七' (type: string), _col2 (type: string), _col6 (type: int), '張三' (type: string), _col8 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 1 Data size: 540 Basic stats: COMPLETE Column stats: PARTIAL
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 540 Basic stats: COMPLETE Column stats: PARTIAL
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Execution mode: vectorized
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
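The two left join queries in this case are not interchangeable: they return different rows, which is why the optimizer may only move the predicates that leave the result unchanged. The sketch below replays both queries on the same data with Python's sqlite3 module. SQLite is only a stand-in here, the `day` partition column is left out as a simplification, and nothing in it reflects Hive's execution plans; it shows the result sets only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test1 (id INTEGER, openid TEXT);
    CREATE TABLE test2 (id INTEGER, openid TEXT);
    INSERT INTO test1 VALUES (1, '張三'), (2, '李四'), (3, '王五'), (1, '錢七');
    INSERT INTO test2 VALUES (1, '張三'), (3, '趙六');
""")

# Predicates in ON: every test1 row is preserved; rows that fail the
# ON condition simply get NULLs for the test2 columns.
on_rows = conn.execute("""
    SELECT t1.id, t1.openid, t2.id, t2.openid
    FROM test1 t1 LEFT JOIN test2 t2
      ON t1.id = t2.id AND t1.openid = '錢七' AND t2.openid = '張三'
""").fetchall()

# Predicates in WHERE: applied to the joined rows, so NULL-padded rows
# and rows with other openid values are discarded.
where_rows = conn.execute("""
    SELECT t1.id, t1.openid, t2.id, t2.openid
    FROM test1 t1 LEFT JOIN test2 t2 ON t1.id = t2.id
    WHERE t1.openid = '錢七' AND t2.openid = '張三'
""").fetchall()

print(len(on_rows), len(where_rows))
```

The row counts agree with the estimates in the two plans above: four output rows when the predicates sit in ON, one when they sit in WHERE.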
3.4 Case 2: full join
select t1.*,t2.* from test1 t1 full join test2 t2 on t1.id=t2.id and t1.openid='錢七'
and t2.openid='張三';
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: t1
Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
Reduce Output Operator
key expressions: id (type: int)
sort order: +
Map-reduce partition columns: id (type: int)
Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
value expressions: openid (type: string), day (type: string)
TableScan
alias: t2
Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
Reduce Output Operator
key expressions: id (type: int)
sort order: +
Map-reduce partition columns: id (type: int)
Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
value expressions: openid (type: string), day (type: string)
Reduce Operator Tree:
Join Operator
condition map:
Outer Join 0 to 1
filter predicates:
0 {(VALUE._col0 = '錢七')}
1 {(VALUE._col0 = '張三')}
keys:
0 id (type: int)
1 id (type: int)
outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
Statistics: Num rows: 6 Data size: 3456 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator
expressions: _col0 (type: int), _col1 (type: string), _col2 (type: string), _col6 (type: int), _col7 (type: string), _col8 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 6 Data size: 2208 Basic stats: COMPLETE Column stats: PARTIAL
File Output Operator
compressed: false
Statistics: Num rows: 6 Data size: 2208 Basic stats: COMPLETE Column stats: PARTIAL
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
select t1.*,t2.* from test1 t1 full join test2 t2 on t1.id=t2.id where t1.openid='錢七'
and t2.openid='張三';
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: t1
Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
Reduce Output Operator
key expressions: id (type: int)
sort order: +
Map-reduce partition columns: id (type: int)
Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
value expressions: openid (type: string), day (type: string)
TableScan
alias: t2
Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
Reduce Output Operator
key expressions: id (type: int)
sort order: +
Map-reduce partition columns: id (type: int)
Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
value expressions: openid (type: string), day (type: string)
Reduce Operator Tree:
Join Operator
condition map:
Outer Join 0 to 1
keys:
0 id (type: int)
1 id (type: int)
outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
Statistics: Num rows: 6 Data size: 2280 Basic stats: COMPLETE Column stats: PARTIAL
Filter Operator
predicate: ((_col1 = '錢七') and (_col7 = '張三')) (type: boolean)
Statistics: Num rows: 1 Data size: 368 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator
expressions: _col0 (type: int), '錢七' (type: string), _col2 (type: string), _col6 (type: int), '張三' (type: string), _col8 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 1 Data size: 540 Basic stats: COMPLETE Column stats: PARTIAL
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 540 Basic stats: COMPLETE Column stats: PARTIAL
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
4. Summary
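Predicate pushdown rules, as observed in the execution plans of section 3:
- left outer join, predicates in ON: the predicate on the right table (t2) is pushed down to the TableScan; the predicate on the left table (t1) stays inside the join as a filter predicate.
- left outer join, predicates in WHERE: the predicate on the left table (t1) is pushed down to the TableScan; the predicate on the right table (t2) is applied by a Filter Operator after the join.
- full join, predicates in ON: neither predicate is pushed down; both stay inside the join as filter predicates.
- full join, predicates in WHERE: neither predicate is pushed down; both are applied by a Filter Operator after the join.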
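The Hive OuterJoinBehavior page [2] states the general principle as two rules: During Join predicates cannot be pushed past Preserved Row tables, and After Join predicates cannot be pushed past Null Supplying tables. The following is a minimal Python sketch of those two rules. The function name `can_push_down` and its string arguments are hypothetical and are not part of Hive; the sketch covers only this rule-based behaviour and ignores anything the CBO may change.

```python
# Which side(s) of the join play each role, per join type.
PRESERVED = {
    "inner": set(),
    "left outer": {"left"},
    "right outer": {"right"},
    "full outer": {"left", "right"},
}
NULL_SUPPLYING = {
    "inner": set(),
    "left outer": {"right"},
    "right outer": {"left"},
    "full outer": {"left", "right"},
}


def can_push_down(join_type, table_side, predicate_kind):
    """Return True if a single-table predicate may be pushed to the scan.

    join_type: "inner", "left outer", "right outer" or "full outer"
    table_side: "left" or "right", the table the predicate refers to
    predicate_kind: "during join" (ON clause) or "after join" (WHERE clause)
    """
    if predicate_kind == "during join":
        # Rule 1: ON predicates cannot be pushed past Preserved Row tables.
        return table_side not in PRESERVED[join_type]
    if predicate_kind == "after join":
        # Rule 2: WHERE predicates cannot be pushed past Null Supplying tables.
        return table_side not in NULL_SUPPLYING[join_type]
    raise ValueError("unknown predicate kind: " + predicate_kind)


for kind in ("during join", "after join"):
    for side in ("left", "right"):
        print("left outer", side, kind, can_push_down("left outer", side, kind))
```

For the left outer join this reproduces the four outcomes seen in case 1, and for the full outer join it returns False in all four combinations, as in case 2.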
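5. Open question
These rules are no longer fully accurate from Hive 2.x onward. Hive 2.x improved the CBO, and the CBO also affects the predicate pushdown rules.
In Hive 2.1.1, therefore, two things mainly influence predicate pushdown:
- Optimizations at the level of Hive's logical execution plan
- The CBO (Cost Based Optimizer)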
6. References
[1] https://blog.csdn.net/strongyoung88/article/details/81156271
[2] https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior
[3] https://blog.csdn.net/baichoufei90/article/details/85264100