Some Thoughts on Predicate Pushdown

Predicate Pushdown

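1. The concept of predicate pushdown

Predicate pushdown is originally a term from relational databases. A basic technique for optimizing relational SQL queries is to move predicates from the WHERE clause of an outer query block into the lower-level query blocks it contains (for example, views), so that data is filtered earlier and indexes can potentially be used more effectively.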

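2. Hive predicate pushdown

In Hive, the term predicate pushdown is borrowed from relational databases, even though for Hive it is closer to predicate pushup.

The basic idea of predicate pushdown is to process expressions as early in the plan as possible. By default, plan generation adds filters where they appear in the query; pushdown moves them closer to the operator that first sees the data.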
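
As a rough illustration of why filtering early helps, the sketch below joins two small in-memory lists twice: once filtering after the join, once filtering before it. The rows, the nested-loop join, and the predicate are all hypothetical and are not how Hive executes a join; the point is only that the result is the same for an inner join while fewer rows take part in it.

```python
# Hypothetical in-memory rows: (id, openid). Not taken from any real table.
left = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
right = [(1, "x"), (2, "y"), (3, "z")]


def inner_join(left_rows, right_rows):
    """Nested-loop inner join on id; also returns how many pairs were compared."""
    out, compared = [], 0
    for l in left_rows:
        for r in right_rows:
            compared += 1
            if l[0] == r[0]:
                out.append(l + r)
    return out, compared


def wanted(row):
    """The predicate: keep only rows whose left-side openid is 'b'."""
    return row[1] == "b"


# Filter after the join: every left row takes part in the join.
late_rows, late_cost = inner_join(left, right)
late_rows = [row for row in late_rows if wanted(row)]

# Filter pushed below the join: only matching left rows are joined.
early_rows, early_cost = inner_join([l for l in left if wanted(l)], right)

print(late_rows == early_rows)  # same result for an inner join
print(late_cost, early_cost)    # pairs compared in each variant
```

For outer joins the two variants are not always interchangeable, which is what the rest of this post is about.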

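2.1 Definitions from the Hive documentation

Preserved Row table

In an outer join, the table that must return all of its rows is called the Preserved Row table. In a left outer join the left table returns all rows, so the left table is the Preserved Row table; in a right outer join it is the right table; in a full outer join both tables return all rows, so both are Preserved Row tables.

Null Supplying table

In an outer join, the table whose columns are filled with nulls for rows that find no match is called the Null Supplying table. In a left outer join all rows of the left table are returned, and for left rows with no match in the right table the right table's columns are null, so the right table is the Null Supplying table. Correspondingly, in a right outer join the left table is the Null Supplying table. In a full outer join both tables are Null Supplying tables, because each side is filled with nulls for rows it cannot match.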
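
The two definitions can be written down as a small lookup. This is only a sketch; the function name table_roles and the string labels are made up for illustration and do not come from Hive.

```python
def table_roles(join_type):
    """Return (preserved, null_supplying) sides for an outer join type.

    Each element is the set of sides ("left" / "right") that play that role.
    """
    roles = {
        "left outer": ({"left"}, {"right"}),
        "right outer": ({"right"}, {"left"}),
        # In a full outer join both sides play both roles.
        "full outer": ({"left", "right"}, {"left", "right"}),
    }
    return roles[join_type]


preserved, null_supplying = table_roles("left outer")
print(preserved, null_supplying)
```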

During Join predicate

A During Join predicate is a predicate in the JOIN ... ON clause. For example, in R1 join R2 on R1.x = 5, the predicate R1.x = 5 is a During Join predicate.

After Join predicate

A predicate in the WHERE clause is called an After Join predicate.

3. Tests

3.1 Creating the tables

create table test1(id int,openid string) PARTITIONED BY ( day string ) STORED AS ORC;
create table test2(id int,openid string)  PARTITIONED BY ( day string ) STORED AS ORC;

insert into table test1 partition (day='20190521') values(1,'張三');
insert into table test1 partition (day='20190521') values(2,'李四'); 
insert into table test1 partition (day='20190521') values(3,'王五');
insert into table test1 partition (day='20190521') values(1,'錢七');


insert into table test2 partition (day='20190521') values(1,'張三');
insert into table test2 partition (day='20190521') values(3,'趙六');

3.2 Test queries

The two queries below filter on the partition column day in different places: the first in the WHERE clause after the full join, the second inside subqueries before the join. Note that the aliases a and b refer to opposite tables in the two queries.

select 
    count(distinct case when b.openid is null then a.openid end) as n1,
    count(distinct case when a.openid is null then b.openid end) as n2 
from test2 a full join test1 b on a.openid = b.openid 
where a.day = '20190521' and b.day = '20190521'
;

select 
    count(distinct case when b.openid is null then a.openid end) as n1,
    count(distinct case when a.openid is null then b.openid end) as n2 
from (
    select * from test1 where day = '20190521'
) a full join (
    select * from test2 where day = '20190521'
) b 
on a.openid = b.openid 
;

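3.3 Case 1: left outer join

Both predicates in the ON clause (During Join predicates). In the EXPLAIN output that follows, the t2 predicate is pushed down to the t2 TableScan (filterExpr), while the t1 predicate stays in the join operator's filter predicates.

select t1.*,t2.* from test1 t1 left join test2 t2 on t1.id=t2.id and t1.openid='錢七' 
and t2.openid='張三';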

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        $hdt$_1:t2 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $hdt$_1:t2 
          TableScan
            alias: t2
            filterExpr: (openid = '張三') (type: boolean)
            Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
            Filter Operator
              predicate: (openid = '張三') (type: boolean)
              Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: PARTIAL
              Select Operator
                expressions: id (type: int), '張三' (type: string), day (type: string)
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 270 Basic stats: COMPLETE Column stats: PARTIAL
                HashTable Sink Operator
                  filter predicates:
                    0 {(_col1 = '錢七')}
                    1 
                  keys:
                    0 _col0 (type: int)
                    1 _col0 (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
            Select Operator
              expressions: id (type: int), openid (type: string), day (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 4 Data size: 736 Basic stats: COMPLETE Column stats: PARTIAL
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                filter predicates:
                  0 {(_col1 = '錢七')}
                  1 
                keys:
                  0 _col0 (type: int)
                  1 _col0 (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                Statistics: Num rows: 4 Data size: 2248 Basic stats: COMPLETE Column stats: PARTIAL
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 4 Data size: 2248 Basic stats: COMPLETE Column stats: PARTIAL
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Execution mode: vectorized
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

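Both predicates in the WHERE clause (After Join predicates). In the EXPLAIN output that follows, the t1 predicate is pushed down to the t1 TableScan (filterExpr), while the t2 predicate is applied by a Filter Operator after the join.

select t1.*,t2.* from test1 t1 left join test2 t2 on t1.id=t2.id where t1.openid='錢七' 
and t2.openid='張三';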

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        t2 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        t2 
          TableScan
            alias: t2
            Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
            HashTable Sink Operator
              keys:
                0 id (type: int)
                1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            filterExpr: (openid = '錢七') (type: boolean)
            Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
            Filter Operator
              predicate: (openid = '錢七') (type: boolean)
              Statistics: Num rows: 2 Data size: 368 Basic stats: COMPLETE Column stats: PARTIAL
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col2, _col6, _col7, _col8
                Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: PARTIAL
                Filter Operator
                  predicate: (_col7 = '張三') (type: boolean)
                  Statistics: Num rows: 1 Data size: 368 Basic stats: COMPLETE Column stats: PARTIAL
                  Select Operator
                    expressions: _col0 (type: int), '錢七' (type: string), _col2 (type: string), _col6 (type: int), '張三' (type: string), _col8 (type: string)
                    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                    Statistics: Num rows: 1 Data size: 540 Basic stats: COMPLETE Column stats: PARTIAL
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 1 Data size: 540 Basic stats: COMPLETE Column stats: PARTIAL
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Execution mode: vectorized
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
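
The two left join queries above are not equivalent, which is why the predicates cannot all be moved freely. The sketch below replays them in SQLite through Python's sqlite3 module. SQLite stands in for Hive here purely as an assumption of convenience: it has no partitions, so day is an ordinary column, and only the join semantics are being checked, not Hive's plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test1 (id INTEGER, openid TEXT, day TEXT);
    CREATE TABLE test2 (id INTEGER, openid TEXT, day TEXT);
    INSERT INTO test1 VALUES (1,'張三','20190521'), (2,'李四','20190521'),
                             (3,'王五','20190521'), (1,'錢七','20190521');
    INSERT INTO test2 VALUES (1,'張三','20190521'), (3,'趙六','20190521');
""")

# Predicates in ON: every test1 row is kept; unmatched rows get NULLs for t2.
on_rows = conn.execute("""
    SELECT t1.id, t1.openid, t2.id, t2.openid
    FROM test1 t1 LEFT JOIN test2 t2
      ON t1.id = t2.id AND t1.openid = '錢七' AND t2.openid = '張三'
""").fetchall()

# Predicates in WHERE: rows are filtered after the join, NULL rows included.
where_rows = conn.execute("""
    SELECT t1.id, t1.openid, t2.id, t2.openid
    FROM test1 t1 LEFT JOIN test2 t2 ON t1.id = t2.id
    WHERE t1.openid = '錢七' AND t2.openid = '張三'
""").fetchall()

print(len(on_rows), len(where_rows))
conn.close()
```

With the ON version all four test1 rows come back, three of them padded with NULLs; with the WHERE version only the single matching row survives.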

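3.4 Case 2: full outer join

Both predicates in the ON clause (During Join predicates). In the EXPLAIN output that follows, neither predicate is pushed down; both appear in the Join Operator's filter predicates.

select t1.*,t2.* from test1 t1 full join test2 t2 on t1.id=t2.id and t1.openid='錢七' 
and t2.openid='張三';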

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: openid (type: string), day (type: string)
          TableScan
            alias: t2
            Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: openid (type: string), day (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          filter predicates:
            0 {(VALUE._col0 = '錢七')}
            1 {(VALUE._col0 = '張三')}
          keys:
            0 id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
          Statistics: Num rows: 6 Data size: 3456 Basic stats: COMPLETE Column stats: PARTIAL
          Select Operator
            expressions: _col0 (type: int), _col1 (type: string), _col2 (type: string), _col6 (type: int), _col7 (type: string), _col8 (type: string)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
            Statistics: Num rows: 6 Data size: 2208 Basic stats: COMPLETE Column stats: PARTIAL
            File Output Operator
              compressed: false
              Statistics: Num rows: 6 Data size: 2208 Basic stats: COMPLETE Column stats: PARTIAL
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

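Both predicates in the WHERE clause (After Join predicates). In the EXPLAIN output that follows, neither predicate is pushed down; both are applied by a Filter Operator after the join.

select t1.*,t2.* from test1 t1 full join test2 t2 on t1.id=t2.id where t1.openid='錢七' 
and t2.openid='張三';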

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 4 Data size: 1112 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: openid (type: string), day (type: string)
          TableScan
            alias: t2
            Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 2 Data size: 556 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: openid (type: string), day (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          keys:
            0 id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
          Statistics: Num rows: 6 Data size: 2280 Basic stats: COMPLETE Column stats: PARTIAL
          Filter Operator
            predicate: ((_col1 = '錢七') and (_col7 = '張三')) (type: boolean)
            Statistics: Num rows: 1 Data size: 368 Basic stats: COMPLETE Column stats: PARTIAL
            Select Operator
              expressions: _col0 (type: int), '錢七' (type: string), _col2 (type: string), _col6 (type: int), '張三' (type: string), _col8 (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
              Statistics: Num rows: 1 Data size: 540 Basic stats: COMPLETE Column stats: PARTIAL
              File Output Operator
                compressed: false
                Statistics: Num rows: 1 Data size: 540 Basic stats: COMPLETE Column stats: PARTIAL
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
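
To see why the ON predicates of a full outer join cannot simply be moved below the join, the sketch below runs the test data through a hand-written nested-loop full outer join in plain Python. The helper full_outer_join is hypothetical and has nothing to do with Hive's implementation; NULL is modelled as None, and a WHERE comparison against NULL is treated as not satisfied.

```python
test1 = [(1, "張三"), (2, "李四"), (3, "王五"), (1, "錢七")]
test2 = [(1, "張三"), (3, "趙六")]


def full_outer_join(left, right, on):
    """Nested-loop full outer join; `on` is the complete ON condition."""
    rows, matched_right = [], set()
    for l in left:
        hit = False
        for j, r in enumerate(right):
            if on(l, r):
                rows.append((l, r))
                matched_right.add(j)
                hit = True
        if not hit:
            rows.append((l, None))   # left row kept, right side NULL
    for j, r in enumerate(right):
        if j not in matched_right:
            rows.append((None, r))   # right row kept, left side NULL
    return rows


# Predicates evaluated as part of the ON condition.
on_rows = full_outer_join(
    test1, test2,
    lambda l, r: l[0] == r[0] and l[1] == "錢七" and r[1] == "張三",
)

# Same predicates applied to the inputs first (what a pushdown would do).
pushed_rows = full_outer_join(
    [l for l in test1 if l[1] == "錢七"],
    [r for r in test2 if r[1] == "張三"],
    lambda l, r: l[0] == r[0],
)

# Predicates in WHERE: join on id only, then filter the joined rows.
where_rows = [
    (l, r)
    for l, r in full_outer_join(test1, test2, lambda l, r: l[0] == r[0])
    if l is not None and r is not None and l[1] == "錢七" and r[1] == "張三"
]

print(len(on_rows), len(pushed_rows), len(where_rows))
```

The ON version keeps every row of both tables, so filtering the inputs beforehand would drop rows that the query is required to return.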


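4. Summary

Predicate pushdown rules (as stated in the Hive OuterJoinBehavior documentation [2], and consistent with the plans above):

1. During Join predicates cannot be pushed past Preserved Row tables.
2. After Join predicates cannot be pushed past Null Supplying tables.

Observed in the tests:

left outer join, predicates in ON: pushed down for t2 (Null Supplying), not for t1 (Preserved Row).
left outer join, predicates in WHERE: pushed down for t1 (Preserved Row), not for t2 (Null Supplying).
full outer join: neither predicate is pushed down, whether in ON or in WHERE.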
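
As a rough sketch, the rules documented in [2] reduce to a small lookup: a During Join predicate is blocked by the Preserved Row role, and an After Join predicate is blocked by the Null Supplying role. The function name can_push_down and the string labels are made up for illustration, and the sketch ignores whatever additional rewrites a particular Hive version may apply.

```python
# Role that blocks pushdown for each kind of predicate.
BLOCKING_ROLE = {
    "during_join": "preserved",       # predicate in the ON clause
    "after_join": "null_supplying",   # predicate in the WHERE clause
}


def can_push_down(predicate_kind, table_roles):
    """True if a predicate on a table with the given set of roles may be pushed."""
    return BLOCKING_ROLE[predicate_kind] not in table_roles


# left outer join: t1 is preserved, t2 is null supplying.
t1_left, t2_left = {"preserved"}, {"null_supplying"}
# full outer join: both tables play both roles.
both = {"preserved", "null_supplying"}

print(can_push_down("during_join", t2_left))  # ON predicate on t2
print(can_push_down("after_join", t1_left))   # WHERE predicate on t1
print(can_push_down("during_join", both))     # any predicate in a full join
```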

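5. Open questions

These rules are no longer entirely accurate from Hive 2.x onward. Hive 2.x improved the CBO, and the CBO also affects the predicate pushdown rules.

So in Hive 2.1.1, two things mainly affect the predicate pushdown rules:

Optimizations at the level of Hive's logical execution plan
CBO (Cost Based Optimizer)

6. References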

[1] https://blog.csdn.net/strongyoung88/article/details/81156271

[2] https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior

[3] https://blog.csdn.net/baichoufei90/article/details/85264100
