PostgreSQL operations case study: pitfalls encountered with pg_pathman range partitioning

I. Background

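Recently, in a test environment, a postgres process repeatedly exhausted the host's memory and triggered the OOM killer. The host even rebooted several times, interrupting some services. According to the OOM records in messages, the process's anon memory usage reached tens of GB.
The process appeared to be running nothing more than a simple select, as follows: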

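To respect information-security rules, the table names and other details in the SQL demos in this article come from a personal computer and are unrelated to Ping An business.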

select * from qsump_pacloud_oscginfo_activity_detail_info_day where id ='zbngjih5xd' and add_time >=to_timestamp( '2020-01-09 00:00:00','yyyy-mm-dd hh24-mi-ss') and add_time < to_timestamp('2020-08-19 00:00:00','yyyy-mm-dd hh24-mi-ss');

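Printing the execution plan for this SQL showed that it scanned 20000+ partitions. The table uses pg_pathman range partitioning. This raised two questions:
(1) Why were only the matching partitions not selected, so that every partition was scanned instead?
(2) The partition range is 7 days and the table holds data from 2018 to the present, so why are there 20000+ partitions?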

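For question 1, I had run into something similar before. It is known that if the filter condition on the partition key involves a function expression, or a type-conversion function such as to_date() or to_timestamp(), the matching partitions are not selected and every partition is scanned. Casts of the form ::date or ::timestamp, however, are supported.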

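Addendum:
It later turned out that this only holds for the psql, pgAdmin and Navicat clients. With the JDBC driver, the ::timestamp form is sometimes also parsed into a T_FuncExpr node, so the partition-selection logic is skipped. In Java code it is advisable to bind a timestamp-typed value directly and drop the cast.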

An example:

With the ::timestamp form, the execution plan scans only the two partitions inside the query range
postgres=# explain select * from qsump_pacloud_oscginfo_activity_detail_info_day where add_time >= '2020-01-09 00:00:00'::timestamp and add_time < '2020-01-19 00:00:00'::timestamp;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..71.00 rows=1360 width=12)
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_2  (cost=0.00..35.50 rows=680 width=12)
         Filter: (add_time >= '2020-01-09 00:00:00'::timestamp without time zone)
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_3  (cost=0.00..35.50 rows=680 width=12)
         Filter: (add_time < '2020-01-19 00:00:00'::timestamp without time zone)
(5 rows)

With the to_timestamp() form, the execution plan scans all 11 partitions
postgres=# explain select * from qsump_pacloud_oscginfo_activity_detail_info_day where add_time >=to_timestamp( '2020-01-09 00:00:00','yyyy-mm-dd hh24-mi-ss') and add_time < to_timestamp('2020-01-19 00:00:00','yyyy-mm-dd hh24-mi-ss');
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..558.80 rows=110 width=12)
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_1  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_2  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_3  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_4  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_5  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_6  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_7  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_8  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_9  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_10  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_11  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
(23 rows)

postgres=#

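Changing to_timestamp() to the ::timestamp form therefore worked around the problem for the time being. I had never looked further into the cause before, so the rest of this case study mainly analyses the root cause of question 1.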

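Now for question 2: why are there 20000+ partitions? Looking up the physical files through the table's relfilenode showed that they had all been created at one moment the previous afternoon. The log for that time shows a single row inserted into the table whose partition-key value was a timestamp far in the future. Range auto-extension happened to be enabled: when an inserted row falls outside the range of all existing partitions, a new partition is created automatically, together with every missing partition in between. That one insert produced the 20000+ partitions. A disaster caused by a single SQL statement.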
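
The arithmetic behind that blow-up can be sketched in a few lines of C. This is only an illustration under stated assumptions: `partitions_to_append` is a hypothetical helper, not a pg_pathman function, and it assumes fixed-width, contiguous partitions with an exclusive upper bound.

```c
#include <assert.h>
#include <stdint.h>

/* Microseconds in one day (PostgreSQL timestamps are microsecond counts). */
#define USECS_PER_DAY INT64_C(86400000000)

/*
 * Hypothetical helper (not part of pg_pathman): how many partitions of
 * width `interval` must be appended after `last_max` (exclusive upper bound
 * of the last existing partition) so that `value` lands inside one of them.
 * Returns 0 when the value is already below last_max.
 */
static int64_t
partitions_to_append(int64_t last_max, int64_t value, int64_t interval)
{
    assert(interval > 0);
    if (value < last_max)
        return 0;
    /* value in [last_max + k*interval, last_max + (k+1)*interval) needs k+1 */
    return (value - last_max) / interval + 1;
}
```

With a 7-day interval, 20000 partitions span 140000 days, or roughly 383 years, so a single partition-key value a few centuries in the future is enough to produce a partition count of this order.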

II. Partition analysis

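So why does the behaviour in question 1 occur? This is where open source shows its advantages: you can look for the answer in the source code yourself.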

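As is well known, pg_pathman uses hooks to modify the original query tree and plan tree. The postgresql source already provides entry points for hook-based extensions of this kind. At startup, process_shared_preload_libraries() locates the library for each configured extension name and runs its _PG_init() function, which performs initialisation and installs the extension's hook functions. When execution reaches a hook entry point, the installed function is simply called.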
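
The hook pattern described above can be modelled in plain C with a global function pointer. The following is a toy model, not PostgreSQL code: the names echo PostgreSQL's `planner_hook` and `standard_planner`, but the signatures are simplified to integers and `toy_pg_init` is a made-up stand-in for an extension's `_PG_init()`.

```c
#include <assert.h>
#include <stddef.h>

/* Toy hook type: takes a query id and returns a plan id. */
typedef int (*planner_hook_type)(int query);

/* The "core" exposes a global hook pointer; NULL means no extension loaded. */
static planner_hook_type planner_hook = NULL;

/* Default planning when no hook is installed. */
static int standard_planner(int query) { return query; }

/* Core entry point: call the hook if one is installed, else the default. */
static int planner(int query)
{
    return planner_hook ? planner_hook(query) : standard_planner(query);
}

/* --- extension side --- */
static planner_hook_type prev_planner_hook = NULL;

static int my_planner_hook(int query)
{
    /* Chain to whoever was installed before us, then modify the result. */
    int plan = prev_planner_hook ? prev_planner_hook(query)
                                 : standard_planner(query);
    return plan + 1000;     /* pretend the plan tree was rewritten */
}

/* Toy stand-in for _PG_init(): remember the old hook, install ours. */
static void toy_pg_init(void)
{
    assert(planner_hook != my_planner_hook);   /* guard against double install */
    prev_planner_hook = planner_hook;
    planner_hook = my_planner_hook;
}
```

Saving the previous pointer before overwriting it is what lets several extensions share one hook: each calls the one that was installed before it.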

How does pg_pathman determine which range partitions are needed? Through the following function:

/* Given 'value' and 'ranges', return selected partitions list */
void
select_range_partitions(const Datum value,
						const Oid collid,
						FmgrInfo *cmp_func,
						const RangeEntry *ranges,
						const int nranges,
						const int strategy,
						WrapperNode *result) /* returned partitions */
{
	/* The function body is long and omitted here; the gdb trace later outlines its logic */
}	

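I skimmed the source code. pg_pathman is not a large code base, but for someone with weak coding skills like me its logic could not be untangled quickly, and I did not know where to start. So I chose the clumsiest approach, which is also the one that works best for me: tracing with gdb.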

III. GDB tracing

1. Preparation

session1: runs the SQL
session2: traces and debugs

The tables' oids are as follows:

  oid  |                     relname
-------+-------------------------------------------------
 16781 | qsump_pacloud_oscginfo_activity_detail_info_day -- parent table, 11 partitions in total
 16863 | qsump_pacloud_oscginfo_activity_detail_info_day_2
 16869 | qsump_pacloud_oscginfo_activity_detail_info_day_3
 ....
 16917 | qsump_pacloud_oscginfo_activity_detail_info_day_11

2. Debugging the ::timestamp form of the statement

session1:

postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
          31698
postgres=# explain select * from qsump_pacloud_oscginfo_activity_detail_info_day where add_time >= '2020-01-09 00:00:00'::timestamp and add_time < '2020-01-19 00:00:00'::timestamp;

|

session2:
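To observe the whole process, breakpoints were set on almost all of pathman's hook functions and on some key functions that build the plan tree. Only the trace of the range-partition selection logic is shown here.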

In the debug output, annotations are marked as: ##annotation##

[postgres@postgres_zabbix ~]$ gdb --pid 31698
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 31698
Reading symbols from /home/postgres/postgresql-9.6.6/pg9debug/bin/postgres...done.
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /home/postgres/postgresql-9.6.6/pg9debug/lib/pg_pathman.so...done.
Loaded symbols for /home/postgres/postgresql-9.6.6/pg9debug/lib/pg_pathman.so
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
0x00007fef912c15e3 in __epoll_wait_nocancel () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.6.x86_64
(gdb) b exec_simple_query   
Breakpoint 1 at 0x7b9bc2: file postgres.c, line 867.
(gdb) b pg_plan_queries   
Breakpoint 2 at 0x7b9b25: file postgres.c, line 834.
(gdb) b pathman_rel_pathlist_hook    
Breakpoint 3 at 0x7fef8a86594c: file src/hooks.c, line 263.
(gdb) b pathman_join_pathlist_hook
Breakpoint 4 at 0x7fef8a8652d7: file src/hooks.c, line 79.
(gdb) b pathman_shmem_startup_hook
Breakpoint 5 at 0x7fef8a866772: file src/hooks.c, line 687.
(gdb) b pathman_post_parse_analysis_hook
Breakpoint 6 at 0x7fef8a86650b: file src/hooks.c, line 587.
(gdb) b pathman_planner_hook
Breakpoint 7 at 0x7fef8a8662d3: file src/hooks.c, line 524.
(gdb) b pathman_process_utility_hook
Breakpoint 8 at 0x7fef8a86696b: file src/hooks.c, line 795.
(gdb) b pg_plan_query
Breakpoint 9 at 0x7b9a8e: file postgres.c, line 778.
(gdb) b planner
Breakpoint 10 at 0x6fc596: file planner.c, line 175.
(gdb) b add_partition_filters
Breakpoint 11 at 0x7fef8a86a713: file src/planner_tree_modification.c, line 378.
(gdb) b partition_filter_visitor
Breakpoint 12 at 0x7fef8a86a74a: file src/planner_tree_modification.c, line 390.
(gdb) b get_pathman_relation_info
Breakpoint 13 at 0x7fef8a851c05: file src/relation_info.c, line 361.
(gdb) b cache_parent_of_partition
Breakpoint 14 at 0x7fef8a852e6d: file src/relation_info.c, line 1015.
(gdb) b handle_modification_query
Breakpoint 15 at 0x7fef8a86a3bf: file src/planner_tree_modification.c, line 255.
(gdb) b select_range_partitions   
Breakpoint 16 at 0x7fef8a85940d: file src/pg_pathman.c, line 531.
(gdb) b walk_expr_tree           
Breakpoint 17 at 0x7fef8a85992d: file src/pg_pathman.c, line 717.
(gdb) b handle_opexpr               
Breakpoint 18 at 0x7fef8a85ab0e: file src/pg_pathman.c, line 1317.
(gdb) b IsConstValue
Breakpoint 19 at 0x7fef8a858811: file src/pg_pathman.c, line 114.
(gdb) n
(gdb) set print pretty                       

The key function that calls select_range_partitions

(gdb) list                      ##print the body of handle_opexpr##
1328                            int                             strategy;
1329
1330                            tce = lookup_type_cache(prel->ev_type, TYPECACHE_BTREE_OPFAMILY);
1331                            strategy = get_op_opfamily_strategy(expr->opno, tce->btree_opf);
1332                            ##when IsConstValue returns true, handle_const is called##
1333                            if (IsConstValue(param, context))
1334                            {
1335                                    handle_const(ExtractConst(param, context),
1336                                                             expr->inputcollid,
1337                                                             strategy, context, result);
(gdb) n
Breakpoint 19, IsConstValue (node=0x20682f0, context=0x7ffe113be790) at src/pg_pathman.c:114   ##hit breakpoint 19: check whether the operand of the >= / < condition in our SQL is a T_Const##
114             switch (nodeTag(node))
(gdb) p *node   ##the expr node passed in has node type T_Const##
$4 = {
  type = T_Const
}
(gdb) list     ##print the function body##
109
110     /* Can we transform this node into a Const? */
111     static bool
112     IsConstValue(Node *node, const WalkerContext *context)
113     {
114             switch (nodeTag(node))
115             {       ##returns true when the type is T_Const##
116                     case T_Const:   
117                             return true;
118


(gdb) list
119                     case T_Param:
120                             return WcxtHasExprContext(context);
121
122                     case T_RowExpr:
123                             {
124                                     RowExpr    *row = (RowExpr *) node;
125                                     ListCell   *lc;
126
127                                     /* Can't do anything about RECORD of wrong type */
128                                     if (row->row_typeid != context->prel->ev_type)
(gdb) n
117                             return true;
(gdb)
141     }
(gdb)   ##the function returned true; enter handle_const, which will call select_range_partitions##
handle_opexpr (expr=0x2068360, context=0x7ffe113be790, result=0x2070a10) at src/pg_pathman.c:1335
1335                                    handle_const(ExtractConst(param, context),
(gdb)      ##entered select_range_partitions##
Breakpoint 15, select_range_partitions (value=631843200000000, collid=0, cmp_func=0x7ffe113be650, ranges=0x1ff0040, nranges=11, strategy=4,
    result=0x2070a10) at src/pg_pathman.c:531
(gdb) bt   ##print the stack to see the call chain##
#0  select_range_partitions (value=631843200000000, collid=0, cmp_func=0x7ffe113be650, ranges=0x1ff0040, nranges=11, strategy=4,
    result=0x2070a10) at src/pg_pathman.c:540
#1  0x00007fef8a859ea3 in handle_const (c=0x20682f0, collid=0, strategy=4, context=0x7ffe113be790, result=0x2070a10) at src/pg_pathman.c:929
#2  0x00007fef8a85abd7 in handle_opexpr (expr=0x2068360, context=0x7ffe113be790, result=0x2070a10) at src/pg_pathman.c:1335
#3  0x00007fef8a8599d4 in walk_expr_tree (expr=0x2068360, context=0x7ffe113be790) at src/pg_pathman.c:734
#4  0x00007fef8a865bfd in pathman_rel_pathlist_hook (root=0x2067e40, rel=0x206fcc8, rti=1, rte=0x20674b8) at src/hooks.c:345   

    ##lower stack frames omitted##

How select_range_partitions chooses the partitions

(gdb) 
      ##here value is the lower bound of our search range, i.e. the >= value '2020-01-09 00:00:00'; nranges=11 shows that there are 11 partitions##
Breakpoint 15, select_range_partitions (value=631843200000000, collid=0, cmp_func=0x7ffe113be650, ranges=0x1ff0040, nranges=11, strategy=4,
    result=0x2070a10) at src/pg_pathman.c:531
531             bool    lossy = false,
(gdb)
540             int             startidx = 0,
(gdb) bt

(gdb)
541                             endidx = nranges - 1,
(gdb)
546             Bound   value_bound = MakeBound(value); /* convert value to Bound */
(gdb)
550             result->found_gap = false;
(gdb)
(gdb) p *cmp_func      ##the comparison function in use is timestamp_cmp##
$40 = {
  fn_addr = 0x8b2ffa <timestamp_cmp>, 
  fn_oid = 2045,
  fn_nargs = 2,
  fn_strict = 1 '\001',
  fn_retset = 0 '\000',
  fn_stats = 2 '\002',
  fn_extra = 0x0,
  fn_mcxt = 0x1f82948,
  fn_expr = 0x0
}

(gdb)
553             if (nranges == 0)
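(gdb)                   ##cmp_func is timestamp_cmp, which returns (dt1 < dt2) ? -1 : ((dt1 > dt2) ? 1 : 0)##
						##cmp_bounds is a helper that calls cmp_func to compare value_bound (the range constant from our SQL, i.e. the >= or < value) with a partition's range_min and range_max##
						##cmp_min = -1 means value_bound is less than ranges[i].min##
						##cmp_max = -1 means value_bound is less than ranges[i].max; together the two decide whether ranges[i] is the partition being looked for##
						 
						##a neat touch: it first compares against ranges[startidx].min (lower bound of the first partition) and ranges[endidx].max (upper bound of the last partition) to check whether the value lies within the overall partitioned range; if not it returns early, otherwise it goes on to search##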
566                     cmp_min = cmp_bounds(cmp_func, collid, &value_bound, &ranges[startidx].min);
(gdb)                  
567                     cmp_max = cmp_bounds(cmp_func, collid, &value_bound, &ranges[endidx].max);
(gdb)
569                     if ((cmp_min <= 0 && strategy == BTLessStrategyNumber) ||
(gdb) p nranges
$35 = 11                ##11 partitions in total##
(gdb) p value_bound     ##the current range constant is the value after >= in the SQL##
$36 = {
  value = 631843200000000,
  is_infinite = 0 '\000'
}
(gdb) p ranges[startidx] ##ranges[0] is the first partition##
$37 = {
  child_oid = 16857,
  min = {
    value = 631152000000000,  
    is_infinite = 0 '\000'
  },
  max = {
    value = 631756800000000,
    is_infinite = 0 '\000'
  }
}
(gdb) n


(gdb)
646                     else if (is_greater)
(gdb)
647                             startidx = i + 1;
(gdb)
651             }
(gdb)                   ##from here on it works on partition indexes, essentially a binary search over them##
611                     i = startidx + (endidx - startidx) / 2;  
(gdb)
615                     cmp_min = cmp_bounds(cmp_func, collid, &value_bound, &ranges[i].min);
(gdb)
616                     cmp_max = cmp_bounds(cmp_func, collid, &value_bound, &ranges[i].max);
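(gdb)                   ##if is_less is false, the value (of the >= condition) is greater than or equal to the current partition's lower bound range_min##
618                     is_less = (cmp_min < 0 || (cmp_min == 0 && strategy == BTLessStrategyNumber)); 
(gdb)                   ##if is_greater is false, the value (of the >= condition) is less than the current partition's upper bound range_max##
619                     is_greater = (cmp_max > 0 || (cmp_max >= 0 && strategy != BTLessStrategyNumber));
(gdb)                   ##if this condition is true, the partition holding the lower bound of the query has been found##      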
621                     if (!is_less && !is_greater) 
(gdb)
623                             if (strategy == BTGreaterEqualStrategyNumber && cmp_min == 0)
(gdb)
625                             else if (strategy == BTLessStrategyNumber && cmp_max == 0)
(gdb)                                 
628                                     lossy = true;      
(gdb) p ranges[i]
$38 = {
  child_oid = 16863,                   ##the lower bound of the query falls in partition oid=16863, i.e. qsump_pacloud_oscginfo_activity_detail_info_day_2##
  min = {
    value = 631756800000000,
    is_infinite = 0 '\000'
  },
  max = {
    value = 632361600000000,
    is_infinite = 0 '\000'
  }
}
(gdb) n
633                             break;
(gdb)
657             switch(strategy)
(gdb)
680                             if (lossy)
(gdb)
682                                     result->rangeset = list_make1_irange(make_irange(i, i, IR_LOSSY));
(gdb) n
683                                     if (i < nranges - 1)
(gdb)                                   ##add the matched lower-bound partition to the result->rangeset node## 
685                                     lappend_irange(result->rangeset,
(gdb)
684                                             result->rangeset =
(gdb)
697                             break;

(gdb)##now matching the upper bound; value is the < condition's value '2020-01-19 00:00:00'##
Breakpoint 20, select_range_partitions (value=632707200000000, collid=0, cmp_func=0x7ffe113be650, ranges=0x1ff0040, nranges=11, strategy=1,
    result=0x2036d30) at src/pg_pathman.c:531
531             bool    lossy = false,


(gdb) n
611                     i = startidx + (endidx - startidx) / 2;
(gdb)


(gdb)
615                     cmp_min = cmp_bounds(cmp_func, collid, &value_bound, &ranges[i].min);
(gdb)
616                     cmp_max = cmp_bounds(cmp_func, collid, &value_bound, &ranges[i].max);
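(gdb)                   ##if is_less is false, the value (of the < condition) is greater than the current partition's lower bound range_min##
618                     is_less = (cmp_min < 0 || (cmp_min == 0 && strategy == BTLessStrategyNumber));
(gdb)                   ##if is_greater is false, the value (of the < condition) is less than or equal to the current partition's upper bound range_max##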
619                     is_greater = (cmp_max > 0 || (cmp_max >= 0 && strategy != BTLessStrategyNumber));
(gdb)
(gdb) p is_less
$45 = 0 '\000'
(gdb) p is_greater
$46 = 0 '\000'
(gdb)                   ##if this condition is true, the partition holding the upper bound of the query has been found##
621                     if (!is_less && !is_greater)
(gdb)
623                             if (strategy == BTGreaterEqualStrategyNumber && cmp_min == 0)
(gdb)
625                             else if (strategy == BTLessStrategyNumber && cmp_max == 0)
(gdb)
628                                     lossy = true;
(gdb)
(gdb) p i
$44 = 2
(gdb)
(gdb) p ranges[i]             ##the upper bound falls in oid=16869, i.e. qsump_pacloud_oscginfo_activity_detail_info_day_3##
$43 = {
  child_oid = 16869,
  min = {
    value = 632361600000000,
    is_infinite = 0 '\000'
  },
  max = {
    value = 632966400000000,
    is_infinite = 0 '\000'
  }
}

633                             break;
(gdb)
657             switch(strategy)
(gdb)
661                             if (lossy)
(gdb)                            ##add the matched upper-bound partition to the result->rangeset node##
663                                     result->rangeset = list_make1_irange(make_irange(i, i, IR_LOSSY));
(gdb)
664                                     if (i > 0)

##at this point every partition the query needs has been matched##
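
The lookup traced above boils down to a binary search over sorted, contiguous ranges. Below is a simplified sketch of that idea, not the pg_pathman implementation: `ToyRange` and `find_partition` are made-up names, the sketch only locates the partition containing a value, and it ignores the strategy and lossy handling that the real select_range_partitions performs. The three sample ranges are the values printed in the trace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One partition covering the half-open interval [min, max). */
typedef struct { uint32_t child_oid; int64_t min; int64_t max; } ToyRange;

/*
 * Return the index of the partition containing `value`, or -1 when the value
 * lies outside [ranges[0].min, ranges[nranges-1].max).
 * `ranges` must be sorted and contiguous.
 */
static int
find_partition(const ToyRange *ranges, int nranges, int64_t value)
{
    int startidx = 0, endidx = nranges - 1;

    assert(ranges != NULL || nranges == 0);

    /* Cheap pre-check against the outer bounds, as in the traced code. */
    if (nranges == 0 || value < ranges[0].min || value >= ranges[endidx].max)
        return -1;

    while (startidx <= endidx)
    {
        int i = startidx + (endidx - startidx) / 2;

        if (value < ranges[i].min)
            endidx = i - 1;         /* look in the left half */
        else if (value >= ranges[i].max)
            startidx = i + 1;       /* look in the right half */
        else
            return i;               /* min <= value < max */
    }
    return -1;                      /* only reachable if the ranges have gaps */
}

/* First three partitions from the trace (microseconds since 2000-01-01). */
static const ToyRange trace_ranges[] = {
    {16857, INT64_C(631152000000000), INT64_C(631756800000000)},
    {16863, INT64_C(631756800000000), INT64_C(632361600000000)},
    {16869, INT64_C(632361600000000), INT64_C(632966400000000)},
};
```

As a sanity check on the numbers, 631843200000000 microseconds is 7313 days after 2000-01-01, which is 2020-01-09, and it falls in the second range (oid 16863), just as the trace shows.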

Next, the selected partitions are attached to the root node. The loop body runs twice: append_child_relation adds the two matched partitions to the root node.

384                     parent_rel = heap_open(rte->relid, NoLock);
(gdb)
387                     if (prel->enable_parent)
(gdb)
393                     foreach(lc, ranges)
(gdb)
395                             IndexRange irange = lfirst_irange(lc);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
393                     foreach(lc, ranges)
(gdb) p i
$49 = 3
(gdb) p children[2]
$51 = 16869
(gdb)
(gdb) p children[1]
$52 = 16863
(gdb)

The cost estimation and get_cheapest_fractional_path steps are omitted here. Let's go straight to the final plan tree.


pg_plan_query (querytree=0x1fefd80, cursorOptions=256, boundParams=0x0) at postgres.c:792
792             if (log_planner_stats)
(gdb)           ##if the Debug_print_plan setting is enabled, the complete plan tree is written to the log##
817             if (Debug_print_plan)
(gdb)                   
818                     elog_node_display(LOG, "plan", plan, Debug_pretty_print);
(gdb) p *plan
$59 = {
  type = T_PlannedStmt,
  commandType = CMD_SELECT,
  queryId = 0,
  hasReturning = 0 '\000',
  hasModifyingCTE = 0 '\000',
  canSetTag = 1 '\001',
  transientPlan = 0 '\000',
  dependsOnRole = 0 '\000',
  parallelModeNeeded = 0 '\000',
  planTree = 0x206fde8,
  rtable = 0x20700b8,
  resultRelations = 0x0,
  utilityStmt = 0x0,
  subplans = 0x0,
  rewindPlanIDs = 0x0,
  rowMarks = 0x0,
  relationOids = 0x2070108,
  invalItems = 0x0,
  nParamExec = 0
} ##the final plan->relationOids list contains three nodes, the parent table and two partitions, which matches the "good" execution plan seen earlier##
(gdb) p *plan->relationOids
$60 = {
  type = T_OidList,
  length = 3,
  head = 0x20700e8,
  tail = 0x2070428
}
(gdb) p *plan->relationOids->head ##parent table##
$61 = {
  data = {
    ptr_value = 0x418d,
    int_value = 16781,
    oid_value = 16781
  },
  next = 0x20702d8
}     
		##partition qsump_pacloud_oscginfo_activity_detail_info_day_2##
(gdb) p *plan->relationOids->head->next 
$62 = {
  data = {
    ptr_value = 0x41df,
    int_value = 16863,
    oid_value = 16863
  },
  next = 0x2070428
}
		##partition qsump_pacloud_oscginfo_activity_detail_info_day_3##
(gdb) p *plan->relationOids->tail
$63 = {
  data = {
    ptr_value = 0x41e5,
    int_value = 16869,
    oid_value = 16869
  },
  next = 0x0
}
(gdb) n
822             return plan;
(gdb)
823     }
	##end of trace##

3. Debugging the to_timestamp() form of the statement

Only the parts that differ from the previous trace are shown here.

session1:

postgres=# explain select * from qsump_pacloud_oscginfo_activity_detail_info_day where add_time >=to_timestamp( '2020-01-09 00:00:00','yyyy-mm-dd hh24-mi-ss') and add_time < to_timestamp('2020-01-19 00:00:00','yyyy-mm-dd hh24-mi-ss');
|

session2:

##this is where it differs: the node type was T_Const before, but is now T_FuncExpr##
Breakpoint 20, IsConstValue (node=0x2071350, context=0x7ffe113be790) at src/pg_pathman.c:114
114             switch (nodeTag(node))
(gdb) p *node
$2 = {
  type = T_FuncExpr
}
(gdb) list
109
110     /* Can we transform this node into a Const? */
111     static bool
112     IsConstValue(Node *node, const WalkerContext *context)
113     {
114             switch (nodeTag(node))
115             {
116                     case T_Const:
117                             return true;
118
(gdb)
119                     case T_Param:
120                             return WcxtHasExprContext(context);
121
122                     case T_RowExpr:
123                             {
124                                     RowExpr    *row = (RowExpr *) node;
125                                     ListCell   *lc;
126
127                                     /* Can't do anything about RECORD of wrong type */
128                                     if (row->row_typeid != context->prel->ev_type)
(gdb)
129                                             return false;
130
131                                     /* Check that args are const values */
132                                     foreach (lc, row->args)
133                                             if (!IsConstValue((Node *) lfirst(lc), context))
134                                                     return false;
135                             }
136                             return true;
137                    ##IsConstValue has no case branch for T_FuncExpr, so it falls through to default and returns false##
138                     default:
(gdb)
139                             return false;
140             }
141     }
142
143     /* Extract a Const from node that has been checked by IsConstValue() */
144     static Const *
145     ExtractConst(Node *node, const WalkerContext *context)
146     {
147             ExprState          *estate;
148             ExprContext        *econtext = context->econtext;
(gdb) n
139                             return false;
(gdb)
141     }
(gdb)
handle_opexpr (expr=0x20713c0, context=0x7ffe113be790, result=0x2079e40) at src/pg_pathman.c:1342
1342                            else if (IsA(param, Param) || IsA(param, Var))
(gdb) n
                ##because IsConstValue returned false, handle_const (and hence select_range_partitions) is never called; instead all partitions are added to the result->rangeset node##
1352            result->rangeset = list_make1_irange_full(prel, IR_LOSSY);
(gdb) list
1347                                    return; /* done, exit */
1348                            }
1349                    }
1350            }
1351
1352            result->rangeset = list_make1_irange_full(prel, IR_LOSSY);
1353            result->paramsel = 1.0;
1354    }
1355
1356
(gdb) n
1353            result->paramsel = 1.0;
(gdb)
1354    }
(gdb)

All partitions are attached to the root node, and our where condition is applied only as a filter. No partitions are selected by the condition beforehand.

						##all 11 partitions are attached to the root node##
(gdb)
384                     parent_rel = heap_open(rte->relid, NoLock);
(gdb)
387                     if (prel->enable_parent)
(gdb)
393                     foreach(lc, ranges)
(gdb)
395                             IndexRange irange = lfirst_irange(lc);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                                     append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                             for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
393                     foreach(lc, ranges)
(gdb) p i
$3 = 11

Now straight to the resulting plan tree

(gdb)
pg_plan_query (querytree=0x1fefd80, cursorOptions=256, boundParams=0x0) at postgres.c:792
792             if (log_planner_stats)
(gdb)
817             if (Debug_print_plan)
(gdb)
818                     elog_node_display(LOG, "plan", plan, Debug_pretty_print);
(gdb) p *plan
$5 = {
  type = T_PlannedStmt,
  commandType = CMD_SELECT,
  queryId = 0,
  hasReturning = 0 '\000',
  hasModifyingCTE = 0 '\000',
  canSetTag = 1 '\001',
  transientPlan = 0 '\000',
  dependsOnRole = 0 '\000',
  parallelModeNeeded = 0 '\000',
  planTree = 0x208f028,
  rtable = 0x208f2f8,
  resultRelations = 0x0,
  utilityStmt = 0x0,
  subplans = 0x0,
  rewindPlanIDs = 0x0,
  rowMarks = 0x0,
  relationOids = 0x208f348,
  invalItems = 0x0,
  nParamExec = 0
}
 		##the final plan->relationOids list contains 12 nodes: the parent table plus 11 partitions##
(gdb) p *plan->relationOids
$4 = {
  type = T_OidList,
  length = 12,        
  head = 0x2091338,
  tail = 0x2092218
}
(gdb) p *plan->relationOids->head  ##parent table##
$5 = {
  data = {
    ptr_value = 0x418d,
    int_value = 16781,
    oid_value = 16781
  },
  next = 0x20914b8
}
##the 11th partition, qsump_pacloud_oscginfo_activity_detail_info_day_11##
(gdb) p *plan->relationOids->tail
$6 = {
  data = {
    ptr_value = 0x4215,
    int_value = 16917,
    oid_value = 16917
  },
  next = 0x0
}
(gdb)

IV. Summary and reflections

1. Ideas for a fix

The gdb trace makes it clear that IsConstValue has no case branch for T_FuncExpr, so T_FuncExpr nodes fall through to default and no partition selection takes place.

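Could the fix therefore be one of the following?
1) Add a T_FuncExpr case branch to IsConstValue so that partitions are selected
2) A skim of the core postgresql source shows many ways of handling such nodes, for example reducing a T_FuncExpr to a simple_expr. Could pg_pathman process the node type similarly and convert the T_FuncExpr into a T_Const?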
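
To make idea 2) concrete, here is a toy model in C. It is a sketch under loud assumptions, not a patch: `ToyNode`, `toy_is_const` and `toy_fold` are invented for illustration and do not use PostgreSQL's node structures. The check mirrors the shape of the traced IsConstValue (only constants qualify), and the fold step evaluates a function call over a constant argument once and replaces it with a constant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { TOY_CONST, TOY_PARAM, TOY_FUNC_EXPR } ToyTag;

typedef struct ToyNode
{
    ToyTag   tag;
    int64_t  value;                 /* TOY_CONST: the constant itself */
    int64_t (*fn)(int64_t);         /* TOY_FUNC_EXPR: function to call */
    const struct ToyNode *arg;      /* TOY_FUNC_EXPR: its single argument */
} ToyNode;

/* Same shape as the traced check: anything but a constant is rejected. */
static bool toy_is_const(const ToyNode *node)
{
    assert(node != NULL);
    switch (node->tag)
    {
        case TOY_CONST:
            return true;
        default:
            return false;
    }
}

/*
 * Hypothetical pre-processing step: turn a function call over a constant
 * argument into a constant node. Returns true and fills *out on success.
 */
static bool toy_fold(const ToyNode *node, ToyNode *out)
{
    if (node->tag == TOY_CONST)
    {
        *out = *node;
        return true;
    }
    if (node->tag == TOY_FUNC_EXPR && node->fn != NULL &&
        node->arg != NULL && toy_is_const(node->arg))
    {
        out->tag = TOY_CONST;
        out->value = node->fn(node->arg->value);   /* evaluate once */
        out->fn = NULL;
        out->arg = NULL;
        return true;
    }
    return false;                   /* e.g. a parameter: cannot fold here */
}

/* Example function for the toy: shift a microsecond timestamp by one day. */
static int64_t toy_add_one_day(int64_t usec)
{
    return usec + INT64_C(86400000000);
}
```

One caveat any real change would have to respect: folding at plan time is only safe if the function's result cannot change before execution. To my understanding, to_timestamp(text, text) is declared STABLE rather than IMMUTABLE in PostgreSQL, so a real fix would probably need to evaluate it at execution time instead of baking the result into the plan.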

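I tested pg_pathman 1.5.11, the latest version at the time of writing, and it has the same problem. postgresql's native declarative partitioning does not: it handles both forms.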

/* Can we transform this node into a Const? */
static bool
IsConstValue(Node *node, const WalkerContext *context)
{
	switch (nodeTag(node))
	{
		case T_Const:
			return true;

		case T_Param:
			return WcxtHasExprContext(context);

		case T_RowExpr:
			{
				RowExpr	   *row = (RowExpr *) node;
				ListCell   *lc;

				/* Can't do anything about RECORD of wrong type */
				if (row->row_typeid != context->prel->ev_type)
					return false;

				/* Check that args are const values */
				foreach (lc, row->args)
					if (!IsConstValue((Node *) lfirst(lc), context))
						return false;
			}
			return true;

		default:
			return false;
	}
}

2. Is a fix necessary?

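Whether this needs fixing depends on your point of view. I think the current behaviour is somewhat unfriendly to application developers, who are limited to casts such as ::date.
Experts in the community told me that native declarative partitioning in postgresql 12 performs almost on par with pg_pathman, so on version 12 and later it is worth preferring declarative partitioning and dropping the extension. For versions between 9.5 and 11, whose built-in partitioning performs poorly, pg_pathman may still be needed.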

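I am a newcomer who has only recently started with postgresql. I am not familiar with its code and my coding skills are weak, so I can only offer these small ideas for a fix. I have reported the problem to our company's database expert team. If veterans and experts in the community already have a better solution, please share it for the benefit of everyone.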
