I. Problem Background
Recently, in a test environment, a postgres process repeatedly exhausted the host's memory and triggered the OOM killer, even causing the host to reboot several times and interrupting some services. The OOM records in the messages log showed the process holding tens of GB of anonymous memory.
The process appeared to be running a simple select, shown below.
(For information-security reasons, the table names and other details in the SQL examples in this post come from my personal machine and have nothing to do with Ping An's business.)
select * from qsump_pacloud_oscginfo_activity_detail_info_day where id ='zbngjih5xd' and add_time >=to_timestamp( '2020-01-09 00:00:00','yyyy-mm-dd hh24-mi-ss') and add_time < to_timestamp('2020-08-19 00:00:00','yyyy-mm-dd hh24-mi-ss');
Printing the execution plan for this SQL revealed that it scanned more than 20,000 partitions; the table uses pg_pathman RANGE partitioning. That raises two questions:
(1) Why were the matching partitions not selected, with every partition scanned instead?
(2) The partition range is 7 days and the table stores data from 2018 to the present, so why do 20,000+ partitions exist?
For question 1, I had run into something similar before: if the filter condition wraps the partition key in a function expression, or in a conversion function such as to_date() or to_timestamp(), the matching partitions are not selected and all partitions are scanned; casts written as ::date or ::timestamp, however, work correctly.
Note:
It later turned out that this only holds for the psql, pgAdmin, and Navicat clients; with the JDBC driver, even the ::timestamp form is sometimes parsed into a T_FuncExpr condition and bypasses the partition-selection logic, so in Java code it is better to pass a genuine timestamp-typed parameter and drop the cast entirely.
An example:
With the ::timestamp form, the execution plan scans only the two partitions covering the query range:
postgres=# explain select * from qsump_pacloud_oscginfo_activity_detail_info_day where add_time >= '2020-01-09 00:00:00'::timestamp and add_time < '2020-01-19 00:00:00'::timestamp;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------
Append (cost=0.00..71.00 rows=1360 width=12)
-> Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_2 (cost=0.00..35.50 rows=680 width=12)
Filter: (add_time >= '2020-01-09 00:00:00'::timestamp without time zone)
-> Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_3 (cost=0.00..35.50 rows=680 width=12)
Filter: (add_time < '2020-01-19 00:00:00'::timestamp without time zone)
(5 rows)
With the to_timestamp() form, the execution plan scans all 11 partitions:
postgres=# explain select * from qsump_pacloud_oscginfo_activity_detail_info_day where add_time >=to_timestamp( '2020-01-09 00:00:00','yyyy-mm-dd hh24-mi-ss') and add_time < to_timestamp('2020-01-19 00:00:00','yyyy-mm-dd hh24-mi-ss');
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------
Append (cost=0.00..558.80 rows=110 width=12)
-> Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_1 (cost=0.00..50.80 rows=10 width=12)
Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19
00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
-> Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_2 (cost=0.00..50.80 rows=10 width=12)
Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19
00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
-> Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_3 (cost=0.00..50.80 rows=10 width=12)
Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19
00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
-> Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_4 (cost=0.00..50.80 rows=10 width=12)
Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19
00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
-> Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_5 (cost=0.00..50.80 rows=10 width=12)
Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19
00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
-> Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_6 (cost=0.00..50.80 rows=10 width=12)
Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19
00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
-> Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_7 (cost=0.00..50.80 rows=10 width=12)
Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19
00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
-> Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_8 (cost=0.00..50.80 rows=10 width=12)
Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19
00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
-> Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_9 (cost=0.00..50.80 rows=10 width=12)
Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19
00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
-> Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_10 (cost=0.00..50.80 rows=10 width=12)
Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19
00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
-> Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_11 (cost=0.00..50.80 rows=10 width=12)
Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19
00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
(23 rows)
postgres=#
So rewriting to_timestamp() as ::timestamp worked around the problem for the moment. I had never investigated the underlying cause before; the rest of this post analyzes the root cause of question 1.
Now for question 2: why do 20,000+ partitions exist? Looking up the physical files through each table's relfilenode showed that they had all been created at a certain moment the previous afternoon. The server log for that time showed a single row inserted into the table, with a partition-key value far in the future. The table happens to use automatic RANGE expansion: when an inserted value falls outside all existing partitions, a new partition is created automatically and every missing partition in between is filled in. That single INSERT produced the 20,000+ partitions. Truly a disaster caused by one SQL statement.
II. Partition Analysis
So what causes the behavior in question 1? This is where open source comes in handy: you can dig the answer out of the source code yourself.
As is well known, pg_pathman works through hooks that modify the original query tree and plan tree. The PostgreSQL source already provides entry points for such hook-based extensions: at startup, process_shared_preload_libraries() locates the shared library for each configured extension and runs its _PG_init() function, which performs initialization and installs the extension's hook functions; when execution reaches a hook entry point, the hook is simply invoked.
How does pg_pathman determine which RANGE partitions are needed? Through the following function:
/* Given 'value' and 'ranges', return selected partitions list */
void
select_range_partitions(const Datum value,
const Oid collid,
FmgrInfo *cmp_func,
const RangeEntry *ranges,
const int nranges,
const int strategy,
WrapperNode *result) /* returned partitions */
{
/* The function body is long and omitted here; its overall logic is described in the gdb session below. */
}
I skimmed the source code. pg_pathman is not a large codebase, but for someone with weak coding skills like me the logic could not be untangled in a moment, and it felt a bit hopeless. So I picked the clumsiest approach, yet the one that works best for me: tracing with gdb.
III. GDB Tracing
1. Preparation
session 1: runs the SQL
session 2: attaches the debugger
The table OIDs are as follows:
oid | relname
-------+-------------------------------------------------
16781 | qsump_pacloud_oscginfo_activity_detail_info_day --parent table, 11 partitions in total
16863 | qsump_pacloud_oscginfo_activity_detail_info_day_2
16869 | qsump_pacloud_oscginfo_activity_detail_info_day_3
....
16917 | qsump_pacloud_oscginfo_activity_detail_info_day_11
2. Tracing the ::timestamp form of the statement
session 1:
postgres=# select pg_backend_pid();
pg_backend_pid
----------------
31698
postgres=# explain select * from qsump_pacloud_oscginfo_activity_detail_info_day where add_time >= '2020-01-09 00:00:00'::timestamp and add_time < '2020-01-19 00:00:00'::timestamp;
|
session 2:
To observe the full process, I set breakpoints on nearly all of pathman's hook functions and on the key functions that generate the plan tree; only the trace of the range-partition selection logic is shown here.
In the debug output, annotations are marked as ## annotation ##.
[postgres@postgres_zabbix ~]$ gdb --pid 31698
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 31698
Reading symbols from /home/postgres/postgresql-9.6.6/pg9debug/bin/postgres...done.
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /home/postgres/postgresql-9.6.6/pg9debug/lib/pg_pathman.so...done.
Loaded symbols for /home/postgres/postgresql-9.6.6/pg9debug/lib/pg_pathman.so
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
0x00007fef912c15e3 in __epoll_wait_nocancel () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.6.x86_64
(gdb) b exec_simple_query
Breakpoint 1 at 0x7b9bc2: file postgres.c, line 867.
(gdb) b pg_plan_queries
Breakpoint 2 at 0x7b9b25: file postgres.c, line 834.
(gdb) b pathman_rel_pathlist_hook
Breakpoint 3 at 0x7fef8a86594c: file src/hooks.c, line 263.
(gdb) b pathman_join_pathlist_hook
Breakpoint 4 at 0x7fef8a8652d7: file src/hooks.c, line 79.
(gdb) b pathman_shmem_startup_hook
Breakpoint 5 at 0x7fef8a866772: file src/hooks.c, line 687.
(gdb) b pathman_post_parse_analysis_hook
Breakpoint 6 at 0x7fef8a86650b: file src/hooks.c, line 587.
(gdb) b pathman_planner_hook
Breakpoint 7 at 0x7fef8a8662d3: file src/hooks.c, line 524.
(gdb) b pathman_process_utility_hook
Breakpoint 8 at 0x7fef8a86696b: file src/hooks.c, line 795.
(gdb) b pg_plan_query
Breakpoint 9 at 0x7b9a8e: file postgres.c, line 778.
(gdb) b planner
Breakpoint 10 at 0x6fc596: file planner.c, line 175.
(gdb) b add_partition_filters
Breakpoint 11 at 0x7fef8a86a713: file src/planner_tree_modification.c, line 378.
(gdb) b partition_filter_visitor
Breakpoint 12 at 0x7fef8a86a74a: file src/planner_tree_modification.c, line 390.
(gdb) b get_pathman_relation_info
Breakpoint 13 at 0x7fef8a851c05: file src/relation_info.c, line 361.
(gdb) b cache_parent_of_partition
Breakpoint 14 at 0x7fef8a852e6d: file src/relation_info.c, line 1015.
(gdb) b handle_modification_query
Breakpoint 15 at 0x7fef8a86a3bf: file src/planner_tree_modification.c, line 255.
(gdb) b select_range_partitions
Breakpoint 16 at 0x7fef8a85940d: file src/pg_pathman.c, line 531.
(gdb) b walk_expr_tree
Breakpoint 17 at 0x7fef8a85992d: file src/pg_pathman.c, line 717.
(gdb) b handle_opexpr
Breakpoint 18 at 0x7fef8a85ab0e: file src/pg_pathman.c, line 1317.
(gdb) b IsConstValue
Breakpoint 19 at 0x7fef8a858811: file src/pg_pathman.c, line 114.
(gdb) n
(gdb) set print pretty
The key call path into select_range_partitions:
(gdb) list ##print the body of handle_opexpr##
1328 int strategy;
1329
1330 tce = lookup_type_cache(prel->ev_type, TYPECACHE_BTREE_OPFAMILY);
1331 strategy = get_op_opfamily_strategy(expr->opno, tce->btree_opf);
1332 ##when IsConstValue() returns true, handle_const() is called##
1333 if (IsConstValue(param, context))
1334 {
1335 handle_const(ExtractConst(param, context),
1336 expr->inputcollid,
1337 strategy, context, result);
(gdb) n
Breakpoint 19, IsConstValue (node=0x20682f0, context=0x7ffe113be790) at src/pg_pathman.c:114 ##hit breakpoint 19: check whether the >= and < conditions in our SQL are of type T_Const##
114 switch (nodeTag(node))
(gdb) p *node ##the expr node we passed in has node_type T_Const##
$4 = {
type = T_Const
}
(gdb) list ##print the function body##
109
110 /* Can we transform this node into a Const? */
111 static bool
112 IsConstValue(Node *node, const WalkerContext *context)
113 {
114 switch (nodeTag(node))
115 { ##returns true when the node type is T_Const##
116 case T_Const:
117 return true;
118
(gdb) list
119 case T_Param:
120 return WcxtHasExprContext(context);
121
122 case T_RowExpr:
123 {
124 RowExpr *row = (RowExpr *) node;
125 ListCell *lc;
126
127 /* Can't do anything about RECORD of wrong type */
128 if (row->row_typeid != context->prel->ev_type)
(gdb) n
117 return true;
(gdb)
141 }
(gdb) ##the function returned true; we enter handle_const and prepare to call select_range_partitions##
handle_opexpr (expr=0x2068360, context=0x7ffe113be790, result=0x2070a10) at src/pg_pathman.c:1335
1335 handle_const(ExtractConst(param, context),
(gdb) ##entered the select_range_partitions function##
Breakpoint 15, select_range_partitions (value=631843200000000, collid=0, cmp_func=0x7ffe113be650, ranges=0x1ff0040, nranges=11, strategy=4,
result=0x2070a10) at src/pg_pathman.c:531
(gdb) bt ##print the backtrace to see the call chain##
#0 select_range_partitions (value=631843200000000, collid=0, cmp_func=0x7ffe113be650, ranges=0x1ff0040, nranges=11, strategy=4,
result=0x2070a10) at src/pg_pathman.c:540
#1 0x00007fef8a859ea3 in handle_const (c=0x20682f0, collid=0, strategy=4, context=0x7ffe113be790, result=0x2070a10) at src/pg_pathman.c:929
#2 0x00007fef8a85abd7 in handle_opexpr (expr=0x2068360, context=0x7ffe113be790, result=0x2070a10) at src/pg_pathman.c:1335
#3 0x00007fef8a8599d4 in walk_expr_tree (expr=0x2068360, context=0x7ffe113be790) at src/pg_pathman.c:734
#4 0x00007fef8a865bfd in pathman_rel_pathlist_hook (root=0x2067e40, rel=0x206fcc8, rti=1, rte=0x20674b8) at src/hooks.c:345
##lower stack frames omitted##
The partition-selection logic inside select_range_partitions:
(gdb)
##value here is the left end of the query range, i.e. the >= constant '2020-01-09 00:00:00'; nranges=11 shows there are 11 partitions##
Breakpoint 15, select_range_partitions (value=631843200000000, collid=0, cmp_func=0x7ffe113be650, ranges=0x1ff0040, nranges=11, strategy=4,
result=0x2070a10) at src/pg_pathman.c:531
531 bool lossy = false,
(gdb)
540 int startidx = 0,
(gdb) bt
(gdb)
541 endidx = nranges - 1,
(gdb)
546 Bound value_bound = MakeBound(value); /* convert value to Bound */
(gdb)
550 result->found_gap = false;
(gdb)
(gdb) p *cmp_func ##the comparison function is timestamp_cmp##
$40 = {
fn_addr = 0x8b2ffa <timestamp_cmp>,
fn_oid = 2045,
fn_nargs = 2,
fn_strict = 1 '\001',
fn_retset = 0 '\000',
fn_stats = 2 '\002',
fn_extra = 0x0,
fn_mcxt = 0x1f82948,
fn_expr = 0x0
(gdb)
553 if (nranges == 0)
(gdb) ##cmp_func is timestamp_cmp, whose return value is: return (dt1 < dt2) ? -1 : ((dt1 > dt2) ? 1 : 0);##
##cmp_bounds is a callback that uses cmp_func to compare value_bound (the constants of our query range, i.e. the >= and < values in the SQL) against a partition's range min and max##
##cmp_min = -1 means value_bound is less than ranges[i].min##
##cmp_max = -1 means value_bound is less than ranges[i].max; the two together determine whether the current ranges[i] is the partition being looked for##
##a neat detail: it first compares against ranges[startidx].min (the lower bound of the first partition) and ranges[endidx].max (the upper bound of the last partition) to check whether the value lies inside the overall partitioned range at all; if not it returns right away, otherwise it keeps searching##
566 cmp_min = cmp_bounds(cmp_func, collid, &value_bound, &ranges[startidx].min);
(gdb)
567 cmp_max = cmp_bounds(cmp_func, collid, &value_bound, &ranges[endidx].max);
(gdb)
569 if ((cmp_min <= 0 && strategy == BTLessStrategyNumber) ||
(gdb) p nranges
$35 = 11 ##11 partitions in total##
(gdb) p value_bound ##the current range constant is the >= value from the SQL##
$36 = {
value = 631843200000000,
is_infinite = 0 '\000'
}
(gdb) p ranges[startidx] ##ranges[0] is the first partition##
$37 = {
child_oid = 16857,
min = {
value = 631152000000000,
is_infinite = 0 '\000'
},
max = {
value = 631756800000000,
is_infinite = 0 '\000'
}
}
(gdb) n
(gdb)
646 else if (is_greater)
(gdb)
647 startidx = i + 1;
(gdb)
651 }
(gdb) ##now the partition indexes are manipulated, essentially a binary search over them##
611 i = startidx + (endidx - startidx) / 2;
(gdb)
615 cmp_min = cmp_bounds(cmp_func, collid, &value_bound, &ranges[i].min);
(gdb)
616 cmp_max = cmp_bounds(cmp_func, collid, &value_bound, &ranges[i].max);
(gdb) ##if is_less is false, the left value (the >= constant) is not below the current partition's lower bound range_min##
618 is_less = (cmp_min < 0 || (cmp_min == 0 && strategy == BTLessStrategyNumber));
(gdb) ##if is_greater is false, the left value (the >= constant) is below the current partition's upper bound range_max##
619 is_greater = (cmp_max > 0 || (cmp_max >= 0 && strategy != BTLessStrategyNumber));
(gdb) ##the if condition is true: the partition containing the query's left bound has been found##
621 if (!is_less && !is_greater)
(gdb)
623 if (strategy == BTGreaterEqualStrategyNumber && cmp_min == 0)
(gdb)
625 else if (strategy == BTLessStrategyNumber && cmp_max == 0)
(gdb)
628 lossy = true;
(gdb) p ranges[i]
$38 = {
child_oid = 16863, ##the query's left bound falls inside partition oid=16863, i.e. qsump_pacloud_oscginfo_activity_detail_info_day_2##
min = {
value = 631756800000000,
is_infinite = 0 '\000'
},
max = {
value = 632361600000000,
is_infinite = 0 '\000'
}
}
(gdb) n
633 break;
(gdb)
657 switch(strategy)
(gdb)
680 if (lossy)
(gdb)
682 result->rangeset = list_make1_irange(make_irange(i, i, IR_LOSSY));
(gdb) n
683 if (i < nranges - 1)
(gdb) ##append the matched left-bound partition to the result->rangeset node##
685 lappend_irange(result->rangeset,
(gdb)
684 result->rangeset =
(gdb)
697 break;
(gdb) ##now the right bound is matched; value is the < constant '2020-01-19 00:00:00'##
Breakpoint 20, select_range_partitions (value=632707200000000, collid=0, cmp_func=0x7ffe113be650, ranges=0x1ff0040, nranges=11, strategy=1,
result=0x2036d30) at src/pg_pathman.c:531
531 bool lossy = false,
(gdb) n
611 i = startidx + (endidx - startidx) / 2;
(gdb)
(gdb)
615 cmp_min = cmp_bounds(cmp_func, collid, &value_bound, &ranges[i].min);
(gdb)
616 cmp_max = cmp_bounds(cmp_func, collid, &value_bound, &ranges[i].max);
(gdb) ##if is_less is false, the value (the < constant) is not below the current partition's lower bound##
618 is_less = (cmp_min < 0 || (cmp_min == 0 && strategy == BTLessStrategyNumber));
(gdb) ##if is_greater is false, the value (the < constant) is below the current partition's upper bound range_max##
619 is_greater = (cmp_max > 0 || (cmp_max >= 0 && strategy != BTLessStrategyNumber));
(gdb)
(gdb) p is_less
$45 = 0 '\000'
(gdb) p is_greater
$46 = 0 '\000'
(gdb) ##the if condition is true: the partition containing the query's right bound has been found##
621 if (!is_less && !is_greater)
(gdb)
623 if (strategy == BTGreaterEqualStrategyNumber && cmp_min == 0)
(gdb)
625 else if (strategy == BTLessStrategyNumber && cmp_max == 0)
(gdb)
628 lossy = true;
(gdb)
(gdb) p i
$44 = 2
(gdb)
(gdb) p ranges[i] ##the right bound falls in partition oid=16869, i.e. qsump_pacloud_oscginfo_activity_detail_info_day_3##
$43 = {
child_oid = 16869,
min = {
value = 632361600000000,
is_infinite = 0 '\000'
},
max = {
value = 632966400000000,
is_infinite = 0 '\000'
}
}
633 break;
(gdb)
657 switch(strategy)
(gdb)
661 if (lossy)
(gdb) ##append the matched right-bound partition to the result->rangeset node##
663 result->rangeset = list_make1_irange(make_irange(i, i, IR_LOSSY));
(gdb)
664 if (i > 0)
##at this point every partition the query needs has been matched##
Next the partitions are attached to the root node. The loop runs twice, and append_child_relation() adds the two matched partitions to the root node:
384 parent_rel = heap_open(rte->relid, NoLock);
(gdb)
387 if (prel->enable_parent)
(gdb)
393 foreach(lc, ranges)
(gdb)
395 IndexRange irange = lfirst_irange(lc);
(gdb)
397 for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398 append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397 for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398 append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397 for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
393 foreach(lc, ranges)
(gdb) p i
$49 = 3
(gdb) p children[2]
$51 = 16869
(gdb)
(gdb) p children[1]
$52 = 16863
(gdb)
The cost-estimation and get_cheapest_fractional_path steps are omitted; let's look directly at the final plan tree:
pg_plan_query (querytree=0x1fefd80, cursorOptions=256, boundParams=0x0) at postgres.c:792
792 if (log_planner_stats)
(gdb) ##if the debug_print_plan parameter is enabled, the complete plan tree is written to the log##
817 if (Debug_print_plan)
(gdb)
818 elog_node_display(LOG, "plan", plan, Debug_pretty_print);
(gdb) p *plan
$59 = {
type = T_PlannedStmt,
commandType = CMD_SELECT,
queryId = 0,
hasReturning = 0 '\000',
hasModifyingCTE = 0 '\000',
canSetTag = 1 '\001',
transientPlan = 0 '\000',
dependsOnRole = 0 '\000',
parallelModeNeeded = 0 '\000',
planTree = 0x206fde8,
rtable = 0x20700b8,
resultRelations = 0x0,
utilityStmt = 0x0,
subplans = 0x0,
rewindPlanIDs = 0x0,
rowMarks = 0x0,
relationOids = 0x2070108,
invalItems = 0x0,
nParamExec = 0
} ##the final plan->relationOids list contains three nodes: the parent table plus the two partitions, matching the "good" execution plan shown earlier##
(gdb) p *plan->relationOids
$60 = {
type = T_OidList,
length = 3,
head = 0x20700e8,
tail = 0x2070428
}
(gdb) p *plan->relationOids->head ##parent table##
$61 = {
data = {
ptr_value = 0x418d,
int_value = 16781,
oid_value = 16781
},
next = 0x20702d8
}
##partition qsump_pacloud_oscginfo_activity_detail_info_day_2##
(gdb) p *plan->relationOids->head->next
$62 = {
data = {
ptr_value = 0x41df,
int_value = 16863,
oid_value = 16863
},
next = 0x2070428
}
##partition qsump_pacloud_oscginfo_activity_detail_info_day_3##
(gdb) p *plan->relationOids->tail
$63 = {
data = {
ptr_value = 0x41e5,
int_value = 16869,
oid_value = 16869
},
next = 0x0
}
(gdb) n
822 return plan;
(gdb)
823 }
##end of trace##
3. Tracing the to_timestamp() form of the statement
Only the parts that differ from the previous trace are shown.
session 1:
postgres=# explain select * from qsump_pacloud_oscginfo_activity_detail_info_day where add_time >=to_timestamp( '2020-01-09 00:00:00','yyyy-mm-dd hh24-mi-ss') and add_time < to_timestamp('2020-01-19 00:00:00','yyyy-mm-dd hh24-mi-ss');
|
session 2:
##here is the difference: previously node_type was T_Const, now it is T_FuncExpr##
Breakpoint 20, IsConstValue (node=0x2071350, context=0x7ffe113be790) at src/pg_pathman.c:114
114 switch (nodeTag(node))
(gdb) p *node
$2 = {
type = T_FuncExpr
}
(gdb) list
109
110 /* Can we transform this node into a Const? */
111 static bool
112 IsConstValue(Node *node, const WalkerContext *context)
113 {
114 switch (nodeTag(node))
115 {
116 case T_Const:
117 return true;
118
(gdb)
119 case T_Param:
120 return WcxtHasExprContext(context);
121
122 case T_RowExpr:
123 {
124 RowExpr *row = (RowExpr *) node;
125 ListCell *lc;
126
127 /* Can't do anything about RECORD of wrong type */
128 if (row->row_typeid != context->prel->ev_type)
(gdb)
129 return false;
130
131 /* Check that args are const values */
132 foreach (lc, row->args)
133 if (!IsConstValue((Node *) lfirst(lc), context))
134 return false;
135 }
136 return true;
137 ##IsConstValue() has no case branch for T_FuncExpr, so it falls through to default and returns false##
138 default:
(gdb)
139 return false;
140 }
141 }
142
143 /* Extract a Const from node that has been checked by IsConstValue() */
144 static Const *
145 ExtractConst(Node *node, const WalkerContext *context)
146 {
147 ExprState *estate;
148 ExprContext *econtext = context->econtext;
(gdb) n
139 return false;
(gdb)
141 }
(gdb)
handle_opexpr (expr=0x20713c0, context=0x7ffe113be790, result=0x2079e40) at src/pg_pathman.c:1342
1342 else if (IsA(param, Param) || IsA(param, Var))
(gdb) n
##because IsConstValue returned false, handle_const was never entered to call select_range_partitions; instead all partitions are appended directly to the result->rangeset node##
1352 result->rangeset = list_make1_irange_full(prel, IR_LOSSY);
(gdb) list
1347 return; /* done, exit */
1348 }
1349 }
1350 }
1351
1352 result->rangeset = list_make1_irange_full(prel, IR_LOSSY);
1353 result->paramsel = 1.0;
1354 }
1355
1356
(gdb) n
1353 result->paramsel = 1.0;
(gdb)
1354 }
(gdb)
All 11 partitions are appended to the root node; our where condition is handled merely as a filter, with no partition selection based on it.
##all 11 partitions appended to the root node##
(gdb)
384 parent_rel = heap_open(rte->relid, NoLock);
(gdb)
387 if (prel->enable_parent)
(gdb)
393 foreach(lc, ranges)
(gdb)
395 IndexRange irange = lfirst_irange(lc);
(gdb)
397 for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398 append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397 for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398 append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397 for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398 append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397 for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398 append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397 for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398 append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397 for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398 append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397 for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398 append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397 for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398 append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397 for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398 append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397 for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398 append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397 for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398 append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397 for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
393 foreach(lc, ranges)
(gdb) p i
$3 = 11
Straight to the final plan tree:
(gdb)
pg_plan_query (querytree=0x1fefd80, cursorOptions=256, boundParams=0x0) at postgres.c:792
792 if (log_planner_stats)
(gdb)
817 if (Debug_print_plan)
(gdb)
818 elog_node_display(LOG, "plan", plan, Debug_pretty_print);
(gdb) p *plan
$5 = {
type = T_PlannedStmt,
commandType = CMD_SELECT,
queryId = 0,
hasReturning = 0 '\000',
hasModifyingCTE = 0 '\000',
canSetTag = 1 '\001',
transientPlan = 0 '\000',
dependsOnRole = 0 '\000',
parallelModeNeeded = 0 '\000',
planTree = 0x208f028,
rtable = 0x208f2f8,
resultRelations = 0x0,
utilityStmt = 0x0,
subplans = 0x0,
rewindPlanIDs = 0x0,
rowMarks = 0x0,
relationOids = 0x208f348,
invalItems = 0x0,
nParamExec = 0
}
##the final plan->relationOids list contains 12 nodes: the parent table plus all 11 partitions##
(gdb) p *plan->relationOids
$4 = {
type = T_OidList,
length = 12,
head = 0x2091338,
tail = 0x2092218
}
(gdb) p *plan->relationOids->head ##parent table##
$5 = {
data = {
ptr_value = 0x418d,
int_value = 16781,
oid_value = 16781
},
next = 0x20914b8
}
##the 11th partition, qsump_pacloud_oscginfo_activity_detail_info_day_11##
(gdb) p *plan->relationOids->tail
$6 = {
data = {
ptr_value = 0x4215,
int_value = 16917,
oid_value = 16917
},
next = 0x0
}
(gdb)
IV. Summary and Reflection
1. Ideas for a fix
The gdb trace makes the cause clear: IsConstValue() has no case branch for T_FuncExpr, so a T_FuncExpr node falls straight through to default and no partition selection is performed.
Possible fixes might therefore be:
1) Add a T_FuncExpr case branch to IsConstValue() and implement partition selection for it.
2) Skimming the core PostgreSQL source shows it handles this in several places, for example by simplifying a T_FuncExpr into a simple_expr. Could pg_pathman do similar node-type handling and fold a T_FuncExpr into a T_Const?
I tested the latest pg_pathman release at the time, 1.5.11, and it has the same problem; PostgreSQL's native declarative partitioning, on the other hand, handles both forms correctly.
/* Can we transform this node into a Const? */
static bool
IsConstValue(Node *node, const WalkerContext *context)
{
switch (nodeTag(node))
{
case T_Const:
return true;
case T_Param:
return WcxtHasExprContext(context);
case T_RowExpr:
{
RowExpr *row = (RowExpr *) node;
ListCell *lc;
/* Can't do anything about RECORD of wrong type */
if (row->row_typeid != context->prel->ev_type)
return false;
/* Check that args are const values */
foreach (lc, row->args)
if (!IsConstValue((Node *) lfirst(lc), context))
return false;
}
return true;
default:
return false;
}
}
2. Is the fix necessary?
This is a matter of perspective. In my view the current behavior is somewhat unfriendly to application developers, who are limited to the ::date style of cast.
I consulted community experts and learned that native declarative partitioning in PostgreSQL 12 already performs on par with pg_pathman, practically neck and neck, so for version 12 and later, declarative partitioning can be preferred and the extension dropped. For versions between 9.5 and 11, however, native partitioning performance is poor and pg_pathman may still be needed.
I am a newcomer to this field, unfamiliar with the PostgreSQL code base and limited in coding ability, so these fix ideas are just my own small thoughts; the problem has been reported to our company's database expert team. If community veterans or experts already have a better solution, please share it, as a contribution to everyone and to the community.