PostgreSQL Operations Case Study: An Overlong CHECK Constraint Name Causes Query Failure

I. Introduction to pg_pathman

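Earlier community versions of PostgreSQL had weak support for partitioned tables: partitioning had to be built from inheritance combined with triggers or RULEs. Queries and updates involved constraint checks, and inserts went through trigger redirection or rule rewriting, so partitioning performed poorly. Postgres Professional developed the pg_pathman extension, which supports 9.5 and later versions. Unlike the traditional approach, pg_pathman keeps the partition definitions in a metadata table, caches that information in memory, and uses hooks to substitute relations, so it is very efficient.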

II. Problem Background

1. Version information
PostgreSQL version: 9.6.6
pg_pathman version: 1.4
2. Symptom
A query on a test database failed with the following error:

postgres=# select * from qsump_pacloud_oscginfo_activity_detail_info_day where id=10;
ERROR:  constraint "pathman_qsump_pacloud_oscginfo_activity_detail_info_day_10_check" of partition "qsump_pacloud_oscginfo_activity_detail_info_day_10" does not exist
HINT:  pg_pathman will be disabled to allow you to resolve this issue

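III. Problem Analysis
  1. Taken literally, the error says that the partition's constraint does not exist. Querying pg_constraint with the constraint name from the error message, however, does return a row: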
postgres=# select conname,contype,convalidated from pg_constraint where conname='pathman_qsump_pacloud_oscginfo_activity_detail_info_day_10_check';
-[ RECORD 1 ]+---------------------------------------------------------
conname      | pathman_qsump_pacloud_oscginfo_activity_detail_info_day_10_chec
contype      | c
convalidated | t	

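  2. Compared with the name in the error message, the conname in the query result is missing the final letter k of check. Why would that happen? The answer has to come from the source code.
  3. The function that raises the error is: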
get_partition_constraint_expr(Oid partition)
{
	Oid			conid;			/* constraint Oid */
	char	   *conname;		/* constraint name */
	HeapTuple	con_tuple;
	Datum		conbin_datum;
	bool		conbin_isnull;
	Expr	   *expr;			/* expression tree for constraint */
	conname = build_check_constraint_name_relid_internal(partition); // "assemble" conname
	conid = get_relation_constraint_oid(partition, conname, true); // look up conname in pg_constraint
	/* if it is not found, take the error branch */
	if (!OidIsValid(conid)) 
	{
		DisablePathman(); /* disable pg_pathman since config is broken */
		ereport(ERROR,
				(errmsg("constraint \"%s\" of partition \"%s\" does not exist",
						conname, get_rel_name_or_relid(partition)),
				 errhint(INIT_ERROR_HINT)));
	}
	/* remainder of the function omitted */
}

Clearly OidIsValid(conid) evaluated to false, which led into the error branch.
OidIsValid is defined as: #define OidIsValid(objectId) ((bool) ((objectId) != InvalidOid)), so conid must have been InvalidOid.

  4. conid is the return value of get_relation_constraint_oid(partition, conname, true), and conname is the return value of build_check_constraint_name_relid_internal(partition), so we need to examine the logic of both functions and the values involved.

First, conname:

char *
build_check_constraint_name_relid_internal(Oid relid)
{
	AssertArg(OidIsValid(relid));
	return build_check_constraint_name_relname_internal(get_rel_name(relid));
}

/*
 * Generate check constraint name for a partition.
 * NOTE: this function does not perform sanity checks at all.
 */
char *
build_check_constraint_name_relname_internal(const char *relname)
{
	/* assemble conname */
	return psprintf("pathman_%s_check", relname);
}

As shown, conname is assembled in build_check_constraint_name_relname_internal, with the partition's table name passed as %s. Judging from the error message, the assembled conname is "pathman_qsump_pacloud_oscginfo_activity_detail_info_day_10_check".

Next, conid:

Oid
get_relation_constraint_oid(Oid relid, const char *conname, bool missing_ok)
{
	Relation	pg_constraint;
	HeapTuple	tuple;
	SysScanDesc scan;
	ScanKeyData skey[1];
	Oid			conOid = InvalidOid;   // conOid starts out as InvalidOid

	/*
	 * Fetch the constraint tuple from pg_constraint.  There may be more than
	 * one match, because constraints are not required to have unique names;
	 * if so, error out.
	 */
	pg_constraint = heap_open(ConstraintRelationId, AccessShareLock);    

	ScanKeyInit(&skey[0],
				Anum_pg_constraint_conrelid,
				BTEqualStrategyNumber, F_OIDEQ,
				ObjectIdGetDatum(relid));

	scan = systable_beginscan(pg_constraint, ConstraintRelidIndexId, true,
							  NULL, 1, skey); // scan pg_constraint

	while (HeapTupleIsValid(tuple = systable_getnext(scan)))
	{
		Form_pg_constraint con = (Form_pg_constraint) GETSTRUCT(tuple);
		/* Compare each conname read from pg_constraint with the assembled conname passed in. */
		/* On a match the constraint exists, so take its OID as conOid. */
		if (strcmp(NameStr(con->conname), conname) == 0)
		{
			if (OidIsValid(conOid))
				ereport(ERROR,
						(errcode(ERRCODE_DUPLICATE_OBJECT),
				 errmsg("table \"%s\" has multiple constraints named \"%s\"",
						get_rel_name(relid), conname)));
			conOid = HeapTupleGetOid(tuple);
		}
	}

	systable_endscan(scan);
     
	/* If no such constraint exists, complain */
	/* If no conname matched, conOid still holds its initial value, InvalidOid */
	if (!OidIsValid(conOid) && !missing_ok)
		ereport(ERROR,
				(errcode(ERRCODE_UNDEFINED_OBJECT),
				 errmsg("constraint \"%s\" for table \"%s\" does not exist",
						conname, get_rel_name(relid))));

	heap_close(pg_constraint, AccessShareLock);

	return conOid;        // return conOid
}

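conOid starts out as InvalidOid. None of the conname values read from the pg_constraint catalog matched the conname passed in, so conOid was never assigned, the function returned InvalidOid, and that led to the error. As already seen, the conname found in pg_constraint is
pathman_qsump_pacloud_oscginfo_activity_detail_info_day_10_chec. So the conname stored in pg_constraint really does differ from the expected one, and that is what triggers the error.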

  5. It looks as if conname was "truncated" in pg_constraint; that is, the name must have exceeded the maximum length that the pg_constraint column allows.
NameData	conname;		/* name of this constraint */

typedef struct nameData
{
	char		data[NAMEDATALEN];
} NameData;
typedef NameData *Name;

#define NAMEDATALEN 64

postgres=# select char_length('pathman_qsump_pacloud_oscginfo_activity_detail_info_day_10_chec');
 char_length
-------------
          63
(1 row)

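The conname field is NAMEDATALEN = 64 bytes wide, so it can hold at most 63 characters plus the terminating '\0'. pathman_qsump_pacloud_oscginfo_activity_detail_info_day_10_chec is exactly 63 characters long, so it fills the field completely. This also explains why the SQL lookup in step 1 matched: a literal compared with a name column is coerced to type name and truncated in the same way.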

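  6. This first read-through of the code shows that the constraint name exceeded the maximum length of the column, so the stored name is incomplete and the lookup fails when the constraint is checked at query time. The GDB session at the end of this article confirms it.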

IV. Solution
1. Options considered

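1) The constraint name is too long because the table name is too long, so the table name could be shortened by abbreviating parts of it. This looks like the simplest and fastest fix, but the developers may not accept it: for the application as a whole it may be far more than renaming one table.
From the application's point of view, the database simply does not allow a long enough name, so it is worth asking whether the database side can be changed instead:
2) Raise the maximum name length from 64 to 128. However, the compile-time macro NAMEDATALEN governs the names of every table, function, trigger and other object, so changing it touches everything. The risk is too high; this option is not feasible.
3) Reading further in the code shows that the constraint name is generated when the partition is created, by the same build_check_constraint_name_relname_internal function, as pathman_%s_check with %s being the partition's table name. The words pathman and check alone take 12 characters. Keeping check for readability, pathman can be shortened to pm, which frees 5 characters. With the current table name that leaves room for a 6-digit partition number, i.e. up to 999,999 partitions, which is plenty.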

2. Implementation

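After discussion, given the risk of changing the database, we chose to shorten the table name after all: several fully spelled-out words were abbreviated and the partitions were rebuilt.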

3. Further testing

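As DBAs we still wanted to try the database-side changes ourselves and verify the other two options.
Option 2): after the change the query succeeded without error. For reasons of space the details are not covered here.
Option 3) is shown below.
Modify the code: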

/*
 * Generate check constraint name for a partition.
 * NOTE: this function does not perform sanity checks at all.
 */
char *
build_check_constraint_name_relname_internal(const char *relname)
{
        /* Modified Begin pathman to pm */
        return psprintf("pm_%s_check", relname);
        /* End 2020-01-10 */
}

Build:
cd $pg_pathmansrcdir
make && make install

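Log in to the database and run drop extension pg_pathman ;

Restart the database and run create extension pg_pathman ;

Create the partitioned table, insert a row, and query it; the query succeeds without error: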

postgres=# create table qsump_pacloud_oscginfo_activity_detail_info_day(id int primary key,add_time timestamp without time zone not null);
CREATE TABLE
postgres=# select create_range_partitions('qsump_pacloud_oscginfo_activity_detail_info_day'::REGCLASS,
postgres(#                         'add_time',
postgres(#                         '2020-01-01 00:00:00'::timestamp without time zone  ,
postgres(#                          interval  '7 days'      ,
postgres(#                         9,
postgres(#                         true  );
 create_range_partitions
-------------------------
                       9
(1 row)

postgres=# insert into qsump_pacloud_oscginfo_activity_detail_info_day values (10,'2020-03-12 00:00:00'::timestamp without time zone);
INSERT 0 1
postgres=# select * from qsump_pacloud_oscginfo_activity_detail_info_day where id=10;
 id |      add_time
----+---------------------
 10 | 2020-03-12 00:00:00
(1 row)

postgres=# \d+ qsump_pacloud_oscginfo_activity_detail_info_day_10
             Table "public.qsump_pacloud_oscginfo_activity_detail_info_day_10"
  Column  |            Type             | Modifiers | Storage | Stats target | Description
----------+-----------------------------+-----------+---------+--------------+-------------
 id       | integer                     | not null  | plain   |              |
 add_time | timestamp without time zone | not null  | plain   |              |
Indexes:
    "qsump_pacloud_oscginfo_activity_detail_info_day_10_pkey" PRIMARY KEY, btree (id)
Check constraints:
    "pm_qsump_pacloud_oscginfo_activity_detail_info_day_10_check" CHECK (add_time >= '2020-03-04 00:00:00'::timestamp without time zone AND add_time < '2020-03-11 00:00:00'::timestamp without time zone)
Inherits: qsump_pacloud_oscginfo_activity_detail_info_day

postgres=# select char_length('pm_qsump_pacloud_oscginfo_activity_detail_info_day_10_check');
 char_length
-------------
          59
(1 row)

postgres=#

V. GDB Debugging

Session 1:

Open a connection and look up its backend pid:

[postgres@postgres_zabbix ~]$ psql
psql (9.6.6)
Type "help" for help.
postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
           4240

Session 2:

The gdb debugging session:
[postgres@postgres_zabbix ~]$ gdb
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
(gdb) attach 4240                                         //attach session 1      
Attaching to process 4240
Reading symbols from /home/postgres/postgresql-9.6.6/pg9debug/bin/postgres...done.
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /home/postgres/postgresql-9.6.6/pg9debug/lib/pg_pathman.so...done.
Loaded symbols for /home/postgres/postgresql-9.6.6/pg9debug/lib/pg_pathman.so
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
0x00007f995cd6a5e3 in __epoll_wait_nocancel () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.6.x86_64
(gdb) b get_partition_constraint_expr                          // set a breakpoint
Breakpoint 1 at 0x7f99562fc3f0: file src/relation_info.c, line 1283.
(gdb) b get_relation_constraint_oid                            // set a breakpoint
Breakpoint 2 at 0x544991: file pg_constraint.c, line 771.
Session 1:
postgres=#  select * from qsump_pacloud_oscginfo_activity_detail_info_day where id=10;                                                   // run the query
Session 2:
(gdb) n                                                        // start stepping
// some steps omitted; start from the first partition
Breakpoint 1, get_partition_constraint_expr (partition=16589) at src/relation_info.c:1283
1283            conname = build_check_constraint_name_relid_internal(partition);
(gdb)
1284            conid = get_relation_constraint_oid(partition, conname, true);
(gdb)
Breakpoint 2, get_relation_constraint_oid (relid=16589,
    conname=0x13997e8 "pathman_qsump_pacloud_oscginfo_activity_detail_info_day_1_check", missing_ok=1 '\001') at pg_constraint.c:771                // the assembled conname is passed in
771             Oid                     conOid = InvalidOid;
(gdb)
778             pg_constraint = heap_open(ConstraintRelationId, AccessShareLock);
(gdb)
780             ScanKeyInit(&skey[0],
(gdb)
785             scan = systable_beginscan(pg_constraint, ConstraintRelidIndexId, true,
(gdb)
788             while (HeapTupleIsValid(tuple = systable_getnext(scan)))
(gdb)
790                     Form_pg_constraint con = (Form_pg_constraint) GETSTRUCT(tuple);             // fetch a row from pg_constraint
(gdb)
(gdb) n
792                     if (strcmp(NameStr(con->conname), conname) == 0)
(gdb) p *con
$2 = {conname = {data = "pathman_qsump_pacloud_oscginfo_activity_detail_info_day_1_check"}, connamespace = 2200, contype = 99 'c',  condeferrable = 0 '\000', condeferred = 0 '\000', convalidated = 1 '\001', conrelid = 16589, contypid = 0, conindid = 0,
  confrelid = 0, confupdtype = 32 ' ', confdeltype = 32 ' ', confmatchtype = 32 ' ', conislocal = 1 '\001', coninhcount = 0,
  connoinherit = 0 '\000'} // conname in pg_constraint matches the one assembled for the query

(gdb) n
794                             if (OidIsValid(conOid))
(gdb) n
799                             conOid = HeapTupleGetOid(tuple);
(gdb)
788             while (HeapTupleIsValid(tuple = systable_getnext(scan)))
(gdb)
790                     Form_pg_constraint con = (Form_pg_constraint) GETSTRUCT(tuple);
(gdb)
792                     if (strcmp(NameStr(con->conname), conname) == 0)
(gdb)
788             while (HeapTupleIsValid(tuple = systable_getnext(scan)))
(gdb)
803             systable_endscan(scan);
(gdb)
806             if (!OidIsValid(conOid) && !missing_ok)
(gdb)
812             heap_close(pg_constraint, AccessShareLock);
(gdb)
814             return conOid;
(gdb) p conOid
$5 = 16788                                      // the returned conOid is 16788
// some steps omitted; jump straight to the partition that fails
(gdb)
Breakpoint 1, get_partition_constraint_expr (partition=16643) at src/relation_info.c:1283
1283            conname = build_check_constraint_name_relid_internal(partition);
(gdb)
1284            conid = get_relation_constraint_oid(partition, conname, true);
(gdb)

Breakpoint 2, get_relation_constraint_oid (relid=16643,
    conname=0x13997e8 "pathman_qsump_pacloud_oscginfo_activity_detail_info_day_10_check", missing_ok=1 '\001') at pg_constraint.c:771                // the assembled conname is passed in
771             Oid                     conOid = InvalidOid;
(gdb)
778             pg_constraint = heap_open(ConstraintRelationId, AccessShareLock);
(gdb)
780             ScanKeyInit(&skey[0],
(gdb)
785             scan = systable_beginscan(pg_constraint, ConstraintRelidIndexId, true,
(gdb)
788             while (HeapTupleIsValid(tuple = systable_getnext(scan)))
(gdb)
790                     Form_pg_constraint con = (Form_pg_constraint) GETSTRUCT(tuple);                                 // read a row from pg_constraint
(gdb)
792                     if (strcmp(NameStr(con->conname), conname) == 0)
(gdb) p *con
$4 = {conname = {data = "pathman_qsump_pacloud_oscginfo_activity_detail_info_day_10_chec"}, connamespace = 2200, contype = 99 'c',  condeferrable = 0 '\000', condeferred = 0 '\000', convalidated = 1 '\001', conrelid = 16643, contypid = 0, conindid = 0,
  confrelid = 0, confupdtype = 32 ' ', confdeltype = 32 ' ', confmatchtype = 32 ' ', conislocal = 1 '\001', coninhcount = 0,
  connoinherit = 0 '\000'} // conname read from pg_constraint differs from the one assembled for the query
(gdb) n
788             while (HeapTupleIsValid(tuple = systable_getnext(scan)))
(gdb)
790                     Form_pg_constraint con = (Form_pg_constraint) GETSTRUCT(tuple);
(gdb)
792                     if (strcmp(NameStr(con->conname), conname) == 0)
(gdb)
788             while (HeapTupleIsValid(tuple = systable_getnext(scan)))
(gdb)
803             systable_endscan(scan);
(gdb)
806             if (!OidIsValid(conOid) && !missing_ok)
(gdb)
812             heap_close(pg_constraint, AccessShareLock);
(gdb)
814             return conOid;
(gdb) p conOid
$5 = 0                                     // so the returned conOid is 0, i.e. InvalidOid
(gdb) n
815     }
(gdb)

Session 1:

ERROR:  constraint "pathman_qsump_pacloud_oscginfo_activity_detail_info_day_10_check" of partition "qsump_pacloud_oscginfo_activity_detail_info_day_10" does not exist
HINT:  pg_pathman will be disabled to allow you to resolve this issue
postgres=#

At this point session 1 has reported the error. The gdb trace above confirms the earlier code analysis.

References:

https://github.com/digoal/blog/blob/d7336aeb9fc9cc82714189f16d67d22e47f9d369/201610/20161024_01.md
