使用repmgrd實現postgresql failover和auto failover

前面的文章介紹了postgresql基於repmgr的高可用及切換方案，這篇文章主要聊聊通過repmgrd實現failover及auto failover。

前提是部署好postgresql主從，同時部署好repmgr。

[postgres@node1 ~]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                            
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 3        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 3        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

failover

停止主庫，模擬主庫故障

[postgres@node1 ~]$ pg_ctl stop -D /pgdata/
waiting for server to shut down..... done
server stopped

備庫查看是unreachable狀態

[postgres@node2 .ssh]$ repmgr cluster show
 ID | Name  | Role    | Status        | Upstream | Location | Priority | Timeline | Connection string                                            
----+-------+---------+---------------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | primary | ? unreachable |          | default  | 100      | ?        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running     | ? node1  | default  | 100      | 3        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

備庫提升爲主庫

[postgres@node2 ~]$ repmgr standby promote
NOTICE: promoting standby to primary
DETAIL: promoting server "node2" (ID: 2) using "pg_ctl  -w -D '/pgdata' promote"
waiting for server to promote.... done
server promoted
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node2" (ID: 2) was successfully promoted to primary

新主庫查看集羣狀態

[postgres@node2 ~]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                            
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | primary | - failed  |          | default  | 100      | ?        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 4        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "node1" (ID: 1)

原主庫執行rejoin操作重新加入集羣

[postgres@node1 pgdata]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose --dry-run
[postgres@node1 pgdata]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose
INFO: looking for configuration file in /etc
INFO: configuration file found at: "/etc/repmgr.conf"
INFO: prerequisites for using pg_rewind are met
INFO: 2 files copied to "/tmp/repmgr-config-archive-node1"
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "pg_rewind -D '/pgdata' --source-server='host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2'"
NOTICE: 2 files copied to /pgdata
INFO: directory "/tmp/repmgr-config-archive-node1" deleted
INFO: deleting "recovery.done"
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "pg_ctl  -w -D '/pgdata' start"
INFO: demoted primary is pingable
INFO: node 1 has attached to its upstream node
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2

查看集羣狀態

[postgres@node1 pgdata]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                            
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | 100      | 3        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 4        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

auto failover

可以利用repmgrd進程實現自動的failover，首先要在repmgr.conf文件中將location參數設置爲一致，不設置的話默認也是一致的。同時啓動repmgrd必須在postgres.conf配置文件中設置shared_preload_libraries=‘repmgr’

修改主備庫repmgr.conf文件

failover=automatic 
promote_command='/pgsql/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'
follow_command='/pgsql/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'
log_file=/home/postgres/repmgrd.log
monitoring_history=true （啓用監控參數）                    
monitor_interval_secs=5（定義監視數據間隔寫入時間參數）
reconnect_attempts=10（故障轉移之前，嘗試重新連接主庫次數（默認爲6）參數）
reconnect_interval=5（每間隔5s嘗試重新連接一次參數）

重啓主備庫使修改生效

[postgres@node1 ~]$ repmgr node service --action=restart
DETAIL: executing server command "pg_ctl  -w -D '/pgdata' restart"

主備庫啓動repmgrd

[postgres@node1 ~]$ repmgrd –f /etc/repmgr.conf --pid-file /tmp/repmgrd.pid
[2019-09-20 11:51:23] [NOTICE] redirecting logging output to "/home/postgres/repmgrd.log"

模擬主庫故障

[postgres@node1 ~]$ pg_ctl stop -D /pgdata/
waiting for server to shut down..... done
server stopped

查看備庫日誌，發現已經升爲主庫

[2019-09-20 12:02:52] [NOTICE] promoting standby to primary
[2019-09-20 12:02:52] [DETAIL] promoting server "node2" (ID: 2) using "pg_ctl  -w -D '/pgdata' promote"
[2019-09-20 12:02:52] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2019-09-20 12:02:52] [NOTICE] STANDBY PROMOTE successful
[2019-09-20 12:02:52] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary
[2019-09-20 12:02:52] [INFO] 0 followers to notify
[2019-09-20 12:02:52] [INFO] switching to primary monitoring mode
[2019-09-20 12:02:52] [NOTICE] monitoring cluster primary "node2" (ID: 2)

查看cluster狀態，備庫已經升主

[postgres@node2 ~]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                            
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | primary | - failed  |          | default  | 100      | ?        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 5        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "node1" (ID: 1)

原主庫執行rejoin加入集羣

[postgres@node1 ~]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose --dry-run
[postgres@node1 ~]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose
INFO: looking for configuration file in /etc
INFO: configuration file found at: "/etc/repmgr.conf"
INFO: prerequisites for using pg_rewind are met
INFO: 2 files copied to "/tmp/repmgr-config-archive-node1"
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "pg_rewind -D '/pgdata' --source-server='host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2'"
NOTICE: 2 files copied to /pgdata
INFO: directory "/tmp/repmgr-config-archive-node1" deleted
INFO: deleting "recovery.done"
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "pg_ctl  -w -D '/pgdata' start"
INFO: demoted primary is pingable
INFO: node 1 has attached to its upstream node
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2

查看集羣狀態

[postgres@node1 ~]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                            
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | 100      | 5        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 6        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

歡迎關注我的公衆號：數據庫架構之美

使用repmgrd實現postgresql failover和auto failover

.Net 8.0 下的新RPC，IceRPC之試試的新玩法"打洞"

完美替代postman的軟件

Vue mockjs mock.js

關於遊戲付費的一點想法

我通過CKA和CKS啦！

《最新出爐》系列入門篇-Python+Playwright自動化測試-42-強大的可視化追蹤利器Trace Viewer

大數據怎麼學？對大數據開發領域及崗位的詳細解讀，完整理解大數據開發領域技術體系

使用repmgrd實現postgresql failover和auto failover

全面講解分佈式數據庫架構設計特點

華爲GaussDB數據庫相比PostgreSQL做了哪些內核優化

基於repmgr的postgresql主備高可用方案

PostgreSQL數據庫xlog文件命名

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結