MHA: the impact of a single slave going down

Environment

IP             Role          Notes       mha4mysql-node  mha4mysql-manager
192.168.98.11  master        read-write  √
192.168.98.10  slave         read-only   √
192.168.98.12  slave         read-only   √
192.168.98.13  manager node  N/A         √               √

A node is down before the manager starts

Manually stop mysqld on one slave (192.168.98.10), then try to start masterha_manager:

/usr/local/bin/masterha_manager --global_conf=/etc/masterha/conf/masterha_default.cnf --conf=/etc/masterha/conf/cls_all.cnf

Startup fails, and the log contains the following:

Fri Feb 28 14:47:58 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Feb 28 14:47:59 2020 - [info] GTID failover mode = 1
Fri Feb 28 14:47:59 2020 - [info] Dead Servers:
Fri Feb 28 14:47:59 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 14:47:59 2020 - [info] Alive Servers:
Fri Feb 28 14:47:59 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 14:47:59 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 14:47:59 2020 - [info] Alive Slaves:
Fri Feb 28 14:47:59 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 14:47:59 2020 - [info]     GTID ON
Fri Feb 28 14:47:59 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 14:47:59 2020 - [info]     Not candidate for the new Master (no_master is set)
Fri Feb 28 14:47:59 2020 - [info] Current Alive Master: 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 14:47:59 2020 - [info] Checking slave configurations..
Fri Feb 28 14:47:59 2020 - [info] Checking replication filtering settings..
Fri Feb 28 14:47:59 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Feb 28 14:47:59 2020 - [info]  Replication filtering check ok.
Fri Feb 28 14:47:59 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln364] None of slaves can be master. Check failover configuration file or log-bin settings in my.cnf
Fri Feb 28 14:47:59 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations.  at /usr/local/bin/masterha_manager line 50.
Fri Feb 28 14:47:59 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.
Fri Feb 28 14:47:59 2020 - [info] Got exit code 1 (Not master dead).

You should first check the replication status with masterha_check_repl:

#masterha_check_repl --conf=/etc/masterha/conf/cls_all.cnf --global_conf=/etc/masterha/conf/masterha_default.cnf
Fri Feb 28 15:27:24 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Fri Feb 28 15:27:24 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 15:27:24 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 15:27:24 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Feb 28 15:27:25 2020 - [info] GTID failover mode = 1
Fri Feb 28 15:27:25 2020 - [info] Dead Servers:
Fri Feb 28 15:27:25 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 15:27:25 2020 - [info] Alive Servers:
Fri Feb 28 15:27:25 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:27:25 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 15:27:25 2020 - [info] Alive Slaves:
Fri Feb 28 15:27:25 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 15:27:25 2020 - [info]     GTID ON
Fri Feb 28 15:27:25 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:27:25 2020 - [info]     Not candidate for the new Master (no_master is set)
Fri Feb 28 15:27:25 2020 - [info] Current Alive Master: 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:27:25 2020 - [info] Checking slave configurations..
Fri Feb 28 15:27:25 2020 - [info] Checking replication filtering settings..
Fri Feb 28 15:27:25 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Feb 28 15:27:25 2020 - [info]  Replication filtering check ok.
Fri Feb 28 15:27:25 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln364] None of slaves can be master. Check failover configuration file or log-bin settings in my.cnf
Fri Feb 28 15:27:25 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations.  at /usr/local/bin/masterha_check_repl line 48.
Fri Feb 28 15:27:25 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.
Fri Feb 28 15:27:25 2020 - [info] Got exit code 1 (Not master dead).

MySQL Replication Health is NOT OK!

From the documentation at https://github.com/yoshinorim/mha4mysql-manager/wiki/masterha_manager:

--ignore_fail_on_start

By default, master monitoring (not failover) process stops if one or more slaves are down, regardless of “ignore_fail” parameter setting. By setting --ignore_fail_on_start, master monitoring does not stop if ignore_fail marked slaves are down.

In other words, if ignore_fail=1 is set for 192.168.98.10 in the configuration file, adding --ignore_fail_on_start lets masterha_manager start. If ignore_fail=1 is not set in the configuration file, masterha_manager cannot start even when --ignore_fail_on_start is given.
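Before relying on --ignore_fail_on_start, it can help to list which servers actually carry ignore_fail=1. The following is a minimal sketch, assuming the MHA configuration file is simple enough to be read by Python's configparser; the function name servers_with_ignore_fail and the abbreviated SAMPLE_CNF text are illustrative only, and the sketch looks only at the per-server blocks.

```python
import configparser

# Abbreviated copy of the [serverN] blocks used in this article. In practice
# the text would be read from /etc/masterha/conf/cls_all.cnf.
SAMPLE_CNF = """
[server default]
manager_workdir=/masterha/cls_all/

[server1]
hostname=192.168.98.10
candidate_master=1
ignore_fail=1

[server2]
hostname=192.168.98.11
candidate_master=1

[server3]
hostname=192.168.98.12
# no_master=1
"""


def servers_with_ignore_fail(cnf_text):
    """Return hostnames of per-server blocks that set ignore_fail=1."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cnf_text)
    hosts = []
    for section in parser.sections():
        if section == "server default":
            continue  # only per-server blocks are inspected in this sketch
        if parser.get(section, "ignore_fail", fallback="0").strip() == "1":
            hosts.append(parser.get(section, "hostname"))
    return hosts


print(servers_with_ignore_fail(SAMPLE_CNF))
```

Any dead slave that is missing from the returned list would, per the documentation quoted above, still stop monitoring at startup.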

With ignore_fail=1

#cat /etc/masterha/conf/cls_all.cnf 
[server default]
#workdir on the management server
manager_workdir=/masterha/cls_all/
manager_log=/masterha/cls_all/manager.log

#workdir on the node for mysql server
master_binlog_dir=/data/mysql_3306/data/

#script called to move the VIP during automatic failover
master_ip_failover_script=/etc/masterha/scripts/master_ip_failover_vip --vip=192.168.98.100

#script called during a manual (online) master switch
master_ip_online_change_script=/etc/masterha/scripts/master_ip_online_change_vip --vip=192.168.98.100

#secondary check of master availability
secondary_check_script=masterha_secondary_check -s 192.168.98.11 -s 192.168.98.12

[server1]
hostname=192.168.98.10
candidate_master=1
ignore_fail=1

[server2]
hostname=192.168.98.11
candidate_master=1

[server3]
hostname=192.168.98.12
# no_master=1

Startup succeeds

/usr/local/bin/masterha_manager --global_conf=/etc/masterha/conf/masterha_default.cnf --conf=/etc/masterha/conf/cls_all.cnf --ignore_fail_on_start

Fri Feb 28 15:59:37 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Feb 28 15:59:38 2020 - [info] GTID failover mode = 1
Fri Feb 28 15:59:38 2020 - [info] Dead Servers:
Fri Feb 28 15:59:38 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 15:59:38 2020 - [info] Alive Servers:
Fri Feb 28 15:59:38 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:59:38 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 15:59:38 2020 - [info] Alive Slaves:
Fri Feb 28 15:59:38 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 15:59:38 2020 - [info]     GTID ON
Fri Feb 28 15:59:38 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:59:38 2020 - [info] Current Alive Master: 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:59:38 2020 - [info] Checking slave configurations..
Fri Feb 28 15:59:38 2020 - [info] Checking replication filtering settings..
Fri Feb 28 15:59:38 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Feb 28 15:59:38 2020 - [info]  Replication filtering check ok.
Fri Feb 28 15:59:38 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Feb 28 15:59:38 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Feb 28 15:59:39 2020 - [info] HealthCheck: SSH to 192.168.98.11 is reachable.
Fri Feb 28 15:59:39 2020 - [info] 
192.168.98.11(192.168.98.11:3306) (current master)
 +--192.168.98.12(192.168.98.12:3306)

Fri Feb 28 15:59:39 2020 - [info] Checking master_ip_failover_script status:
Fri Feb 28 15:59:39 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=192.168.98.100 --command=status --ssh_user=root --orig_master_host=192.168.98.11 --orig_master_ip=192.168.98.11 --orig_master_port=3306 
Fri Feb 28 15:59:39 2020 - [info]  OK.
Fri Feb 28 15:59:39 2020 - [warning] shutdown_script is not defined.
Fri Feb 28 15:59:39 2020 - [info] Set master ping interval 3 seconds.
Fri Feb 28 15:59:39 2020 - [info] Set secondary check script: masterha_secondary_check -s 192.168.98.11 -s 192.168.98.12
Fri Feb 28 15:59:39 2020 - [info] Starting ping health check on 192.168.98.11(192.168.98.11:3306)..
Fri Feb 28 15:59:39 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..

Without ignore_fail=1

#cat /etc/masterha/conf/cls_all.cnf 
...
[server1]
hostname=192.168.98.10
candidate_master=1
# ignore_fail=1

[server2]
hostname=192.168.98.11
candidate_master=1

[server3]
hostname=192.168.98.12
# no_master=1

Startup fails

/usr/local/bin/masterha_manager --global_conf=/etc/masterha/conf/masterha_default.cnf --conf=/etc/masterha/conf/cls_all.cnf --ignore_fail_on_start

Fri Feb 28 15:58:57 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Feb 28 15:58:58 2020 - [info] GTID failover mode = 1
Fri Feb 28 15:58:58 2020 - [info] Dead Servers:
Fri Feb 28 15:58:58 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 15:58:58 2020 - [info] Alive Servers:
Fri Feb 28 15:58:58 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:58:58 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 15:58:58 2020 - [info] Alive Slaves:
Fri Feb 28 15:58:58 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 15:58:58 2020 - [info]     GTID ON
Fri Feb 28 15:58:58 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:58:58 2020 - [info] Current Alive Master: 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:58:58 2020 - [info] Checking slave configurations..
Fri Feb 28 15:58:58 2020 - [info] Checking replication filtering settings..
Fri Feb 28 15:58:58 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Feb 28 15:58:58 2020 - [info]  Replication filtering check ok.
Fri Feb 28 15:58:58 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Feb 28 15:58:58 2020 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln492]  Server 192.168.98.10(192.168.98.10:3306) is dead, but must be alive! Check server settings.
Fri Feb 28 15:58:58 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations.  at /usr/local/share/perl5/MHA/MasterMonitor.pm line 402.
Fri Feb 28 15:58:58 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.
Fri Feb 28 15:58:58 2020 - [info] Got exit code 1 (Not master dead).

In addition, even with ignore_fail=1 set, startup still fails if the only remaining slave, 192.168.98.12, has no_master=1:

#cat /etc/masterha/conf/cls_all.cnf 
...
[server1]
hostname=192.168.98.10
candidate_master=1
ignore_fail=1

[server2]
hostname=192.168.98.11
candidate_master=1

[server3]
hostname=192.168.98.12
no_master=1

None of slaves can be master

/usr/local/bin/masterha_manager --global_conf=/etc/masterha/conf/masterha_default.cnf --conf=/etc/masterha/conf/cls_all.cnf --ignore_fail_on_start


Fri Feb 28 15:55:14 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Feb 28 15:55:16 2020 - [info] GTID failover mode = 1
Fri Feb 28 15:55:16 2020 - [info] Dead Servers:
Fri Feb 28 15:55:16 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 15:55:16 2020 - [info] Alive Servers:
Fri Feb 28 15:55:16 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:55:16 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 15:55:16 2020 - [info] Alive Slaves:
Fri Feb 28 15:55:16 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 15:55:16 2020 - [info]     GTID ON
Fri Feb 28 15:55:16 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:55:16 2020 - [info]     Not candidate for the new Master (no_master is set)
Fri Feb 28 15:55:16 2020 - [info] Current Alive Master: 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:55:16 2020 - [info] Checking slave configurations..
Fri Feb 28 15:55:16 2020 - [info] Checking replication filtering settings..
Fri Feb 28 15:55:16 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Feb 28 15:55:16 2020 - [info]  Replication filtering check ok.
Fri Feb 28 15:55:16 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln364] None of slaves can be master. Check failover configuration file or log-bin settings in my.cnf
Fri Feb 28 15:55:16 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations.  at /usr/local/bin/masterha_manager line 50.
Fri Feb 28 15:55:16 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.
Fri Feb 28 15:55:16 2020 - [info] Got exit code 1 (Not master dead).
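The startup attempts above can be summarized as a small decision function. This is only a model of the behavior observed in these logs, not MHA's actual code; the Server class and the startup_outcome function are hypothetical names introduced for illustration, and the order of the two checks follows the order in which the errors appear in the logs.

```python
from dataclasses import dataclass


@dataclass
class Server:
    host: str
    alive: bool
    is_master: bool = False
    ignore_fail: bool = False
    no_master: bool = False


def startup_outcome(servers, ignore_fail_on_start):
    """Model the startup results observed in this article (not MHA's code)."""
    slaves = [s for s in servers if not s.is_master]
    # At least one alive slave must be allowed to become the new master.
    if not any(s.alive and not s.no_master for s in slaves):
        return "None of slaves can be master"
    # A dead slave is tolerated only if it is marked ignore_fail AND the
    # --ignore_fail_on_start flag was passed on the command line.
    for s in slaves:
        if not s.alive and not (s.ignore_fail and ignore_fail_on_start):
            return f"Server {s.host} is dead, but must be alive"
    return "monitoring starts"


def cluster(ignore_fail_on_10, no_master_on_12):
    """Build the three-node topology from this article with .10 down."""
    return [
        Server("192.168.98.11", alive=True, is_master=True),
        Server("192.168.98.10", alive=False, ignore_fail=ignore_fail_on_10),
        Server("192.168.98.12", alive=True, no_master=no_master_on_12),
    ]


print(startup_outcome(cluster(True, False), ignore_fail_on_start=True))
print(startup_outcome(cluster(False, False), ignore_fail_on_start=True))
print(startup_outcome(cluster(True, True), ignore_fail_on_start=True))
```

The three calls correspond, in order, to the successful start, the start without ignore_fail=1, and the start where the only alive slave has no_master=1.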

A node goes down while the manager is running

If a slave goes down while masterha_manager is running, masterha_manager appears not to notice: the masterha_manager process does not exit, and the log shows no error.

masterha_check_status still reports a normal state:

#masterha_check_status --conf=/etc/masterha/conf/cls_all.cnf --global_conf=/etc/masterha/conf/masterha_default.cnf
cls_all (pid:88464) is running(0:PING_OK), master:192.168.98.11
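Because this status line stays at PING_OK even with a slave down, a monitoring script that reads it should treat it as a statement about the master only. Below is a rough sketch that parses the one line format shown above; the regular expression is derived from this single sample, so other output formats (for example, a stopped manager) are an open assumption and simply yield None here.

```python
import re

# Pattern derived from the single status line shown in this article.
STATUS_RE = re.compile(
    r"^(?P<app>\S+) \(pid:(?P<pid>\d+)\) is running"
    r"\((?P<code>\d+):(?P<state>\w+)\), master:(?P<master>\S+)$"
)


def parse_check_status(line):
    """Return the fields of a 'is running' status line, or None if no match."""
    match = STATUS_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["code"] = int(fields["code"])
    return fields


status = parse_check_status(
    "cls_all (pid:88464) is running(0:PING_OK), master:192.168.98.11"
)
print(status)
```

The parsed result says nothing about slave health, so it would need to be combined with a separate per-slave check.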

However, a manual switch fails:

#/usr/local/bin/masterha_master_switch --global_conf=/etc/masterha/conf/masterha_default.cnf --conf=/etc/masterha/conf/cls_all.cnf --master_state=alive --new_master_host=192.168.98.12 --new_master_port=3306 --orig_master_is_new_slave --interactive=0
Fri Feb 28 15:33:34 2020 - [info] MHA::MasterRotate version 0.58.
Fri Feb 28 15:33:34 2020 - [info] Starting online master switch..
Fri Feb 28 15:33:34 2020 - [info] 
Fri Feb 28 15:33:34 2020 - [info] * Phase 1: Configuration Check Phase..
Fri Feb 28 15:33:34 2020 - [info] 
Fri Feb 28 15:33:34 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Fri Feb 28 15:33:34 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 15:33:34 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 15:33:35 2020 - [info] GTID failover mode = 1
Fri Feb 28 15:33:35 2020 - [error][/usr/local/share/perl5/MHA/MasterRotate.pm, ln94] Switching master should not be started if one or more servers is down.
Fri Feb 28 15:33:35 2020 - [info] Dead Servers:
Fri Feb 28 15:33:35 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 15:33:35 2020 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /usr/local/bin/masterha_master_switch line 53.

The "Dead Servers:" section lists the problematic servers.
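Since the manager keeps running silently when a slave dies, one way to notice the problem is to pull the "Dead Servers:" entries out of whatever MHA output is available. The sketch below assumes the line layout seen in the logs in this article ("<timestamp> - [info] <text>", with indented host(ip:port) entries); the function name dead_servers is made up for illustration.

```python
import re

# "<timestamp> - [info] <text>", as in the log lines shown in this article.
INFO_RE = re.compile(r"- \[info\] (.*)$")
# Indented "host(ip:port)" entry that follows a "Dead Servers:" header.
HOST_RE = re.compile(r"^\s+(\S+)\((\S+):(\d+)\)")


def dead_servers(log_text):
    """Return unique 'ip:port' strings listed under 'Dead Servers:' headers."""
    dead = []
    in_section = False
    for line in log_text.splitlines():
        info = INFO_RE.search(line)
        if info is None:
            in_section = False  # a non-[info] line ends the section
            continue
        text = info.group(1)
        if text.strip() == "Dead Servers:":
            in_section = True
            continue
        if in_section:
            host = HOST_RE.match(text)
            if host is None:
                in_section = False  # next header, e.g. "Alive Servers:"
                continue
            entry = f"{host.group(2)}:{host.group(3)}"
            if entry not in dead:
                dead.append(entry)
    return dead


SAMPLE_LOG = """\
Fri Feb 28 15:33:35 2020 - [info] GTID failover mode = 1
Fri Feb 28 15:33:35 2020 - [info] Dead Servers:
Fri Feb 28 15:33:35 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 15:33:35 2020 - [info] Alive Servers:
Fri Feb 28 15:33:35 2020 - [info]   192.168.98.11(192.168.98.11:3306)
"""
print(dead_servers(SAMPLE_LOG))
```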

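If the master, 192.168.98.11, dies before 192.168.98.10 has been repaired, and 192.168.98.12 has no_master set, automatic failover fails because no server is available to become the new master.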

#cat /etc/masterha/conf/cls_all.cnf 
...
[server1]
hostname=192.168.98.10
candidate_master=1
ignore_fail=1

[server2]
hostname=192.168.98.11
candidate_master=1

[server3]
hostname=192.168.98.12
no_master=1

Shut down 192.168.98.11:

Fri Feb 28 15:35:38 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=192.168.98.11;port=3306;mysql_connect_timeout=1','mha',...) failed: Can't connect to MySQL server on '192.168.98.11' (111) at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 15:35:38 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 192.168.98.11 -s 192.168.98.12  --user=root  --master_host=192.168.98.11  --master_ip=192.168.98.11  --master_port=3306 --master_user=mha --master_password=mha --ping_type=CONNECT
Fri Feb 28 15:35:38 2020 - [info] Executing SSH check script: exit 0
Fri Feb 28 15:35:39 2020 - [info] HealthCheck: SSH to 192.168.98.11 is reachable.
Monitoring server 192.168.98.11 is reachable, Master is not reachable from 192.168.98.11. OK.
Monitoring server 192.168.98.12 is reachable, Master is not reachable from 192.168.98.12. OK.
Fri Feb 28 15:35:40 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Fri Feb 28 15:35:41 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 15:35:41 2020 - [warning] Connection failed 2 time(s)..
Fri Feb 28 15:35:44 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 15:35:44 2020 - [warning] Connection failed 3 time(s)..
Fri Feb 28 15:35:47 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 15:35:47 2020 - [warning] Connection failed 4 time(s)..
Fri Feb 28 15:35:47 2020 - [warning] Master is not reachable from health checker!
Fri Feb 28 15:35:47 2020 - [warning] Master 192.168.98.11(192.168.98.11:3306) is not reachable!
Fri Feb 28 15:35:47 2020 - [warning] SSH is reachable.
Fri Feb 28 15:35:47 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/conf/masterha_default.cnf and /etc/masterha/conf/cls_all.cnf again, and trying to connect to all servers to check server status..
Fri Feb 28 15:35:47 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Fri Feb 28 15:35:47 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 15:35:47 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 15:35:48 2020 - [info] GTID failover mode = 1
Fri Feb 28 15:35:48 2020 - [info] Dead Servers:
Fri Feb 28 15:35:48 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 15:35:48 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:35:48 2020 - [info] Alive Servers:
Fri Feb 28 15:35:48 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 15:35:48 2020 - [info] Alive Slaves:
Fri Feb 28 15:35:48 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 15:35:48 2020 - [info]     GTID ON
Fri Feb 28 15:35:48 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:35:48 2020 - [info]     Not candidate for the new Master (no_master is set)
Fri Feb 28 15:35:48 2020 - [info] Checking slave configurations..
Fri Feb 28 15:35:48 2020 - [info] Checking replication filtering settings..
Fri Feb 28 15:35:48 2020 - [info]  Replication filtering check ok.
Fri Feb 28 15:35:48 2020 - [info] Master is down!
Fri Feb 28 15:35:48 2020 - [info] Terminating monitoring script.
Fri Feb 28 15:35:48 2020 - [info] Got exit code 20 (Master dead).
Fri Feb 28 15:35:48 2020 - [info] MHA::MasterFailover version 0.58.
Fri Feb 28 15:35:48 2020 - [info] Starting master failover.
Fri Feb 28 15:35:48 2020 - [info] 
Fri Feb 28 15:35:48 2020 - [info] * Phase 1: Configuration Check Phase..
Fri Feb 28 15:35:48 2020 - [info] 
Fri Feb 28 15:35:49 2020 - [info] GTID failover mode = 1
Fri Feb 28 15:35:49 2020 - [info] Dead Servers:
Fri Feb 28 15:35:49 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 15:35:49 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:35:49 2020 - [info] Checking master reachability via MySQL(double check)...
Fri Feb 28 15:35:49 2020 - [info]  ok.
Fri Feb 28 15:35:49 2020 - [info] Alive Servers:
Fri Feb 28 15:35:49 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 15:35:49 2020 - [info] Alive Slaves:
Fri Feb 28 15:35:49 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 15:35:49 2020 - [info]     GTID ON
Fri Feb 28 15:35:49 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:35:49 2020 - [info]     Not candidate for the new Master (no_master is set)
Fri Feb 28 15:35:49 2020 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln492]  Server 192.168.98.10(192.168.98.10:3306) is dead, but must be alive! Check server settings.
Fri Feb 28 15:35:49 2020 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /usr/local/share/perl5/MHA/MasterFailover.pm line 269.

The key lines are:

Not candidate for the new Master (no_master is set)

Server 192.168.98.10(192.168.98.10:3306) is dead, but must be alive! Check server settings

The VIP is still on the original master, 192.168.98.11:

root@localhost 14:40:38 [(none)]> \! ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:98:28:0b brd ff:ff:ff:ff:ff:ff
    inet 192.168.98.11/24 brd 192.168.98.255 scope global ens33
       valid_lft forever preferred_lft forever
    inet 192.168.98.100/24 scope global secondary ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::cd5b:e71c:7a67:b391/64 scope link 
       valid_lft forever preferred_lft forever
root@localhost 15:35:04 [(none)]> shutdown;
Query OK, 0 rows affected (0.00 sec)

root@localhost 15:35:37 [(none)]> 2020-02-28T07:35:50.083534Z mysqld_safe mysqld from pid file /data/mysql_3306/run/mysql.pid ended

root@localhost 15:36:40 [(none)]> \! ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:98:28:0b brd ff:ff:ff:ff:ff:ff
    inet 192.168.98.11/24 brd 192.168.98.255 scope global ens33
       valid_lft forever preferred_lft forever
    inet 192.168.98.100/24 scope global secondary ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::cd5b:e71c:7a67:b391/64 scope link 
       valid_lft forever preferred_lft forever

192.168.98.12 is still a slave and does not hold the VIP:

root@localhost 15:35:32 [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Reconnecting after a failed master event read
                  Master_Host: 192.168.98.11
                  Master_User: repler
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000001
          Read_Master_Log_Pos: 2496
               Relay_Log_File: mysql-relay-bin.000002
                Relay_Log_Pos: 1354
        Relay_Master_Log_File: mysql-bin.000001
             Slave_IO_Running: Connecting
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 2496
              Relay_Log_Space: 1561
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 2003
                Last_IO_Error: error reconnecting to master 'repler@192.168.98.11:3306' - retry-time: 60  retries: 1
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 98113306
                  Master_UUID: 68703597-592c-11ea-88b3-000c2998280b
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 200228 15:35:45
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 68703597-592c-11ea-88b3-000c2998280b:1-4
            Executed_Gtid_Set: 3a60f8c7-592c-11ea-8cb1-000c2973aaf0:1-6,
68703597-592c-11ea-88b3-000c2998280b:1-4
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

root@localhost 15:36:32 [(none)]> \! ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:96:c2:3a brd ff:ff:ff:ff:ff:ff
    inet 192.168.98.12/24 brd 192.168.98.255 scope global ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::ef03:3251:b4ed:204c/64 scope link 
       valid_lft forever preferred_lft forever
root@localhost 15:36:37 [(none)]>
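The SHOW SLAVE STATUS output above can also be checked by a script rather than by eye. The following is a minimal sketch that parses the vertical (\G) output into a dictionary and flags the fields that indicate trouble in this scenario; the function names and the choice of which fields to inspect are assumptions made for illustration.

```python
import re

# "   Field_Name: value" lines of the vertical (\G) output. Prompt lines and
# the wrapped second line of Executed_Gtid_Set do not match this pattern.
FIELD_RE = re.compile(r"^\s*([A-Za-z_]+):\s?(.*)$")


def parse_slave_status(text):
    """Turn SHOW SLAVE STATUS\\G text into a {field: value} dictionary."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def replication_problems(fields):
    """List the fields that suggest replication is not healthy."""
    problems = []
    for thread in ("Slave_IO_Running", "Slave_SQL_Running"):
        state = fields.get(thread)
        if state != "Yes":
            problems.append(f"{thread}={state}")
    if fields.get("Last_IO_Errno", "0") != "0":
        problems.append(f"Last_IO_Errno={fields['Last_IO_Errno']}")
    return problems


SAMPLE = """\
             Slave_IO_Running: Connecting
            Slave_SQL_Running: Yes
                Last_IO_Errno: 2003
"""
print(replication_problems(parse_slave_status(SAMPLE)))
```

For the output shown above, the IO thread state (Connecting) and error 2003 are the fields that reveal that the slave is still pointed at the dead master.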

If a candidate master exists, that is, 192.168.98.12 does not have no_master=1, automatic failover succeeds:

Fri Feb 28 16:16:27 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=192.168.98.11;port=3306;mysql_connect_timeout=1','mha',...) failed: Can't connect to MySQL server on '192.168.98.11' (111) at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 16:16:27 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 192.168.98.11 -s 192.168.98.12  --user=root  --master_host=192.168.98.11  --master_ip=192.168.98.11  --master_port=3306 --master_user=mha --master_password=mha --ping_type=CONNECT
Fri Feb 28 16:16:27 2020 - [info] Executing SSH check script: exit 0
Fri Feb 28 16:16:28 2020 - [info] HealthCheck: SSH to 192.168.98.11 is reachable.
Monitoring server 192.168.98.11 is reachable, Master is not reachable from 192.168.98.11. OK.
Monitoring server 192.168.98.12 is reachable, Master is not reachable from 192.168.98.12. OK.
Fri Feb 28 16:16:28 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Fri Feb 28 16:16:30 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 16:16:30 2020 - [warning] Connection failed 2 time(s)..
Fri Feb 28 16:16:33 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 16:16:33 2020 - [warning] Connection failed 3 time(s)..
Fri Feb 28 16:16:36 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 16:16:36 2020 - [warning] Connection failed 4 time(s)..
Fri Feb 28 16:16:36 2020 - [warning] Master is not reachable from health checker!
Fri Feb 28 16:16:36 2020 - [warning] Master 192.168.98.11(192.168.98.11:3306) is not reachable!
Fri Feb 28 16:16:36 2020 - [warning] SSH is reachable.
Fri Feb 28 16:16:36 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/conf/masterha_default.cnf and /etc/masterha/conf/cls_all.cnf again, and trying to connect to all servers to check server status..
Fri Feb 28 16:16:36 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Fri Feb 28 16:16:36 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 16:16:36 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 16:16:37 2020 - [info] GTID failover mode = 1
Fri Feb 28 16:16:37 2020 - [info] Dead Servers:
Fri Feb 28 16:16:37 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 16:16:37 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 16:16:37 2020 - [info] Alive Servers:
Fri Feb 28 16:16:37 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 16:16:37 2020 - [info] Alive Slaves:
Fri Feb 28 16:16:37 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 16:16:37 2020 - [info]     GTID ON
Fri Feb 28 16:16:37 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 16:16:37 2020 - [info] Checking slave configurations..
Fri Feb 28 16:16:37 2020 - [info] Checking replication filtering settings..
Fri Feb 28 16:16:37 2020 - [info]  Replication filtering check ok.
Fri Feb 28 16:16:37 2020 - [info] Master is down!
Fri Feb 28 16:16:37 2020 - [info] Terminating monitoring script.
Fri Feb 28 16:16:37 2020 - [info] Got exit code 20 (Master dead).
Fri Feb 28 16:16:37 2020 - [info] MHA::MasterFailover version 0.58.
Fri Feb 28 16:16:37 2020 - [info] Starting master failover.
Fri Feb 28 16:16:37 2020 - [info] 
Fri Feb 28 16:16:37 2020 - [info] * Phase 1: Configuration Check Phase..
Fri Feb 28 16:16:37 2020 - [info] 
Fri Feb 28 16:16:38 2020 - [info] GTID failover mode = 1
Fri Feb 28 16:16:38 2020 - [info] Dead Servers:
Fri Feb 28 16:16:38 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 16:16:38 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 16:16:38 2020 - [info] Checking master reachability via MySQL(double check)...
Fri Feb 28 16:16:38 2020 - [info]  ok.
Fri Feb 28 16:16:38 2020 - [info] Alive Servers:
Fri Feb 28 16:16:38 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 16:16:38 2020 - [info] Alive Slaves:
Fri Feb 28 16:16:38 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 16:16:38 2020 - [info]     GTID ON
Fri Feb 28 16:16:38 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 16:16:38 2020 - [info] Starting GTID based failover.
Fri Feb 28 16:16:38 2020 - [info] 
Fri Feb 28 16:16:38 2020 - [info] ** Phase 1: Configuration Check Phase completed.
Fri Feb 28 16:16:38 2020 - [info] 
Fri Feb 28 16:16:38 2020 - [info] * Phase 2: Dead Master Shutdown Phase..
Fri Feb 28 16:16:38 2020 - [info] 
Fri Feb 28 16:16:38 2020 - [info] Forcing shutdown so that applications never connect to the current master..
Fri Feb 28 16:16:38 2020 - [info] Executing master IP deactivation script:
Fri Feb 28 16:16:38 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=192.168.98.100 --orig_master_host=192.168.98.11 --orig_master_ip=192.168.98.11 --orig_master_port=3306 --command=stopssh --ssh_user=root  
Disabling the VIP on old master: 192.168.98.11 
Fri Feb 28 16:16:39 2020 - [info]  done.
Fri Feb 28 16:16:39 2020 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Fri Feb 28 16:16:39 2020 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] * Phase 3: Master Recovery Phase..
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] The latest binary log file/position on all slaves is mysql-bin.000002:234
Fri Feb 28 16:16:39 2020 - [info] Retrieved Gtid Set: 68703597-592c-11ea-88b3-000c2998280b:1-4
Fri Feb 28 16:16:39 2020 - [info] Latest slaves (Slaves that received relay log files to the latest):
Fri Feb 28 16:16:39 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 16:16:39 2020 - [info]     GTID ON
Fri Feb 28 16:16:39 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 16:16:39 2020 - [info] The oldest binary log file/position on all slaves is mysql-bin.000002:234
Fri Feb 28 16:16:39 2020 - [info] Retrieved Gtid Set: 68703597-592c-11ea-88b3-000c2998280b:1-4
Fri Feb 28 16:16:39 2020 - [info] Oldest slaves:
Fri Feb 28 16:16:39 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 16:16:39 2020 - [info]     GTID ON
Fri Feb 28 16:16:39 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] * Phase 3.3: Determining New Master Phase..
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] Searching new master from slaves..
Fri Feb 28 16:16:39 2020 - [info]  Candidate masters from the configuration file:
Fri Feb 28 16:16:39 2020 - [info]  Non-candidate masters:
Fri Feb 28 16:16:39 2020 - [info] New master is 192.168.98.12(192.168.98.12:3306)
Fri Feb 28 16:16:39 2020 - [info] Starting master failover..
Fri Feb 28 16:16:39 2020 - [info] 
From:
192.168.98.11(192.168.98.11:3306) (current master)
 +--192.168.98.12(192.168.98.12:3306)

To:
192.168.98.12(192.168.98.12:3306) (new master)
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] * Phase 3.3: New Master Recovery Phase..
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info]  Waiting all logs to be applied.. 
Fri Feb 28 16:16:39 2020 - [info]   done.
Fri Feb 28 16:16:39 2020 - [info] Getting new master's binlog name and position..
Fri Feb 28 16:16:39 2020 - [info]  mysql-bin.000001:2496
Fri Feb 28 16:16:39 2020 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.98.12', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repler', MASTER_PASSWORD='xxx';
Fri Feb 28 16:16:39 2020 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000001, 2496, 3a60f8c7-592c-11ea-8cb1-000c2973aaf0:1-6,
68703597-592c-11ea-88b3-000c2998280b:1-4
Fri Feb 28 16:16:39 2020 - [info] Executing master IP activate script:
Fri Feb 28 16:16:39 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=192.168.98.100 --command=start --ssh_user=root --orig_master_host=192.168.98.11 --orig_master_ip=192.168.98.11 --orig_master_port=3306 --new_master_host=192.168.98.12 --new_master_ip=192.168.98.12 --new_master_port=3306 --new_master_user='mha'   --new_master_password=xxx
Enabling the VIP - 192.168.98.100 on the new master - 192.168.98.12 
Set read_only=0 on the new master.
Creating app user on the new master..
Fri Feb 28 16:16:39 2020 - [info]  OK.
Fri Feb 28 16:16:39 2020 - [info] ** Finished master recovery successfully.
Fri Feb 28 16:16:39 2020 - [info] * Phase 3: Master Recovery Phase completed.
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] * Phase 4: Slaves Recovery Phase..
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] * Phase 4.1: Starting Slaves in parallel..
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] All new slave servers recovered successfully.
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] * Phase 5: New master cleanup phase..
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] Resetting slave info on the new master..
Fri Feb 28 16:16:39 2020 - [info]  192.168.98.12: Resetting slave info succeeded.
Fri Feb 28 16:16:39 2020 - [error][/usr/local/share/perl5/MHA/MasterFailover.pm, ln2045] Master failover to 192.168.98.12(192.168.98.12:3306) done, but recovery on slave partially failed.
Fri Feb 28 16:16:39 2020 - [info] 

----- Failover Report -----

cls_all: MySQL Master failover 192.168.98.11(192.168.98.11:3306) to 192.168.98.12(192.168.98.12:3306)

Master 192.168.98.11(192.168.98.11:3306) is down!

Check MHA Manager logs at localhost.localdomain:/masterha/cls_all/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 192.168.98.11(192.168.98.11:3306)
Selected 192.168.98.12(192.168.98.12:3306) as a new master.
192.168.98.12(192.168.98.12:3306): OK: Applying all logs succeeded.
192.168.98.12(192.168.98.12:3306): OK: Activated master IP address.
192.168.98.12(192.168.98.12:3306): Resetting slave info succeeded.
192.168.98.10(192.168.98.10:3306): ERROR: Could not be reachable so couldn't recover.
Master failover to 192.168.98.12(192.168.98.12:3306) done, but recovery on slave partially failed.
Fri Feb 28 16:16:39 2020 - [info] Sending mail..
sh: /etc/masterha/scripts/send_report: No such file or directory
Fri Feb 28 16:16:39 2020 - [error][/usr/local/share/perl5/MHA/MasterFailover.pm, ln2089] Failed to send mail with return code 127:0

However, because 192.168.98.10 was unreachable, recovery on the slaves partially failed:

192.168.98.10(192.168.98.10:3306): ERROR: Could not be reachable so couldn't recover.
Master failover to 192.168.98.12(192.168.98.12:3306) done, but recovery on slave partially failed.
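
The per-host lines in the failover report above have a regular shape, so picking out the hosts that could not be recovered can be scripted. The following is a minimal sketch, not part of MHA: the function name `failed_hosts` and the regular expression are assumptions made for illustration, and the sample text is copied from the report shown above.

```python
import re

# Matches per-host report lines that carry an explicit OK/ERROR status, e.g.
# "192.168.98.10(192.168.98.10:3306): ERROR: Could not be reachable so couldn't recover."
# (Hypothetical pattern, derived only from the report format seen in this log.)
HOST_LINE = re.compile(
    r"^(?P<host>[\w.\-]+)\((?P<ip>[\d.]+):(?P<port>\d+)\): "
    r"(?P<status>OK|ERROR): (?P<msg>.*)$"
)


def failed_hosts(report_text):
    """Return (host, port, message) tuples for report lines marked ERROR."""
    failures = []
    for line in report_text.splitlines():
        m = HOST_LINE.match(line.strip())
        if m and m.group("status") == "ERROR":
            failures.append((m.group("host"), int(m.group("port")), m.group("msg")))
    return failures


# Sample copied from the failover report above.
SAMPLE_REPORT = """\
Selected 192.168.98.12(192.168.98.12:3306) as a new master.
192.168.98.12(192.168.98.12:3306): OK: Applying all logs succeeded.
192.168.98.12(192.168.98.12:3306): OK: Activated master IP address.
192.168.98.12(192.168.98.12:3306): Resetting slave info succeeded.
192.168.98.10(192.168.98.10:3306): ERROR: Could not be reachable so couldn't recover.
Master failover to 192.168.98.12(192.168.98.12:3306) done, but recovery on slave partially failed.
"""

if __name__ == "__main__":
    for host, port, msg in failed_hosts(SAMPLE_REPORT):
        print(f"{host}:{port} needs manual recovery: {msg}")
```

Any host listed this way has to be re-attached to the new master by hand once it is reachable again; the manager log above already prints the CHANGE MASTER TO statement intended for the other slaves.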

Even so, the failover itself succeeded, and the VIP has moved to 192.168.98.12:

root@localhost 16:16:16 [(none)]> \! ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:96:c2:3a brd ff:ff:ff:ff:ff:ff
    inet 192.168.98.12/24 brd 192.168.98.255 scope global ens33
       valid_lft forever preferred_lft forever
    inet 192.168.98.100/24 scope global secondary ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::ef03:3251:b4ed:204c/64 scope link 
       valid_lft forever preferred_lft forever
root@localhost 16:27:37 [(none)]> show slave status\G
Empty set (0.00 sec)

root@localhost 16:27:43 [(none)]> show global variables like '%read_only%';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| innodb_read_only      | OFF   |
| read_only             | OFF   |
| super_read_only       | OFF   |
| transaction_read_only | OFF   |
| tx_read_only          | OFF   |
+-----------------------+-------+
5 rows in set (0.00 sec)
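
The manual checks above (the VIP is bound on the new master, and read_only is OFF) can also be done on captured command output. This is only a sketch under assumptions: the helper names `has_address` and `read_only_flags` are hypothetical, and the sample strings are trimmed copies of the output shown above.

```python
import re


def has_address(ip_a_output, address):
    """True if `ip a` output contains an `inet <address>/<prefix>` entry."""
    pattern = re.compile(
        r"^\s*inet\s+" + re.escape(address) + r"/\d+\b", re.MULTILINE
    )
    return bool(pattern.search(ip_a_output))


def read_only_flags(table_text):
    """Parse a mysql client variables table into a {name: value} dict."""
    flags = {}
    for line in table_text.splitlines():
        # Data rows look like "| read_only | OFF |"; borders and the
        # "N rows in set" footer do not split into exactly two cells.
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) == 2 and cells[0] and cells[0] != "Variable_name":
            flags[cells[0]] = cells[1]
    return flags


# Trimmed samples copied from the console output above.
IP_A_SAMPLE = """\
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    inet 192.168.98.12/24 brd 192.168.98.255 scope global ens33
    inet 192.168.98.100/24 scope global secondary ens33
"""

VARS_SAMPLE = """\
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| read_only             | OFF   |
| super_read_only       | OFF   |
+-----------------------+-------+
"""

if __name__ == "__main__":
    print("VIP bound:", has_address(IP_A_SAMPLE, "192.168.98.100"))
    print("read_only:", read_only_flags(VARS_SAMPLE).get("read_only"))
```

The address match is anchored on the trailing `/prefix`, so looking for 192.168.98.10 does not accidentally match the VIP 192.168.98.100.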
