Handling the "Slave IO thread is not running" error reported by the MHA replication check

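1. MySQL architecture
Master: 192.168.66.202 port:3306
Slave slave1: 192.168.66.203 port:3306
Slave slave2: 192.168.66.204 port:3306
VIP: 192.168.66.235
MariaDB 10.1.18
CentOS 6.6
One master, two slaves + MHA
2. Problem description
Running the MHA replication check script (masterha_check_repl) reports that the IO thread on one of the slaves is not running. The relevant output is: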
..........................................................................
Wed Feb 22 17:28:00 2017 - [info] Checking replication health on 192.168.66.203..
Wed Feb 22 17:28:01 2017 - [info] ok.
Wed Feb 22 17:28:01 2017 - [info] Checking replication health on 192.168.66.204..
Wed Feb 22 17:28:01 2017 - [error][/usr/local/share/perl5/MHA/Server.pm, ln485] Slave IO thread is not running on 192.168.66.204(192.168.66.204:3306)
Wed Feb 22 17:28:01 2017 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln1526] failed!
Wed Feb 22 17:28:01 2017 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations. at /usr/local/share/perl5/MHA/MasterMonitor.pm line 417
Wed Feb 22 17:28:01 2017 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers.
Wed Feb 22 17:28:01 2017 - [info] Got exit code 1 (Not master dead).
MySQL Replication Health is NOT OK!
3. Root cause analysis
Check the process list and slave status on slave1:
pager cat | egrep -i 'system user|Exec_Master_Log_Pos|Seconds_Behind_Master|Read_Master_Log_Pos|Master_Log_File|Relay_Master_Log_File|Slave_IO_Running|Slave_SQL_Running|Master_Host|Master_User|Master_Port'
MariaDB [(none)]> show processlist; show slave status\G
| 2732923 | system user | | NULL | Connect | 16365 | Queueing master event to the relay log | NULL | 0.000 |
| 2732924 | system user | | NULL | Connect | 0 | Slave has read all relay log; waiting for the slave I/O thread to update it | NULL | 0.000 |
3 rows in set (0.00 sec)
Master_Host: 192.168.66.202
Master_User: rep
Master_Port: 3306
Master_Log_File: mysql-bin.000002
Read_Master_Log_Pos: 2206
Relay_Master_Log_File: mysql-bin.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 2206
Seconds_Behind_Master: 0
1 row in set (1.04 sec)
Check the process list and slave status on slave2:
pager cat | egrep -i 'system user|Exec_Master_Log_Pos|Seconds_Behind_Master|Read_Master_Log_Pos|Master_Log_File|Relay_Master_Log_File|Slave_IO_Running|Slave_SQL_Running|Master_Host|Master_User|Master_Port'
MariaDB [(none)]> show processlist; show slave status\G
| 2734714 | system user | | NULL | Connect | 16413 | Queueing master event to the relay log | NULL | 0.000 |
| 2734715 | system user | | NULL | Connect | 0 | Slave has read all relay log; waiting for the slave I/O thread to update it | NULL | 0.000 |
3 rows in set (0.00 sec)
Master_Host: 192.168.66.202
Master_User: rep
Master_Port: 3306
Master_Log_File: mysql-bin.000002
Read_Master_Log_Pos: 2206
Relay_Master_Log_File: mysql-bin.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 2206
Seconds_Behind_Master: 16415
1 row in set (1.05 sec)
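The output shows that the IO thread is running on both slaves, with Master_Log_File = Relay_Master_Log_File and Read_Master_Log_Pos = Exec_Master_Log_Pos. So why does MHA report that the IO thread on 192.168.66.204 is not running?
It is also odd that the IO thread on both slaves has stayed in the state "Queueing master event to the relay log" for a very long time (more than 16,000 seconds according to the Time column).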
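The comparison made above can be scripted. The following is a minimal Python sketch, not part of MHA: the function names `parse_slave_status` and `is_caught_up` are hypothetical, and the sample text is a trimmed copy of the slave2 output shown earlier. It illustrates that the four position fields can look healthy while Seconds_Behind_Master still points at a problem.

```python
def parse_slave_status(text):
    """Parse 'Key: value' lines of vertical-format SHOW SLAVE STATUS output into a dict."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            status[key.strip()] = value.strip()
    return status


def is_caught_up(status):
    """True when both threads report Yes and the read and executed positions match."""
    return (
        status.get("Slave_IO_Running") == "Yes"
        and status.get("Slave_SQL_Running") == "Yes"
        and status.get("Master_Log_File") == status.get("Relay_Master_Log_File")
        and status.get("Read_Master_Log_Pos") == status.get("Exec_Master_Log_Pos")
    )


# Trimmed copy of the slave2 output from this article.
sample = """
Master_Host: 192.168.66.202
Master_Log_File: mysql-bin.000002
Read_Master_Log_Pos: 2206
Relay_Master_Log_File: mysql-bin.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 2206
Seconds_Behind_Master: 16415
"""

status = parse_slave_status(sample)
print(is_caught_up(status), status["Seconds_Behind_Master"])
```

A check like this only confirms what SHOW SLAVE STATUS reports at one instant; it cannot see an IO thread that keeps dropping and reconnecting between samples, which is why the error log and SHOW SLAVE HOSTS are consulted next.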

Check how many slaves are registered on the master:
MariaDB [(none)]> show slave hosts;
+-----------+------+------+-----------+
| Server_id | Host | Port | Master_id |
+-----------+------+------+-----------+
| 2 | | 3306 | 1 |
+-----------+------+------+-----------+
1 row in set (0.00 sec)
Normally two slaves would be listed, but the master shows only one, with server ID 2. On slave1 the server_id is indeed 2:
MariaDB [(none)]> show global variables like 'server_id';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| server_id | 2 |
+---------------+-------+
1 row in set (0.00 sec)
And the second slave, slave2, also reports server_id 2:
MariaDB [(none)]> show global variables like 'server_id';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| server_id | 2 |
+---------------+-------+
1 row in set (0.00 sec)
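Comparing server_id values by eye works for three hosts but is easy to miss in a larger topology. Below is a small Python sketch of the same check; the function name `find_duplicate_server_ids` is hypothetical, and the input dict is assumed to have been collected by hand from `show global variables like 'server_id'` on each host (the master's value of 1 is taken from the Master_id column above).

```python
from collections import defaultdict


def find_duplicate_server_ids(server_ids):
    """Return {server_id: sorted hosts} for every server_id used by more than one host."""
    hosts_by_id = defaultdict(list)
    for host, server_id in server_ids.items():
        hosts_by_id[server_id].append(host)
    return {sid: sorted(hosts) for sid, hosts in hosts_by_id.items() if len(hosts) > 1}


# Values observed in this article before the fix.
observed = {
    "192.168.66.202": 1,
    "192.168.66.203": 2,
    "192.168.66.204": 2,
}

duplicates = find_duplicate_server_ids(observed)
print(duplicates)
```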
Now look at the error logs.
slave1 error log:
2017-02-22 22:45:13 47275402185472 [Note] Slave: received end packet from server, apparent master shutdown:
2017-02-22 22:45:13 47275402185472 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000002' at position 2206
slave2 error log:
2017-02-22 22:44:26 47107924593408 [Note] Slave: received end packet from server, apparent master shutdown:
2017-02-22 22:44:26 47107924593408 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000002' at position 2206
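The log shows that the IO thread failed to read an event from the master and is reconnecting, which helps explain why the IO thread on each slave stays in "Queueing master event to the relay log" for so long. The cause is now clear: both slaves have server ID 2. Every server in a MySQL replication topology must have a unique server ID. A common convention is to concatenate the last octet of the IP address with the port number: for IP 192.168.66.203 and port 3306 the server ID is 2033306. In a setup like this one, that keeps the server ID of every MySQL instance in the topology distinct.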
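The naming convention is simple enough to compute. This is a minimal Python sketch of it; the function name `server_id_for` is hypothetical, and the upper bound reflects that server_id is a 32-bit unsigned integer.

```python
import ipaddress

MAX_SERVER_ID = 2**32 - 1  # server_id is a 32-bit unsigned integer


def server_id_for(ip, port):
    """Concatenate the last octet of an IPv4 address with the port number."""
    last_octet = int(ipaddress.IPv4Address(ip)) & 0xFF
    server_id = int(f"{last_octet}{port}")
    if not 1 <= server_id <= MAX_SERVER_ID:
        raise ValueError(f"server_id {server_id} is out of range")
    return server_id


for host in ("192.168.66.202", "192.168.66.203", "192.168.66.204"):
    print(host, server_id_for(host, 3306))
```

Note that plain concatenation is only unambiguous when the instances use ports of the same length and sit in the same subnet: for example, last octet 20 with port 33306 yields the same 2033306 as last octet 203 with port 3306.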
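What is server-id used for?
1. Replicated events carry a server-id that identifies the server on which the statement was originally written, so every server must have a server-id.
2. Each replicating slave has a corresponding thread on the master (the binlog dump thread), and that thread is identified by the slave's server-id. A slave can have at most one such thread on the master, so if two slaves share a server-id, the earlier connection is kicked off when the later one connects successfully. At least one consideration behind this design is the following:
After a slave connects to the master, running STOP SLAVE on the slave drops the connection, but the corresponding thread on the master does not exit. When START SLAVE is run afterwards, the master must not create a new thread while keeping the old one, because replication could then run into problems.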
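The "one thread per server-id" rule in point 2 can be illustrated with a toy model. The Python sketch below is only an illustration of the behaviour described above, not MariaDB's actual implementation; the class `ToyMaster` and the helper `simulate` are hypothetical names.

```python
class ToyMaster:
    """Toy model: the master keeps at most one dump connection per server_id."""

    def __init__(self):
        self.dump_threads = {}  # server_id -> host currently connected
        self.kicked = []        # hosts whose connection was replaced, in order

    def connect(self, host, server_id):
        previous = self.dump_threads.get(server_id)
        if previous is not None and previous != host:
            # A different host arrives with the same server_id: drop the old one.
            self.kicked.append(previous)
        self.dump_threads[server_id] = host


def simulate(connections):
    """Replay a sequence of (host, server_id) connection attempts."""
    master = ToyMaster()
    for host, server_id in connections:
        master.connect(host, server_id)
    return master


# Both slaves use server_id 2: each reconnect kicks the other slave off.
clashing = simulate([
    ("192.168.66.203", 2),
    ("192.168.66.204", 2),
    ("192.168.66.203", 2),
])

# Distinct server IDs: both connections coexist.
fixed = simulate([
    ("192.168.66.203", 2033306),
    ("192.168.66.204", 2043306),
])

print(clashing.kicked, len(clashing.dump_threads), len(fixed.dump_threads))
```

In the clashing run the model never holds more than one registered slave at a time, which is consistent with SHOW SLAVE HOSTS listing a single row and with both error logs showing repeated reconnects.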
4. Resolution
Change the server IDs of the master and the slaves to:
master--->2023306
slave1--->2033306
slave2--->2043306
Master:
MariaDB [(none)]> set global server_id = 2023306;
Query OK, 0 rows affected (0.00 sec)
MariaDB [(none)]> show global variables like 'server_id';
+---------------+---------+
| Variable_name | Value |
+---------------+---------+
| server_id | 2023306 |
+---------------+---------+
slave1:
MariaDB [(none)]> set global server_id = 2033306;
MariaDB [(none)]> show global variables like 'server_id';
+---------------+---------+
| Variable_name | Value |
+---------------+---------+
| server_id | 2033306 |
+---------------+---------+
slave2:
MariaDB [(none)]> set global server_id = 2043306;
MariaDB [(none)]> show global variables like 'server_id';
+---------------+---------+
| Variable_name | Value |
+---------------+---------+
| server_id | 2043306 |
+---------------+---------+
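So that server_id does not revert to its old value when MySQL is restarted, make the same change in my.cnf as well (the server-id setting under [mysqld]). Then restart replication on both slaves: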
MariaDB [(none)]> stop slave;
MariaDB [(none)]> start slave;
slave1:
MariaDB [(none)]> show processlist; show slave status\G
| 2747608 | system user | | NULL | Connect | 204 | Waiting for master to send event | NULL | 0.000 |
| 2747609 | system user | | NULL | Connect | 202 | Slave has read all relay log; waiting for the slave I/O thread to update it | NULL | 0.000 |
3 rows in set (0.00 sec)
Master_Host: 192.168.66.202
Master_User: rep
Master_Port: 3306
Master_Log_File: mysql-bin.000002
Read_Master_Log_Pos: 2206
Relay_Master_Log_File: mysql-bin.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 2206
Seconds_Behind_Master: 0
slave2:
MariaDB [(none)]> show processlist; show slave status\G
| 2749407 | system user | | NULL | Connect | 131 | Waiting for master to send event | NULL | 0.000 |
| 2749408 | system user | | NULL | Connect | 128 | Slave has read all relay log; waiting for the slave I/O thread to update it | NULL | 0.000 |
3 rows in set (0.00 sec)
Master_Host: 192.168.66.202
Master_User: rep
Master_Port: 3306
Master_Log_File: mysql-bin.000002
Read_Master_Log_Pos: 2206
Relay_Master_Log_File: mysql-bin.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 2206
Seconds_Behind_Master: 0
1 row in set (0.00 sec)

The IO thread on each slave now shows "Waiting for master to send event", which is the normal state.
On the master, both slaves are now connected and registered:
MariaDB [(none)]> show slave hosts;
+-----------+------+------+-----------+
| Server_id | Host | Port | Master_id |
+-----------+------+------+-----------+
| 2033306 | | 3306 | 2023306 |
| 2043306 | | 3306 | 2023306 |
+-----------+------+------+-----------+
2 rows in set (0.00 sec)
Finally, run the MHA script again to check whether replication is healthy:
masterha_check_repl --global_conf=/apps/conf/mha/masterha_base.cnf --conf=/apps/conf/mha/app1.cnf
......
192.168.66.202(192.168.66.202:3306) (current master)
+--192.168.66.203(192.168.66.203:3306)
+--192.168.66.204(192.168.66.204:3306)

Wed Feb 22 23:20:13 2017 - [info] Checking replication health on 192.168.66.203..
Wed Feb 22 23:20:13 2017 - [info] ok.
Wed Feb 22 23:20:13 2017 - [info] Checking replication health on 192.168.66.204..
Wed Feb 22 23:20:13 2017 - [info] ok.
Wed Feb 22 23:20:13 2017 - [info] Checking master_ip_failover_script status:
Wed Feb 22 23:20:13 2017 - [info] /apps/sh/mha/script/master_ip_failover --command=status --ssh_user=apps --orig_master_host=192.168.66.202 --orig_master_ip=192.168.66.202 --orig_master_port=3306
Wed Feb 22 23:20:13 2017 - [info] OK.
Wed Feb 22 23:20:13 2017 - [warning] shutdown_script is not defined.
Wed Feb 22 23:20:13 2017 - [info] Got exit code 0 (Not master dead).
MySQL Replication Health is OK.

zengxuewen2045
