Troubleshooting "Slave IO thread is not running" reported by the MHA replication check

1. MySQL architecture
Master: 192.168.66.202 port: 3306
Slave slave1: 192.168.66.203 port: 3306
Slave slave2: 192.168.66.204 port: 3306
VIP: 192.168.66.235
MariaDB 10.1.18
CentOS 6.6
One master, two slaves + MHA
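2. Problem description
Running the MHA replication check script (masterha_check_repl) reported that the IO thread on one slave was not running. The output was: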
..........................................................................
Wed Feb 22 17:28:00 2017 - [info] Checking replication health on 192.168.66.203..
Wed Feb 22 17:28:01 2017 - [info] ok.
Wed Feb 22 17:28:01 2017 - [info] Checking replication health on 192.168.66.204..
Wed Feb 22 17:28:01 2017 - [error][/usr/local/share/perl5/MHA/Server.pm, ln485] Slave IO thread is not running on 192.168.66.204(192.168.66.204:3306)
Wed Feb 22 17:28:01 2017 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln1526] failed!
Wed Feb 22 17:28:01 2017 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations. at /usr/local/share/perl5/MHA/MasterMonitor.pm line 417
Wed Feb 22 17:28:01 2017 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers.
Wed Feb 22 17:28:01 2017 - [info] Got exit code 1 (Not master dead).
MySQL Replication Health is NOT OK!
3. Root cause analysis
Check the process list and slave status on slave1:
pager cat | egrep -i 'system user|Exec_Master_Log_Pos|Seconds_Behind_Master|Read_Master_Log_Pos|Master_Log_File|Relay_Master_Log_File|Slave_IO_Running|Slave_SQL_Running|Master_Host|Master_User|Master_Port'
MariaDB [(none)]> show processlist; show slave status\G
| 2732923 | system user | | NULL | Connect | 16365 | Queueing master event to the relay log | NULL | 0.000 |
| 2732924 | system user | | NULL | Connect | 0 | Slave has read all relay log; waiting for the slave I/O thread to update it | NULL | 0.000 |
3 rows in set (0.00 sec)
Master_Host: 192.168.66.202
Master_User: rep
Master_Port: 3306
Master_Log_File: mysql-bin.000002
Read_Master_Log_Pos: 2206
Relay_Master_Log_File: mysql-bin.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 2206
Seconds_Behind_Master: 0
1 row in set (1.04 sec)
Check the process list and slave status on slave2:
pager cat | egrep -i 'system user|Exec_Master_Log_Pos|Seconds_Behind_Master|Read_Master_Log_Pos|Master_Log_File|Relay_Master_Log_File|Slave_IO_Running|Slave_SQL_Running|Master_Host|Master_User|Master_Port'
MariaDB [(none)]> show processlist; show slave status\G
| 2734714 | system user | | NULL | Connect | 16413 | Queueing master event to the relay log | NULL | 0.000 |
| 2734715 | system user | | NULL | Connect | 0 | Slave has read all relay log; waiting for the slave I/O thread to update it | NULL | 0.000 |
3 rows in set (0.00 sec)
Master_Host: 192.168.66.202
Master_User: rep
Master_Port: 3306
Master_Log_File: mysql-bin.000002
Read_Master_Log_Pos: 2206
Relay_Master_Log_File: mysql-bin.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 2206
Seconds_Behind_Master: 16415
1 row in set (1.05 sec)
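The output above shows that the IO thread is running on both slaves, with Master_Log_File = Relay_Master_Log_File and Read_Master_Log_Pos = Exec_Master_Log_Pos. So why does MHA report that the IO thread on 192.168.66.204 is not running?
The output also shows that the IO thread has been sitting in the state "Queueing master event to the relay log" for a long time, which is very odd.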
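The manual comparison above can be sketched in Python. This is only an illustration, not part of MHA: the function names `parse_slave_status` and `replication_findings` are hypothetical, and the sample text is the filtered slave2 output shown earlier.

```python
def parse_slave_status(text):
    """Parse 'Key: Value' lines from filtered SHOW SLAVE STATUS\\G output into a dict."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon, e.g. "1 row in set (1.05 sec)"
            status[key.strip()] = value.strip()
    return status


def replication_findings(status):
    """Return a list of observations; an empty list means nothing stands out."""
    findings = []
    if status.get("Slave_IO_Running") != "Yes":
        findings.append("IO thread not running")
    if status.get("Slave_SQL_Running") != "Yes":
        findings.append("SQL thread not running")
    if status.get("Master_Log_File") != status.get("Relay_Master_Log_File"):
        findings.append("SQL thread is on a different binlog file than the IO thread")
    elif status.get("Read_Master_Log_Pos") != status.get("Exec_Master_Log_Pos"):
        findings.append("SQL thread has not applied all events read")
    lag = status.get("Seconds_Behind_Master")
    if lag not in (None, "NULL") and int(lag) > 0:
        findings.append(f"Seconds_Behind_Master is {lag}")
    return findings


# Filtered output captured on slave2 (copied from the session above).
slave2_output = """\
Master_Host: 192.168.66.202
Master_User: rep
Master_Port: 3306
Master_Log_File: mysql-bin.000002
Read_Master_Log_Pos: 2206
Relay_Master_Log_File: mysql-bin.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 2206
Seconds_Behind_Master: 16415
1 row in set (1.05 sec)
"""

print(replication_findings(parse_slave_status(slave2_output)))
```

On the slave2 sample, the only thing the sketch flags is the large Seconds_Behind_Master value. The thread flags and binlog positions look healthy, which is why the status output alone does not explain the MHA error.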

Check how many slaves are registered on the master:
MariaDB [(none)]> show slave hosts;
+-----------+------+------+-----------+
| Server_id | Host | Port | Master_id |
+-----------+------+------+-----------+
| 2 | | 3306 | 1 |
+-----------+------+------+-----------+
1 row in set (0.00 sec)
Normally two slaves would be listed, but the master shows only one, with server id 2. On slave1, server_id is indeed 2:
MariaDB [(none)]> show global variables like 'server_id';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| server_id | 2 |
+---------------+-------+
1 row in set (0.00 sec)
And on the second slave, slave2, server_id is also 2:
MariaDB [(none)]> show global variables like 'server_id';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| server_id | 2 |
+---------------+-------+
1 row in set (0.00 sec)
Next, look at the error logs.
slave1 error log:
2017-02-22 22:45:13 47275402185472 [Note] Slave: received end packet from server, apparent master shutdown: 
2017-02-22 22:45:13 47275402185472 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000002' at position 2206
slave2 error log:
2017-02-22 22:44:26 47107924593408 [Note] Slave: received end packet from server, apparent master shutdown: 
2017-02-22 22:44:26 47107924593408 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000002' at position 2206
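These messages show that the IO thread failed to read events from the master and is reconnecting, which also explains why the slave IO thread stayed in the state "Queueing master event to the relay log" for so long. The cause is simple: both slaves have server id 2. Server ids within a MySQL replication topology must never be identical. A common recommendation is to use the last octet of the IP address followed by the port number. For example, with IP 192.168.66.203 and port 3306 the server id is 2033306. This keeps the server id of every MySQL instance in the replication topology distinct.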
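The naming scheme can be sketched as follows. This is a minimal illustration: `derive_server_id` and `assign_server_ids` are hypothetical helper names, and the range check assumes server_id is a 32-bit unsigned value.

```python
def derive_server_id(ip, port):
    """Concatenate the last octet of an IPv4 address with the port number."""
    last_octet = int(ip.split(".")[-1])
    server_id = int(f"{last_octet}{int(port)}")
    # Assumption: server_id must fit in a 32-bit unsigned integer and be non-zero.
    if not 1 <= server_id <= 4294967295:
        raise ValueError(f"server id {server_id} is out of range")
    return server_id


def assign_server_ids(instances):
    """Map each (ip, port) pair to a server id; raise if two instances collide."""
    owners = {}  # server_id -> (ip, port)
    for ip, port in instances:
        server_id = derive_server_id(ip, port)
        if server_id in owners:
            raise ValueError(
                f"server id {server_id} is used by both {owners[server_id]} and {(ip, port)}"
            )
        owners[server_id] = (ip, port)
    return {instance: server_id for server_id, instance in owners.items()}


topology = [
    ("192.168.66.202", 3306),  # master
    ("192.168.66.203", 3306),  # slave1
    ("192.168.66.204", 3306),  # slave2
]
for instance, server_id in assign_server_ids(topology).items():
    print(instance, server_id)
```

The explicit collision check is a design choice. Plain concatenation can be ambiguous when ports have different lengths (last octet 20 with port 33306 also yields 2033306), so it seems safer to verify uniqueness than to assume it.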
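What is server-id used for?
1. The data MySQL replicates carries the server-id, which identifies the server where each statement was originally written, so a server-id is mandatory.
2. Each replicating slave has a corresponding master thread on the master, and that thread is identified by the slave's server-id. Each slave has at most one master thread on the master side. If two slaves share the same server-id, the earlier connection is kicked out when the later one connects successfully. There is at least one consideration behind this design:
After a slave connects to the master, if slave stop; is executed on the slave, the connection is dropped but the corresponding thread on the master does not exit. When the slave is started again, the master cannot create a new thread while keeping the old one, because replication could then go wrong.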
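The "one master thread per server-id" behaviour described above can be pictured with a toy model. It is purely illustrative: `ToyMaster` is a hypothetical class and does not reflect how the server is actually implemented.

```python
class ToyMaster:
    """Toy model of a master that keeps at most one thread per slave server-id."""

    def __init__(self):
        self.threads = {}  # server_id -> host of the slave currently served

    def connect(self, server_id, host):
        """Register a slave connection; return the host that was kicked out, if any."""
        previous = self.threads.get(server_id)
        self.threads[server_id] = host
        # A reconnect from the same host simply replaces its own stale thread.
        return previous if previous != host else None


# Both slaves use server-id 2: each new connection evicts the other slave.
master = ToyMaster()
print(master.connect(2, "192.168.66.203"))  # nobody to kick out yet
print(master.connect(2, "192.168.66.204"))  # kicks out 192.168.66.203
print(master.connect(2, "192.168.66.203"))  # reconnect kicks out 192.168.66.204
print(len(master.threads))                  # only one slave is registered

# Distinct server-ids: both slaves stay connected.
fixed = ToyMaster()
fixed.connect(2033306, "192.168.66.203")
fixed.connect(2043306, "192.168.66.204")
print(len(fixed.threads))
```

In this model the two slaves with the same id keep evicting each other. That is consistent with the repeated "reconnecting to retry" messages in both error logs and with show slave hosts listing a single slave.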
4. Solution:
Change the server ids of the master and the slaves to:
master--->2023306
slave1--->2033306
slave2--->2043306
Master:
MariaDB [(none)]> set global server_id = 2023306;
Query OK, 0 rows affected (0.00 sec)
MariaDB [(none)]> show global variables like 'server_id';
+---------------+---------+
| Variable_name | Value |
+---------------+---------+
| server_id | 2023306 |
+---------------+---------+
Slave 1:
MariaDB [(none)]> set global server_id = 2033306;
MariaDB [(none)]> show global variables like 'server_id';
+---------------+---------+
| Variable_name | Value |
+---------------+---------+
| server_id | 2033306 |
+---------------+---------+
Slave 2:
MariaDB [(none)]> set global server_id = 2043306;
MariaDB [(none)]> show global variables like 'server_id';
+---------------+---------+
| Variable_name | Value |
+---------------+---------+
| server_id | 2043306 |
+---------------+---------+
To keep server_id from reverting to its old value after MySQL restarts, make the same change in my.cnf on each server. Then restart replication on both slaves:
MariaDB [(none)]> stop slave;
MariaDB [(none)]> start slave;
slave1:
MariaDB [(none)]> show processlist; show slave status\G
| 2747608 | system user | | NULL | Connect | 204 | Waiting for master to send event | NULL | 0.000 |
| 2747609 | system user | | NULL | Connect | 202 | Slave has read all relay log; waiting for the slave I/O thread to update it | NULL | 0.000 |
3 rows in set (0.00 sec)
Master_Host: 192.168.66.202
Master_User: rep
Master_Port: 3306
Master_Log_File: mysql-bin.000002
Read_Master_Log_Pos: 2206
Relay_Master_Log_File: mysql-bin.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 2206
Seconds_Behind_Master: 0
slave2:
MariaDB [(none)]> show processlist; show slave status\G
| 2749407 | system user | | NULL | Connect | 131 | Waiting for master to send event | NULL | 0.000 |
| 2749408 | system user | | NULL | Connect | 128 | Slave has read all relay log; waiting for the slave I/O thread to update it | NULL | 0.000 |
3 rows in set (0.00 sec)
Master_Host: 192.168.66.202
Master_User: rep
Master_Port: 3306
Master_Log_File: mysql-bin.000002
Read_Master_Log_Pos: 2206
Relay_Master_Log_File: mysql-bin.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 2206
Seconds_Behind_Master: 0
1 row in set (0.00 sec)

The slave IO thread state now shows "Waiting for master to send event", which is the normal state.
The master now shows both slaves connected and registered:
MariaDB [(none)]> show slave hosts;
+-----------+------+------+-----------+
| Server_id | Host | Port | Master_id |
+-----------+------+------+-----------+
| 2033306 | | 3306 | 2023306 |
| 2043306 | | 3306 | 2023306 |
+-----------+------+------+-----------+
2 rows in set (0.00 sec)
Finally, run the MHA check script again to confirm that replication is healthy:
masterha_check_repl --global_conf=/apps/conf/mha/masterha_base.cnf --conf=/apps/conf/mha/app1.cnf
......
192.168.66.202(192.168.66.202:3306) (current master)
+--192.168.66.203(192.168.66.203:3306)
+--192.168.66.204(192.168.66.204:3306)


Wed Feb 22 23:20:13 2017 - [info] Checking replication health on 192.168.66.203..
Wed Feb 22 23:20:13 2017 - [info] ok.
Wed Feb 22 23:20:13 2017 - [info] Checking replication health on 192.168.66.204..
Wed Feb 22 23:20:13 2017 - [info] ok.
Wed Feb 22 23:20:13 2017 - [info] Checking master_ip_failover_script status:
Wed Feb 22 23:20:13 2017 - [info] /apps/sh/mha/script/master_ip_failover --command=status --ssh_user=apps --orig_master_host=192.168.66.202 --orig_master_ip=192.168.66.202 --orig_master_port=3306
Wed Feb 22 23:20:13 2017 - [info] OK.
Wed Feb 22 23:20:13 2017 - [warning] shutdown_script is not defined.
Wed Feb 22 23:20:13 2017 - [info] Got exit code 0 (Not master dead).
MySQL Replication Health is OK.


 
