How to restart a MariaDB Galera Cluster after every node has been stopped


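I. Problem scenario

1. This situation should basically never occur in a production environment.

2. It can occur in a test environment, for example when you test on a few virtual machines on your own computer, shut them all down, and then get an error when you try to start the cluster again.

[System environment]

CentOS 7 + MariaDB 10.1.22 + Galera Cluster

[Solution]

1. For a normal first start of the cluster, use the command galera_new_cluster. For other versions, consult their own documentation.

2. To restart after the whole cluster has been shut down, log in to one of the hosts and run: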

vim /var/lib/mysql/grastate.dat

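# GALERA saved state
version: 2.1
uuid: <your own cluster ID>
seqno: -1
safe_to_bootstrap: 0

Change safe_to_bootstrap to 1 and leave seqno as it is. Galera refuses to bootstrap from a node whose safe_to_bootstrap is 0 because that node may not contain all the updates, so prefer the node that was shut down last.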

3. Bootstrap the cluster again on that host: galera_new_cluster

4. On the other nodes: systemctl start mariadb
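
Before editing the file by hand, it can help to read the current values on each node. The following is a minimal sketch rather than part of the original procedure; the helper name grastate_field is hypothetical, and it assumes the "key: value" layout shown above.

```shell
#!/usr/bin/env bash
# Print the value of one field (e.g. seqno, uuid, safe_to_bootstrap)
# from a grastate.dat file. Accepts both "key: value" and "key:value".
# grastate_field is a hypothetical helper name, not a Galera tool.
grastate_field() {
    local file="$1" key="$2"
    awk -F':' -v key="$key" '
        $1 == key {
            value = $2
            gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim surrounding whitespace
            print value
            exit
        }
    ' "$file"
}

# Example calls (path taken from the text above):
# grastate_field /var/lib/mysql/grastate.dat seqno
# grastate_field /var/lib/mysql/grastate.dat safe_to_bootstrap
```

Running it on every node gives a quick side-by-side view of the saved state before you decide which node to bootstrap from.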

 

Problem 2:


[root@controller1 haproxy]# galera_new_cluster

Job for mariadb.service failed because the control process exited with error code.

See "systemctl status mariadb.service" and "journalctl -xe" for details.

[root@controller1 haproxy]# tail /var/log/mariadb/mariadb.log

2018-03-21 12:16:18 140168333977920 [Note] WSREP: GCache history reset: f84f94a1-2c38-11e8-8ede-96f87262fb85:0 -> f84f94a1-2c38-11e8-8ede-96f87262fb85:-1

2018-03-21 12:16:18 140168333977920 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1

2018-03-21 12:16:18 140168333977920 [Note] WSREP: wsrep_sst_grab()

2018-03-21 12:16:18 140168333977920 [Note] WSREP: Start replication

2018-03-21 12:16:18 140168333977920 [Note] WSREP: 'wsrep-new-cluster' option used, bootstrapping the cluster

2018-03-21 12:16:18 140168333977920 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1

2018-03-21 12:16:18 140168333977920 [ERROR] WSREP: It may not be safe to bootstrap the cluster from this node. It was not the last one to leave the cluster and may not contain all the updates. To force cluster bootstrap with this node, edit the grastate.dat file manually and set safe_to_bootstrap to 1 .

2018-03-21 12:16:18 140168333977920 [ERROR] WSREP: wsrep::connect(gcomm://controller1,controller2,controller3) failed: 7

2018-03-21 12:16:18 140168333977920 [ERROR] Aborting

 

# Solution

[root@controller1 haproxy]# cat /var/lib/mysql/grastate.dat

# GALERA saved state

version: 2.1

uuid:    d6aea58b-2cbe-11e8-9c9d-b72d8fdd0931

seqno:   -1

safe_to_bootstrap: 0 

 

Change safe_to_bootstrap: 0 to safe_to_bootstrap: 1.

 

# Start the cluster again

[root@controller1 haproxy]# galera_new_cluster # on the other nodes, start the service with: systemctl start mariadb
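
The manual edit can also be scripted. This is a minimal sketch, assuming mysqld is stopped on the node and that you have already decided this node is the right one to bootstrap from; the function name force_safe_to_bootstrap is hypothetical.

```shell
#!/usr/bin/env bash
# Set safe_to_bootstrap to 1 in the given grastate.dat and keep the
# original as <file>.bak. Run only while mysqld is stopped on this node.
# force_safe_to_bootstrap is a hypothetical helper name.
force_safe_to_bootstrap() {
    local file="$1"
    cp -p "$file" "$file.bak" || return 1
    # Rewrite only the safe_to_bootstrap line; every other line is copied as-is.
    sed 's/^safe_to_bootstrap:.*/safe_to_bootstrap: 1/' "$file.bak" > "$file"
}

# Example (as root), followed by the bootstrap command from the text:
# force_safe_to_bootstrap /var/lib/mysql/grastate.dat && galera_new_cluster
```

Keeping the .bak copy makes it easy to compare the file before and after the change.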

 

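II. Handling common MySQL Galera cluster problems

Issue 1: Recovering a MySQL HA cluster after a long network outage, or after all nodes have gone down:
Solution 1:
1. Once the three machines can reach each other over the network again, the mysql instances are in an abnormal state and cannot rejoin the cluster. First make sure mysql is down on all of them, then run /usr/libexec/mysqld --wsrep-new-cluster --wsrep-cluster-address='gcomm://' & on the machines and log in to mysql. (Only one machine will run this successfully; on the others the process exits abnormally after a few seconds. The machine where it succeeds is called the master here.)
2. At this point only one of the three machines can log in to mysql (that is, run the mysql command). On the two non-master machines, run systemctl start mysqld one machine at a time; wait until the first has started successfully before starting the second.

3. In mysql, run show status like "wsrep%"; (the screenshot of the result that accompanied the original post is not available here).

 The first item shown must be Synced, and the second must list the IPs of all three mysql nodes. The status variables that report these values are wsrep_local_state_comment and wsrep_incoming_addresses.

4. If step 3 shows the expected result, the cluster has recovered. Now kill the /usr/libexec/mysqld --wsrep-new-cluster --wsrep-cluster-address='gcomm://' process on the master, then run systemctl start mysqld there.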
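
The check in step 3 can be automated. The sketch below is an illustration under assumptions: the function name cluster_is_healthy is hypothetical, and it checks wsrep_cluster_size (a standard Galera status variable) instead of comparing individual IP addresses.

```shell
#!/usr/bin/env bash
# Read "name<TAB>value" lines on stdin (the format produced by
# `mysql -N -B -e "SHOW STATUS LIKE 'wsrep%'"`) and succeed only if this
# node is Synced and the cluster has the expected number of members.
# cluster_is_healthy is a hypothetical helper name.
cluster_is_healthy() {
    local expected_size="$1"
    awk -F'\t' -v want="$expected_size" '
        $1 == "wsrep_local_state_comment" { state = $2 }
        $1 == "wsrep_cluster_size"        { size = $2 }
        END { exit !(state == "Synced" && size == want) }
    '
}

# Example on a live node:
# mysql -N -B -e "SHOW STATUS LIKE 'wsrep%'" | cluster_is_healthy 3 && echo "cluster recovered"
```

The function only reports success or failure through its exit status, so it can be used in a loop that waits for each node to finish joining before the next one is started.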

Issue 2: Recovering when one node of the MySQL HA cluster went down unexpectedly and stayed down for some time:

1. If the log contains the following entry

160119 14:11:05 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (eb9f50c6-bc95-11e5-a735-9f48e437dc03): 1 (Operation not permitted)

Solution: delete the /var/lib/mysql/grastate.dat file (if the node still cannot sync, delete the galera.cache file as well).
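
A more cautious variant is to move the files aside rather than delete them. This is a minimal sketch, assuming mysqld is stopped on the node; the function name reset_galera_state and the idea of a separate backup directory are illustrative additions, not part of the original fix.

```shell
#!/usr/bin/env bash
# Move grastate.dat (and galera.cache when the third argument is "yes")
# from the data directory into a backup directory instead of deleting them.
# reset_galera_state is a hypothetical helper name.
reset_galera_state() {
    local datadir="$1" backup="$2" include_cache="${3:-no}"
    mkdir -p "$backup" || return 1
    if [ -f "$datadir/grastate.dat" ]; then
        mv "$datadir/grastate.dat" "$backup/" || return 1
    fi
    if [ "$include_cache" = "yes" ] && [ -f "$datadir/galera.cache" ]; then
        mv "$datadir/galera.cache" "$backup/" || return 1
    fi
    return 0
}

# Example (as root); /root/galera-backup is an arbitrary, assumed location:
# reset_galera_state /var/lib/mysql /root/galera-backup        # grastate.dat only
# reset_galera_state /var/lib/mysql /root/galera-backup yes    # also galera.cache
```

The backup directory is kept outside the data directory so that the server does not see it when it starts.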

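2. If the node that went down logs the following

(The cluster went down abnormally.) [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions

Solution:
1. /usr/libexec/mysqld start --innodb_force_recovery=6
The innodb_force_recovery levels mean:
  1. (SRV_FORCE_IGNORE_CORRUPT): ignore corrupt pages that are detected.
  2. (SRV_FORCE_NO_BACKGROUND): prevent the master thread from running, which avoids the crash that would occur if the master thread had to perform a full purge.
  3. (SRV_FORCE_NO_TRX_UNDO): do not run transaction rollbacks.
  4. (SRV_FORCE_NO_IBUF_MERGE): do not perform insert buffer merge operations.
  5. (SRV_FORCE_NO_UNDO_LOG_SCAN): do not look at the undo logs; the InnoDB storage engine treats uncommitted transactions as committed.
  6. (SRV_FORCE_NO_LOG_REDO): do not perform the redo log roll-forward.
If the following appears after this setting: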
130507 14:14:01  InnoDB: Waiting for the background threads to start
130507 14:14:02  InnoDB: Waiting for the background threads to start
130507 14:14:03  InnoDB: Waiting for the background threads to start
130507 14:14:04  InnoDB: Waiting for the background threads to start
130507 14:14:05  InnoDB: Waiting for the background threads to start
130507 14:14:06  InnoDB: Waiting for the background threads to start
130507 14:14:07  InnoDB: Waiting for the background threads to start
130507 14:14:08  InnoDB: Waiting for the background threads to start
130507 14:14:09  InnoDB: Waiting for the background threads to start


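then add the following to galera.cfg:
whenever innodb_force_recovery is set higher than 2, also set innodb_purge_threads = 0
2. mysqld --tc-heuristic-recover=ROLLBACK
3. Delete /var/lib/mysql/ib_logfile*
4. When a mysql node has gone down and the three mysql hosts have addresses on several different network segments, the node needs an SST to rejoin the cluster. During the SST it must know which IP to use with the other nodes, so specify the --wsrep-sst-receive-address parameter; otherwise the IP used for the sync may not be on the network segment that all three machines share.
Reference: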
http://blog.itpub.net/22664653/viewspace-1441389/


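Issue 3: If a mysql node has been down for a while, it needs some time to synchronize data when it restarts. The service start timeout is too short for this, so the service fails to start. The fix is as follows: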
The correct way to adjust systemd settings so they don't get overwritten is to create a directory and file as such:
/etc/systemd/system/mariadb.service.d/timeout.conf
[Service]
 
TimeoutStartSec=12min


Alternatively, edit /usr/lib/systemd/system/mariadb.service directly:
[Service]
 
TimeoutStartSec=12min
The timeout set here must be longer than 90s, which is the default. Afterwards, run systemctl daemon-reload and restart the service.
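
Creating the drop-in file can be scripted as follows. This is a minimal sketch; the function name write_timeout_dropin is hypothetical, while the directory, file name, and setting are the ones given above.

```shell
#!/usr/bin/env bash
# Write a systemd drop-in that raises the start timeout for a unit.
# unit_dir is normally /etc/systemd/system/mariadb.service.d
# write_timeout_dropin is a hypothetical helper name.
write_timeout_dropin() {
    local unit_dir="$1" timeout="$2"
    mkdir -p "$unit_dir" || return 1
    printf '[Service]\nTimeoutStartSec=%s\n' "$timeout" > "$unit_dir/timeout.conf"
}

# Example (as root), followed by the reload and restart described in the text:
# write_timeout_dropin /etc/systemd/system/mariadb.service.d 12min
# systemctl daemon-reload
# systemctl restart mariadb
```

Using a drop-in under /etc keeps the change separate from the packaged unit file in /usr/lib.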
Issue 4: The log shows an error like the following:
160428 13:54:49 [ERROR] Slave SQL: Error 'Table 'manage_operations' already exists' on query. Default database: 'horizon'. Query: 'CREATE TABLE `manage_operations` (
    `id` integer AUTO_INCREMENT NOT NULL PRIMARY KEY,
    `name` varchar(50) NOT NULL,
    `type` varchar(20) NOT NULL,
    `operation` varchar(20) NOT NULL,
    `status` varchar(20) NOT NULL,
    `time` date NOT NULL,
    `operator` varchar(50) NOT NULL
) default charset=utf8', Error_code: 1050
160428 13:54:49 [Warning] WSREP: RBR event 1 Query apply warning: 1, 28585
160428 13:54:49 [Warning] WSREP: Ignoring error for TO isolated action: source: 752eecd1-0ce0-11e6-83fc-3e0502d0bdd2 version: 3 local: 0 state: APPLYING flags: 65 conn_id: 24053 trx_id: -1 seqnos (l: 28668, g: 28585, s: 28584, d: 28584, ts: 80224119986850)
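This causes the process to shut down abnormally.
In this case, run mysqladmin flush-tables to flush the tables. The cause is a table synchronization problem among the three nodes, and flushing the tables resolves it.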


Issue 5: The log shows the following errors:
160520 10:48:23 [Note] WSREP: COMMIT failed, MDL released: 367194
160520 10:48:23 [Note] WSREP: cert failure, thd: 358780 is_AC: 0, retry: 0 - 1 SQL: commit
160520 10:48:23 [Note] WSREP: cert failure, thd: 358784 is_AC: 0, retry: 0 - 1 SQL: commit
160520 10:48:23 [Note] WSREP: COMMIT failed, MDL released: 367188
160520 10:48:23 [Note] WSREP: cert failure, thd: 359683 is_AC: 0, retry: 0 - 1 SQL: commit
160520 10:48:23 [Note] WSREP: cert failure, thd: 358808 is_AC: 0, retry: 0 - 1 SQL: commit
160520 10:48:23 [Note] WSREP: cert failure, thd: 367191 is_AC: 0, retry: 0 - 1 SQL: commit
160520 10:48:23 [Note] WSREP: cert failure, thd: 367196 is_AC: 0, retry: 0 - 1 SQL: commit
160520 10:48:23 [Note] WSREP: cert failure, thd: 367194 is_AC: 0, retry: 0 - 1 SQL: commit

160520 10:48:23 [Note] WSREP: cert failure, thd: 367188 is_AC: 0, retry: 0 - 1 SQL: commit

Issue 6: The log shows the following errors:

160820  3:13:41 [ERROR] Error in accept: Too many open files
160820  3:19:42 [ERROR] Error in accept: Too many open files
160827  3:16:24 [ERROR] Error in accept: Too many open files
160831 17:20:52 [ERROR] Error in accept: Too many open files
160831 19:54:29 [ERROR] Error in accept: Too many open files
160831 20:21:53 [ERROR] Error in accept: Too many open files
160901 11:25:57 [ERROR] Error in accept: Too many open files

Solution

vim /usr/lib/systemd/system/mariadb.service

 [Service]
 LimitNOFILE=10000

The default open_files_limit for mysql is 1024. Increase it as shown above, and also change the open_files_limit value in /etc/my.cnf.d/server.cnf.

systemctl daemon-reload

systemctl restart mysqld

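Check whether mysql's open_files_limit has been adjusted:

cat /proc/$pid/limits

Here $pid is the PID of the mysql process. Check that the value has changed, and watch the log to see whether the error above still appears.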
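
To read just the relevant value instead of scanning the whole file, a small helper can extract it. This is a minimal sketch; the function name soft_open_files_limit is hypothetical, and it assumes the usual Linux layout of the limits file, where the soft limit is the fourth whitespace-separated field of the "Max open files" line.

```shell
#!/usr/bin/env bash
# Print the soft "Max open files" limit from a /proc/<pid>/limits style file.
# soft_open_files_limit is a hypothetical helper name.
soft_open_files_limit() {
    awk '/^Max open files/ { print $4; exit }' "$1"
}

# Example; pidof may return several PIDs, so take the first one:
# pid=$(pidof mysqld | awk '{print $1}')
# soft_open_files_limit "/proc/$pid/limits"
```

If the printed value still shows the old limit after the restart, the LimitNOFILE setting was not picked up by the running service.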
