Ops Notes 31 (Summary of Building a Pacemaker High-Availability Cluster)

Overview:

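Pacemaker is the resource manager that was split out of Heartbeat when Heartbeat reached v3, so Pacemaker itself does not provide heartbeat messaging. The cluster also needs corosync for the messaging (heartbeat) layer before it is complete. Pacemaker acts as the control center of the whole HA stack: clients configure and manage the entire cluster through it. The crm shell (crmsh) is another very handy tool when building a cluster, because it generates the configuration for us and keeps it synchronized across the nodes.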

1. Install the cluster software

    yum install pacemaker corosync -y

Install pacemaker and corosync directly with yum.

crmsh-1.2.6-0.rc2.2.1.x86_64.rpm

pssh-2.3.1-2.1.x86_64.rpm

Install the two RPM packages above. Note that crmsh depends on pssh.

2. Configure the cluster with crm

[root@ha1 ~]# crm
crm(live)#

Type crm (cluster resource manager) to enter the cluster resource manager shell.

crm(live)# 
?           cib         exit        node        ra          status      
bye         configure   help        options     resource    up          
cd          end         history     quit        site

Press Tab to see the available management items.

We want to configure the cluster now, so enter configure.

ERROR: running cibadmin -Ql: Could not establish cib_rw connection: Connection refused (111)
Signon to CIB failed: Transport endpoint is not connected
Init failed, could not perform requested operations

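The error above appears, most likely because the corosync service has not been started. Even without the error, there is no point in starting the higher-level cluster manager before the heartbeat layer is up, so configure corosync first.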

[root@ha1 ~]# rpm -ql corosync
/etc/corosync
/etc/corosync/corosync.conf.example

Use rpm to locate the corosync configuration file.

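Remove the .example suffix from the configuration file (that is, save it as /etc/corosync/corosync.conf) and change its contents to the following: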

# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
	version: 2
	secauth: off
	threads: 0
	interface {
		ringnumber: 0			
		bindnetaddr: 192.168.5.0		# network over which cluster messages are sent
		mcastaddr: 226.94.1.1			# multicast address
		mcastport: 5405				# multicast port
		ttl: 1					# send multicast packets with a TTL of 1 only, to prevent loops
	}
}

logging {
	fileline: off
	to_stderr: no
	to_logfile: yes
	to_syslog: yes
	logfile: /var/log/cluster/corosync.log
	debug: off
	timestamp: on
	logger_subsys {
		subsys: AMF
		debug: off
	}
}

amf {
	mode: disabled
}
service {		# have corosync load pacemaker
	name: pacemaker
	ver: 0		# plugin version: with 0 the plugin starts pacemaker automatically, with 1 it does not
}
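Note that bindnetaddr is the network address of the interface corosync binds to, not the node's own IP. As a minimal sketch, the value can be derived from a node address and prefix length as shown below. The node address 192.168.5.11 and the function name network_addr are hypothetical and used only for illustration; only the resulting 192.168.5.0 matches the configuration above.

```shell
#!/bin/sh
# network_addr IP PREFIX
# Prints the network address, i.e. IP AND netmask.
# The body runs in a subshell so the IFS change does not leak out.
network_addr() (
    prefix=$2
    IFS=.
    set -- $1                      # split the dotted quad into $1..$4
    addr=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    net=$(( addr & mask ))
    echo "$(( (net >> 24) & 255 )).$(( (net >> 16) & 255 )).$(( (net >> 8) & 255 )).$(( net & 255 ))"
)

# Hypothetical node address on the cluster network
network_addr 192.168.5.11 24       # prints 192.168.5.0
```

The printed value is what goes into bindnetaddr, and it should be the same on every node of the cluster.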

Next, start corosync. If it starts successfully and the log shows no errors, the messaging layer is working.

crm should now work normally.

crm(live)# configure 
crm(live)configure# show
node ha1.mo.com
node ha2.mo.com
property $id="cib-bootstrap-options" \
	dc-version="1.1.10-14.el6-368c726" \
	cluster-infrastructure="classic openais (with plugin)" \
	expected-quorum-votes="2"

[root@ha1 cluster]# crm configure show
node ha1.mo.com
node ha2.mo.com
property $id="cib-bootstrap-options" \
	dc-version="1.1.10-14.el6-368c726" \
	cluster-infrastructure="classic openais (with plugin)" \
	expected-quorum-votes="2"

The same command also works when typed directly in bash, but without tab completion.

Now let's add some services to the cluster.

Start with a fairly simple one, the IP service.

crm(live)configure# primitive vip ocf:heartbeat:IPaddr2 params ip=192.168.5.100 cidr_netmask=24 op monitor interval=30s

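The command looks long, but most of it comes from tab completion. As long as you understand what you are doing, you can build it with almost no memorization. Here ocf refers to OCF cluster resource agent scripts, while lsb refers to standard Linux init scripts, that is, the scripts under /etc/init.d.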

A change to the configuration is not saved and written out as program-readable XML right away. You have to commit it first.

crm(live)configure# commit
   error: unpack_resources: 	Resource start-up disabled since no STONITH resources have been defined
   error: unpack_resources: 	Either configure some or disable STONITH with the stonith-enabled option
   error: unpack_resources: 	NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
Do you still want to commit?

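After committing I got the errors above. They concern STONITH: STONITH is enabled by default, but no STONITH resource has been defined yet. Ignore this for now, since we are only adding an IP service, and confirm the commit. Note that the configuration takes effect as soon as the commit is confirmed.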

Use crm's built-in status view to check whether the service is running.

crm(live)configure# cd
crm(live)# resource 
crm(live)resource# show
 vip	(ocf::heartbeat:IPaddr2):	Stopped 
crm(live)resource# start vip
crm(live)resource# show
 vip	(ocf::heartbeat:IPaddr2):	Stopped

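cd returns to the top level, and resource show lists the resources. Oddly, the resource has not started, and starting it manually fails as well, which suggests a configuration problem. Let's check the log.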

GINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Feb 27 07:14:09 ha1 pengine[6053]:    error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
Feb 27 07:14:09 ha1 pengine[6053]:    error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
Feb 27 07:14:09 ha1 pengine[6053]:    error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity

The only errors are the STONITH ones, so try disabling STONITH.

crm(live)configure# property stonith-enabled=false
crm(live)resource# show
 vip	(ocf::heartbeat:IPaddr2):	Started

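The service is now running normally, so make sure every ERROR is cleared. After the steps above you can probably see how convenient pacemaker is: a change made on one node is applied to all nodes, with no need to distribute it yourself.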

Now test whether health checking works by shutting down the network on ha1.

[root@ha2 ~]# crm_mon

Last updated: Mon Feb 27 07:30:23 2017
Last change: Mon Feb 27 07:16:50 2017 via cibadmin on ha1.mo.com
Stack: classic openais (with plugin)
Current DC: ha2.mo.com - partition WITHOUT quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
1 Resources configured

Online: [ ha2.mo.com ]
OFFLINE: [ ha1.mo.com ]
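The "Current DC" line in this output already reports that the surviving partition is WITHOUT quorum. As a rough sketch, a script could read that state from crm_mon text as follows. The helper name quorum_state is hypothetical, and the sample line is copied from the output above.

```shell
#!/bin/sh
# quorum_state: reads crm_mon text on stdin and prints
# "quorum" or "no-quorum" based on the "Current DC" line.
# The match is case-sensitive, so "WITHOUT quorum" does not match.
quorum_state() {
    if grep -q 'partition with quorum'; then
        echo quorum
    else
        echo no-quorum
    fi
}

# Sample line copied from the crm_mon output above
printf 'Current DC: ha2.mo.com - partition WITHOUT quorum\n' | quorum_state
```

On a live node the same helper could be fed from a one-shot status dump, for example crm_mon -1 | quorum_state.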

STONITH is normally a hardware device. Our nodes are virtual machines, so we need a virtual fence device.

[root@ha1 ~]# stonith_admin -I
 fence_pcmk
 fence_legacy
2 devices found

This lists the installed fence devices. The fence_xvm we need is not among them, so let's search with yum.

fence-virt.x86_64 : A pluggable fencing framework for virtual machines

This looks like what we need. Install it and check again.

[root@ha1 ~]# stonith_admin -I
 fence_xvm
 fence_virt
 fence_pcmk
 fence_legacy
4 devices found

Now the fence_xvm we need is available.

[root@ha1 ~]# stonith_admin -M -a fence_xvm

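The command above prints the metadata of the fence_xvm agent, that is, the parameters it accepts.
Enter crm and add the fence configuration.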

crm(live)configure# primitive vmfence stonith:fence_xvm params pcmk_host_map="ha1.mo.com:ha1;ha2.mo.com:ha2" op monitor interval=20s

The pcmk_host_map above maps each cluster node's hostname to the name of its virtual machine domain on the hypervisor.
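For clusters with more nodes, the map string can be assembled rather than typed by hand. The following is only a sketch: build_host_map is a hypothetical helper, and it assumes each argument is already a hostname:domain pair like the two used above.

```shell
#!/bin/sh
# build_host_map NODE:DOMAIN [NODE:DOMAIN ...]
# Joins the pairs with ';', the separator used in pcmk_host_map above.
build_host_map() {
    map=""
    for pair in "$@"; do
        map="${map:+$map;}$pair"   # add ';' only between entries
    done
    echo "$map"
}

build_host_map "ha1.mo.com:ha1" "ha2.mo.com:ha2"
# prints ha1.mo.com:ha1;ha2.mo.com:ha2
```

The domain names on the right-hand side have to match what the hypervisor reports (for libvirt, the names shown by virsh list --all on the physical host).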
Now check whether the fence resource is running.

vmfence (stonith:fence_xvm):    Started ha2.mo.com

Now add an http service as a test.

crm(live)configure# primitive apache lsb:httpd op monitor interval=30s

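Check that it is running.
Building on the RHCS suite we studied a few days ago: the IP and http services must start in a particular order, so next we define that order.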

crm(live)configure# group website vip apache

This binds vip and apache into one group, in which vip starts first and the http service second. Now look at the service status.
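A group is shorthand for keeping its members on the same node and starting them in the listed order. For comparison only, the sketch below writes roughly equivalent explicit constraints to a temporary file. The constraint ids apache-with-vip and apache-after-vip are made-up names, and the fragment is not meant to be applied alongside the group.

```shell
#!/bin/sh
# Write explicit constraints that approximate "group website vip apache".
# The ids apache-with-vip and apache-after-vip are hypothetical.
fragment=$(mktemp)
cat > "$fragment" <<'EOF'
colocation apache-with-vip inf: apache vip
order apache-after-vip inf: vip apache
EOF

cat "$fragment"
# To try it on a test cluster without the group defined:
#   crm configure load update "$fragment"
```

The group form is shorter and is usually the easier one to maintain when every member simply follows the previous one.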

crm(live)resource# show
 vmfence	(stonith:fence_xvm):	Started 
 Resource Group: website
     vip	(ocf::heartbeat:IPaddr2):	Started 
     apache	(lsb:httpd):	Started

A basic service is now taking shape. Let's test whether fencing works by stopping the http service on ha1.

Failed actions:
    apache_monitor_30000 on ha1.mo.com 'not running' (7): call=27, status=complete, last-rc-change='Mon Feb 27 22:32:36 2017', queued=0ms, exec=0ms

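Watching the cluster from ha2, we can see that it noticed the http service on ha1 had stopped. It did not trigger fencing, though. It simply restarted the http service on ha1.
Now take down the network interface on ha1.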

2 Nodes configured, 2 expected votes
3 Resources configured


Node ha1.mo.com: UNCLEAN (offline)
Online: [ ha2.mo.com ]

 Resource Group: website
     vip        (ocf::heartbeat:IPaddr2):	Started ha1.mo.com
     apache     (lsb:httpd):    Started ha1.mo.com

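Something odd happens: the services are not moved and still appear on ha1. It turns out pacemaker has a quorum option that we have not set. With the default policy, a two-node cluster that is left with a single node has lost quorum and is treated as broken, so the survivor does not take over the resources. In real deployments this serves as a disaster-tolerance policy.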

crm(live)configure# property no-quorum-policy=ignore
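The reason a two-node cluster needs this setting follows from the usual majority rule: a partition has quorum only when it holds more than half of the expected votes. A small sketch of that arithmetic is shown below. The function name quorum_votes is hypothetical, and the rule is assumed to be a simple majority.

```shell
#!/bin/sh
# quorum_votes EXPECTED
# Prints the smallest number of votes that is more than half of EXPECTED.
quorum_votes() {
    echo $(( $1 / 2 + 1 ))
}

quorum_votes 2   # prints 2: with one node down, the survivor never has quorum
quorum_votes 3   # prints 2: a three-node cluster tolerates one failure
```

With expected-quorum-votes="2", as in the configuration shown earlier, a single surviving node holds one vote out of a required two, which is why no-quorum-policy=ignore is set for this test cluster.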

Enter this and test again. The services are currently on ha2, so now shut down the network interface on ha2.

Last change: Mon Feb 27 22:46:35 2017 via cibadmin on ha2.mo.com
Stack: classic openais (with plugin)
Current DC: ha1.mo.com - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
3 Resources configured


Online: [ ha1.mo.com ha2.mo.com ]

vmfence (stonith:fence_xvm):    Started ha1.mo.com
 Resource Group: website
     vip        (ocf::heartbeat:IPaddr2):	Started ha1.mo.com
     apache     (lsb:httpd):    Started ha1.mo.com

The services have moved to ha1, and ha2 was powered off.

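Now add the ldirectord service so that the cluster can manage LVS. The ldirectord configuration was covered in the previous post. Here the virtual IP is 172.25.3.100, and the two real servers that receive the load are 172.25.3.3 and 172.25.3.4.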

Now add ldirectord to the configuration.

crm(live)configure# primitive lvs lsb:ldirectord op  monitor interval=30s
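For reference, a matching ldirectord.cf might look roughly like the sketch below, written to a temporary file. Only the three IP addresses come from this post. Port 80, direct routing (gate), the rr scheduler, the timing values, and the index.html check are all assumptions, so compare them with the configuration from the previous post before using anything.

```shell
#!/bin/sh
# Sketch of an ldirectord.cf for the VIP and real servers named above.
# Port, forwarding method, scheduler and health check are assumed values.
cf=$(mktemp)
cat > "$cf" <<'EOF'
checktimeout=3
checkinterval=1
autoreload=yes
quiescent=no
virtual=172.25.3.100:80
        real=172.25.3.3:80 gate
        real=172.25.3.4:80 gate
        service=http
        scheduler=rr
        protocol=tcp
        checktype=negotiate
        checkport=80
        request="index.html"
EOF

cat "$cf"
```

On the cluster nodes the file would normally be installed as the ldirectord configuration (commonly /etc/ha.d/ldirectord.cf) on both ha1 and ha2, since either node may end up running the lvs resource.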

Next we will add a storage service for this website. Before that, here are a few commands for taking a node offline and bringing it back online.
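The commands themselves are not shown in the captured output, so the following is a sketch of the crm shell's node standby and node online subcommands, which produce the "standby" states seen in the status output. The run wrapper and the function names are hypothetical. The script only prints the commands unless APPLY=1 is set.

```shell
#!/bin/sh
# Dry-run by default: print each command. Set APPLY=1 to execute it.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Put a node into standby: it stays in the cluster but hosts no resources.
node_standby() { run crm node standby "$1"; }

# Bring a standby node back so it can host resources again.
node_online() { run crm node online "$1"; }

node_standby ha1.mo.com
node_online ha1.mo.com
```

Running it as APPLY=1 sh script.sh on a cluster node would actually change the node state, so keep the dry-run default while experimenting.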

Last updated: Tue Feb 28 22:35:00 2017
Last change: Tue Feb 28 22:34:04 2017 via cibadmin on ha1.mo.com
Stack: classic openais (with plugin)
Current DC: ha1.mo.com - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
3 Resources configured


Node ha1.mo.com: standby
Online: [ ha2.mo.com ]

vmfence (stonith:fence_xvm):    Started ha2.mo.com
 Resource Group: website
     vip        (ocf::heartbeat:IPaddr2):	Started ha2.mo.com
     apache     (lsb:httpd):    Started ha2.mo.com

The services are now running on ha2. Put ha2 into standby and see what happens.

Last updated: Tue Feb 28 22:37:21 2017
Last change: Tue Feb 28 22:37:21 2017 via crm_attribute	on ha2.mo.com
Stack: classic openais (with plugin)
Current DC: ha1.mo.com - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
3 Resources configured


Node ha1.mo.com: standby
Node ha2.mo.com: standby

Both nodes are now in standby. Bring ha1 back online.

Node ha2.mo.com: standby
Online: [ ha1.mo.com ]

vmfence (stonith:fence_xvm):    Started ha1.mo.com
 Resource Group: website
     vip        (ocf::heartbeat:IPaddr2):	Started ha1.mo.com
     apache     (lsb:httpd):    Started ha1.mo.com

ha1 takes over.

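If the configuration really has no errors but a service still will not start, try the command below. This happened to me when I started the cluster but forgot to start fence_virtd on the physical host, so vmfence could not start on the virtual machines. cleanup refreshes the state of the resource.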

crm(live)resource# cleanup vmfence

Cleaning up vmfence on ha1.mo.com
Cleaning up vmfence on ha2.mo.com
Waiting for 1 replies from the CRMd. OK

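Now look at what the individual resource scripts require. In the crm shell this kind of output typically comes from crm ra info lsb:httpd.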

start and stop Apache HTTP Server (lsb:httpd)

The Apache HTTP Server is an efficient and extensible  \
 	       server implementing the current HTTP standards.

Operations' defaults (advisory minimum):

    start         timeout=15
    stop          timeout=15
    status        timeout=15
    restart       timeout=15
    force-reload  timeout=15
    monitor       timeout=15 interval=15

That is the description of the apache script.

Next, add DRBD shared storage and a mysql service to the cluster.

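First add two 4 GB disks for ha1 and ha2. The detailed steps for building the DRBD RPM packages from the source tarball are covered in a separate post.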

[root@ha1 x86_64]# ls
drbd-8.4.2-2.el6.x86_64.rpm                  drbd-heartbeat-8.4.2-2.el6.x86_64.rpm                 drbd-pacemaker-8.4.2-2.el6.x86_64.rpm  drbd-xen-8.4.2-2.el6.x86_64.rpm
drbd-bash-completion-8.4.2-2.el6.x86_64.rpm  drbd-km-2.6.32_431.el6.x86_64-8.4.2-2.el6.x86_64.rpm  drbd-udev-8.4.2-2.el6.x86_64.rpm
drbd-debuginfo-8.4.2-2.el6.x86_64.rpm        drbd-km-debuginfo-8.4.2-2.el6.x86_64.rpm              drbd-utils-8.4.2-2.el6.x86_64.rpm

These are the resulting RPM packages. Next, download mysql and place its files on the DRBD shared storage.

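Create the DRBD metadata, start the service, and force one node to primary. Note that the backing storage for DRBD must not have been formatted beforehand. Otherwise forcing primary will not succeed no matter what you try (I have made this mistake twice). Mount the DRBD device on /var/lib/mysql, the mysql data directory, so that mysql's data lives on the DRBD device. Always stop mysql before switching the DRBD primary and secondary, and do not leave a mysql sock file on the DRBD storage.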
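The steps in the paragraph above can be sketched as the command sequence below. The resource name mo and the device /dev/drbd1 are taken from the crm commands later in this post. The mkfs.ext4 step is an assumption inferred from fstype=ext4, service drbd start assumes the el6 init script, and the run wrapper and drbd_bootstrap name are hypothetical. Nothing is executed unless APPLY=1 is set.

```shell
#!/bin/sh
# Dry-run by default: print each command. Set APPLY=1 to execute it.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

drbd_bootstrap() {
    # On both nodes: create the metadata and start DRBD
    run drbdadm create-md mo
    run service drbd start
    # On ONE node only: force it to primary to start the initial sync
    run drbdadm primary --force mo
    # On the primary only: create the filesystem (assumed ext4) and mount it
    run mkfs.ext4 /dev/drbd1
    run mount /dev/drbd1 /var/lib/mysql
}

drbd_bootstrap
```

Once the data is in place, unmount the device and stop mysql before handing DRBD over to the cluster.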

Now stop the drbd service and let the pacemaker cluster take over.

First add the drbd resource.

crm(live)configure# primitive drbddata ocf:linbit:drbd params drbd_resource=mo op monitor interval=120s

This time the agent is the OCF script from linbit, and drbd_resource must be defined.

Set up the DRBD master/slave resource.

crm(live)configure# ms drbdclone drbddata meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

Set up the mount for the DRBD device.

crm(live)configure# primitive sqlfs ocf:heartbeat:Filesystem params device=/dev/drbd1 directory=/var/lib/mysql fstype=ext4

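Colocate sqlfs with the DRBD master so the filesystem runs on the node where DRBD is primary. This also makes it easier to define the start order afterwards.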

crm(live)configure# colocation sqlfs-with-drbd inf: sqlfs drbdclone:Master

Start the filesystem only after DRBD has been promoted to primary.

crm(live)configure# order sqlfs-after-drbd inf: drbdclone:promote sqlfs:start

Now commit and see whether it takes effect. Warnings about timing values can be ignored for the time being.

crm(live)resource# show
 vmfence	(stonith:fence_xvm):	Started 
 Resource Group: website
     vip	(ocf::heartbeat:IPaddr2):	Started 
     apache	(lsb:httpd):	Started 
     sqlfs	(ocf::heartbeat:Filesystem):	Started 
 Master/Slave Set: drbdclone [drbddata]
     Masters: [ ha1.mo.com ]

The services are running normally.

Finally, add the mysql service to the configuration.

crm(live)configure# primitive mysql lsb:mysqld op monitor interval=60s

crm(live)configure# group mydb vip sqlfs mysql

Then delete the earlier website group and check whether the services are running normally.

crm(live)resource# show
 vmfence	(stonith:fence_xvm):	Started 
 Master/Slave Set: drbdclone [drbddata]
     Masters: [ ha2.mo.com ]
     Stopped: [ ha1.mo.com ]
 apache	(lsb:httpd):	Started 
 Resource Group: mydb
     vip	(ocf::heartbeat:IPaddr2):	Started 
     sqlfs	(ocf::heartbeat:Filesystem):	Started 
     mysql	(lsb:mysqld):	Started
