Part 1: Overview of the layers involved in cluster transaction decisions
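· Messaging Layer: the layer that carries heartbeat messages between nodes.
· ha_aware: cluster-aware decision software, that is, software that can use the underlying messaging layer by calling its API and make cluster transaction decisions itself.
· DC (Designated Coordinator): a coordinator elected by the nodes, so that the remaining nodes do not fight over resources after the primary node goes down.
· CRM (Cluster Resource Manager, the "chairman of the board"): makes the decisions. In a high-availability cluster no resource should start on its own; the CRM decides whether it starts.
· LRM (Local Resource Manager, the "general manager"): carries out the CRM's decisions. It is what actually manages local resources: starting them, stopping them, and monitoring their state.
· RA (Resource Agent): a tool, usually a script, that takes instructions from the CRM and manages one particular resource on a node. Every configured resource depends on a script or program. (It must accept four arguments, {start|stop|status|restart}; the output of status may only be running or stopped.)
failover: moving resources off a failed node onto a healthy one
failback: moving resources back once the failed node has recovered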
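The resource agent contract described above can be sketched as a small shell function. This is only an illustration under stated assumptions: the resource name "demo", the function name demo_ra and the state file are hypothetical, and a state file stands in for a real service so the sketch runs anywhere.

```shell
#!/bin/bash
# Hypothetical resource agent for a made-up "demo" resource.
# A state file stands in for a real service (assumption for illustration).
STATE_FILE="${STATE_FILE:-/tmp/demo-ra.state}"

demo_ra() {
    case "$1" in
        start)   touch "$STATE_FILE" ;;
        stop)    rm -f "$STATE_FILE" ;;
        status)
            # The only two answers the cluster expects from status.
            if [ -e "$STATE_FILE" ]; then
                echo "running"
            else
                echo "stopped"
            fi
            ;;
        restart) demo_ra stop && demo_ra start ;;
        *)
            echo "Usage: demo_ra {start|stop|status|restart}" >&2
            return 1
            ;;
    esac
}
```

A real agent would replace the touch and rm calls with the commands that start and stop the actual service.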
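Software used at each layer
· Messaging Layer:
heartbeat v1, v2, v3
corosync (OpenAIS): AIS (Application Interface Specification) is an open industry standard for high availability defined by the Service Availability Forum. OpenAIS was published as a sample implementation so that people could become familiar with its ideas, and corosync came out of that project.
cman (Red Hat): cluster manager.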
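· CRM (cluster resource manager): can be used independently as long as its interface is compatible with the messaging layer.
heartbeat v1: haresources, a configuration file; it is only a configuration interface.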
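heartbeat v2: crm (every node runs the crmd process; port 5560; the command-line client crmsh gives a poor user experience). Because it is hard to configure, the heartbeat-GUI interface was provided.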
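heartbeat v3: heartbeat + pacemaker + cluster-glue (the glue layer):
pacemaker: (split off into an independent project)
Configuration interfaces: third-party open configuration tools
CLI: command-line tools crm (SUSE) and pcs, developed in Python
GUI: hawk (web interface), LCMC (desktop interface), pacemaker-mgmt
cman + rgmanager:
resource group manager: provides the Failover Domain
Configuration interface:
RHCS: RedHat Cluster Suite
Configuration interface: Conga (a full-lifecycle configuration interface): installation, configuration, resources, startup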
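· RA types:
heartbeat legacy: heartbeat's traditional type
LSB: scripts under /etc/rc.d/init.d/*
OCF: Open Cluster Framework
provider: the organization that supplies the resource agent scripts, for example pacemaker
linbit: resource agents supplied by linbit
STONITH: node fencing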
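· keepalived: lightweight, and different in style from the tools above. It uses the VRRP protocol to move IP address resources between nodes and uses its own built-in script-calling interface to provide high availability. VRRP: Virtual Router Redundancy Protocol.
Typical scenarios:
keepalived + ipvs
keepalived + haproxy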
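· High-availability cluster solutions on RHEL or CentOS:
RHEL 5:
Bundled: RHCS (cman + rgmanager)
Third-party options: corosync + pacemaker, heartbeat (v1 or v2), keepalived
RHEL 6:
Bundled: RHCS (cman + rgmanager)
corosync + rgmanager
cman + pacemaker
heartbeat v3 + pacemaker
keepalived
How to choose:
High availability for front-end load balancers: keepalived
Large-scale high-availability clusters: corosync or (cman + pacemaker), which support up to 100 nodes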
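When shared storage is used, file system metadata is cached in memory for speed. If two cluster nodes write to the data at the same time, the file system can be corrupted.
Ways to implement resource fencing:
Hardware may be needed: 1. a hardware management chip (requires an authentication mechanism); 2. a power switch that cuts the node's power; 3. ssh.
A two-node cluster is a special case: when the link between the nodes breaks, neither side can reach a decision on its own. A ping node can be used as a tie-breaker in that situation.
Quorum devices:
1. ping node, ping group
2. qdisk (Red Hat): a disk chosen as the quorum device
When the standby node takes over, the other node must be prevented from touching the resources and corrupting the file system, so the other node has to be taken down completely.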
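The ping node tie-breaker for a two-node cluster can be sketched as follows. This is a simplified illustration, not how heartbeat implements it: the function name may_take_over and the PING_CMD variable are hypothetical, and the default command assumes the Linux iputils ping (one echo request, one-second timeout).

```shell
#!/bin/bash
# Decide whether this node may take over when its peer has gone silent.
# PING_CMD is injectable so the logic can be exercised without a network.
PING_CMD="${PING_CMD:-ping -c 1 -W 1}"

may_take_over() {
    local ping_node="$1"
    if $PING_CMD "$ping_node" >/dev/null 2>&1; then
        # We still reach the ping node, so the peer is likely the isolated side.
        echo "takeover"
    else
        # We are probably the isolated side: do not grab resources.
        echo "standby"
    fi
}
```

For example, `may_take_over 172.16.0.1` would print takeover only while the gateway still answers; a ping group extends the same idea to several addresses.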
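Part 2: Configuration and deployment
1. Make sure both nodes have correct host names and name resolution, and that their clocks agree. An NTP server can keep the time in sync; install one yourself. Apply the same configuration on both nodes.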
[root@node1 ~]# ntpdate 172.16.0.1
17 Apr 22:12:41 ntpdate[2079]: adjust time server 172.16.0.1 offset -0.001233 sec
[root@node1 ~]# cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=node1.nyist.com
[root@node1 ~]# uname -n
node1.nyist.com
[root@node1 ~]# vim /etc/hosts
172.16.20.31 node1.nyist.com node1
172.16.20.32 node2.nyist.com node2
[root@node1 ~]# ping node1
PING node1.syist.com (172.16.20.31) 56(84) bytes of data.
64 bytes from node1.syist.com (172.16.20.31): icmp_seq=1 ttl=64 time=0.052 ms
64 bytes from node1.syist.com (172.16.20.31): icmp_seq=2 ttl=64 time=0.0
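Because heartbeat identifies nodes by the name that uname -n prints, it can help to confirm that this name is present in /etc/hosts on both nodes. The helper below is a minimal sketch; the function name lookup_host is hypothetical and it only reads a hosts-format file, it does not query DNS.

```shell
#!/bin/bash
# Print the first IP address that a hosts-format file maps to a given name.
# Usage: lookup_host NAME [HOSTS_FILE]
lookup_host() {
    local name="$1" hosts_file="${2:-/etc/hosts}"
    awk -v n="$name" '
        /^[[:space:]]*#/ { next }   # skip comment lines
        { for (i = 2; i <= NF; i++) if ($i == n) { print $1; exit } }
    ' "$hosts_file"
}
```

On a node, `lookup_host "$(uname -n)"` printing nothing would suggest that the local host name is missing from /etc/hosts.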
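2. Set up SSH key-based authentication between the two nodes, so that each node can manage the other with ssh commands. This only simplifies operation; if you do not mind typing the password every time, you can skip this step. (Do the following on both nodes.)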
[root@node1 ~]# ssh-keygen -t rsa -P ''
[root@node1 ~]# ls .ssh/
authorized_keys id_rsa id_rsa.pub
[root@node1 ~]# ssh-copy-id -i .ssh/id_rsa.pub [email protected]
[root@node1 ~]# ssh node2
Last login: Thu Apr 17 22:24:51 2014 from 172.16.20.55
[root@node2 ~]#
3. Now install the packages (on both nodes).
[root@node1 heartbeat2]# yum install perl-TimeDate PyXML libnet net-snmp-libs
First resolve the dependencies with yum, then install heartbeat with rpm. Note: do not install heartbeat itself with yum, because version differences would make yum replace some of the packages that heartbeat depends on.
[root@node1 heartbeat2]# rpm -ivh heartbeat-2.1.4-12.el6.x86_64.rpm heartbeat-pils-2.1.4-12.el6.x86_64.rpm heartbeat-stonith-2.1.4-12.el6.x86_64.rpm
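4. Now configure it. Installing heartbeat does not generate any configuration files, but it ships sample files that we need to copy into place. Note that authkeys must have mode 600, because it is security-sensitive.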
[root@node1 ha.d]# cp /usr/share/doc/heartbeat-2.1.4/authkeys /etc/ha.d
[root@node1 ha.d]# cp /usr/share/doc/heartbeat-2.1.4/ha.cf /etc/ha.d
[root@node1 ha.d]# cp /usr/share/doc/heartbeat-2.1.4/haresources /etc/ha.d
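Since authkeys holds the shared secret, one option is to generate it with a random key and restrictive permissions in a single step. The sketch below is an assumption layered on the procedure here, not part of it: the function name write_authkeys is hypothetical, and it reads 32 random bytes from /dev/urandom to build a sha1 key.

```shell
#!/bin/bash
# Write an authkeys file that uses sha1 with a random 64-hex-digit key.
# Usage: write_authkeys /etc/ha.d/authkeys
write_authkeys() {
    local dest="$1" key
    key="$(head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n')"
    # Create the file with a tight umask so it is never world-readable.
    ( umask 077; printf 'auth 1\n1 sha1 %s\n' "$key" > "$dest" )
    chmod 600 "$dest"
}
```

The same file must then be copied to every node, because all nodes have to share one key.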
5. Edit the configuration files to suit your requirements.
[root@node1 ha.d]# vim authkeys
auth 2
#1 crc
2 sha1 redhat
#3 md5 Hello!
[root@node1 ha.d]# chmod 600 authkeys
[root@node1 ha.d]# vim ha.cf
(ha.cf is the main configuration file; the lines beginning with "# note:" are annotations.)
# note: enable while debugging
#debugfile /var/log/ha-debug
# File to write other messages to
#
# note: log file location
logfile /var/log/ha-log
# note: syslog facility; local7 can also be used
#logfacility local0
# note: interval between heartbeats
keepalive 1500ms
#
# deadtime: how long-to-declare-host-dead?
# note: time after which a silent node is declared dead
deadtime 6
# note: time after which a late heartbeat triggers a warning
warntime 3
# serial serialportname ...
# note: used when the heartbeat runs over a serial cable
#serial /dev/ttyS0 # Linux
#serial /dev/cuaa0 # FreeBSD
#serial /dev/cuad0 # FreeBSD 6.x
#serial /dev/cua/a # Solaris
# note: send heartbeats by multicast
mcast eth0 225.0.100.1 694 1 0
# note: whether resources fail back automatically
auto_failback on
# note: STONITH device; enable it once you have one
#stonith baytech /etc/ha.d/conf/stonith.baytech
# note: one node line for each cluster member
node node1.nyist.com
node node2.nyist.com
# note: debug level
#debug 1
# note: compression
#compression bz2
#compression_threshold 2
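The three timers in ha.cf only make sense when keepalive is shorter than warntime and warntime is shorter than deadtime. The check below is a small sketch of that common-sense ordering; it is not a rule taken from the heartbeat documentation, and the function names to_ms and check_timers are hypothetical. It assumes that a bare number means seconds and that an ms suffix means milliseconds, as in the keepalive line above.

```shell
#!/bin/bash
# Convert a timer value to milliseconds: "6" means seconds, "1500ms" means ms.
to_ms() {
    case "$1" in
        *ms) echo "${1%ms}" ;;
        *)   echo $(( $1 * 1000 )) ;;
    esac
}

# Usage: check_timers KEEPALIVE WARNTIME DEADTIME
check_timers() {
    local keepalive warntime deadtime
    keepalive="$(to_ms "$1")"
    warntime="$(to_ms "$2")"
    deadtime="$(to_ms "$3")"
    if [ "$keepalive" -lt "$warntime" ] && [ "$warntime" -lt "$deadtime" ]; then
        echo "ok"
    else
        echo "bad"
    fi
}
```

With the values used here, `check_timers 1500ms 3 6` compares 1500, 3000 and 6000 milliseconds.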
Give each node a simple test page, then verify that httpd serves it correctly. Remember that the service must not be set to start at boot.
[root@node1 ha.d]# curl node1
<h1>172.16.20.31@@@node1</h1>
[root@node1 ha.d]# curl node2
curl: (7) couldn't connect to host
[root@node1 ha.d]# curl node2
<h1>172.16.20.32@@@node2</h1>
[root@node1 ha.d]# chkconfig httpd off
[root@node1 ha.d]# ssh node2 'chkconfig httpd off'
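6. Define the resources in the resource manager so that the cluster can start providing service.
[root@node1 ha.d]# vim haresources
node1.nyist.com 172.16.20.100/16/eth0 httpd
This defines the virtual IP 172.16.20.100.
(Resources defined: 1. the IP address, 2. the httpd service)
IP: heartbeat automatically looks in /etc/ha.d/resource.d for the script that configures the IP address.
The httpd service is looked up in the same directory and also in /etc/rc.d/init.d/.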
[root@node1 ha.d]# ls /etc/ha.d/resource.d/
apache IPaddr OCF
AudibleAlarm IPaddr2 portblock
db2 IPsrcaddr Raid1
Delay IPv6addr SendArp
Filesystem LinuxSCSI ServeRAID
hto-mapfuncs LVM WAS
ICP LVSSyncDaemonSwap WinPopup
ids MailTo Xinetd
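The lookup described above (resource.d first, then the init script directory) can be sketched as a small search function. The function name find_resource_script is hypothetical, and the search order is an assumption based on the two directories named in this step.

```shell
#!/bin/bash
# Print the path of the first executable script called NAME.
# Usage: find_resource_script NAME [DIR...]
find_resource_script() {
    local name="$1" dir
    shift
    # Default search path (assumed order): heartbeat scripts, then init scripts.
    [ "$#" -gt 0 ] || set -- /etc/ha.d/resource.d /etc/rc.d/init.d
    for dir in "$@"; do
        if [ -x "$dir/$name" ]; then
            echo "$dir/$name"
            return 0
        fi
    done
    return 1
}
```

On a node configured as above, `find_resource_script IPaddr` would be expected to print a path under /etc/ha.d/resource.d, while `find_resource_script httpd` would fall through to the init script.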
Copy the files to node2:
[root@node1 ha.d]# scp ha.cf authkeys haresources root@node2:/etc/ha.d
7. Start the service
[root@node1 ~]# service heartbeat start
logd is already running
Starting High-Availability services:
2014/04/17_23:46:35 INFO: Resource is stopped
2014/04/17_23:46:35 INFO: Resource is stopped
Done.
[root@node1 ~]# ssh node2 'service heartbeat start'
[root@node1 ha.d]# ifconfig
eth0 Link encap:Ethernet HWaddr 00:0C:29:02:06:8E
     inet addr:172.16.20.31 Bcast:172.16.255.255 Mask:255.255.0.0
     inet6 addr: fe80::20c:29ff:fe02:68e/64 Scope:Link
     UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
     RX packets:33997 errors:0 dropped:0 overruns:0 frame:0
     TX packets:6394 errors:0 dropped:0 overruns:0 carrier:0
     collisions:0 txqueuelen:1000
     RX bytes:24361024 (23.2 MiB) TX bytes:777306 (759.0 KiB)
eth0:0 Link encap:Ethernet HWaddr 00:0C:29:02:06:8E
     inet addr:172.16.20.100 Bcast:172.16.255.255 Mask:255.255.0.0
     UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
lo Link encap:Local Loopback
     inet addr:127.0.0.1 Mask:255.0.0.0
     inet6 addr: ::1/128 Scope:Host
     UP LOOPBACK RUNNING MTU:16436 Metric:1
     RX packets:78 errors:0 dropped:0 overruns:0 frame:0
     TX packets:78 errors:0 dropped:0 overruns:0 carrier:0
     collisions:0 txqueuelen:0
     RX bytes:9562 (9.3 KiB) TX bytes:9562 (9.3 KiB)
8. The logs show that node1 waited for node2 and, once the timeout expired, declared it dead and acquired the resources itself.
[root@node1 ~]# tail /var/log/ha-log -f
heartbeat[2600]: 2014/04/18_18:41:12 info: **************************
heartbeat[2600]: 2014/04/18_18:41:12 info: Configuration validated. Starting heartbeat 2.1.4
heartbeat[2601]: 2014/04/18_18:41:12 info: heartbeat: version 2.1.4
heartbeat[2601]: 2014/04/18_18:41:12 info: Heartbeat generation: 1397817331
heartbeat[2601]: 2014/04/18_18:41:12 info: glib: UDP multicast heartbeat started for group 228.15.100.1 port 694 interface eth0 (ttl=1 loop=0)
heartbeat[2601]: 2014/04/18_18:41:12 info: glib: ping heartbeat started.
heartbeat[2601]: 2014/04/18_18:41:12 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[2601]: 2014/04/18_18:41:12 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[2601]: 2014/04/18_18:41:12 info: G_main_add_SignalHandler: Added signal handler for signal 17
heartbeat[2601]: 2014/04/18_18:41:12 info: Local status now set to: 'up'
heartbeat[2601]: 2014/04/18_18:41:13 info: Link 172.16.0.1:172.16.0.1 up.
heartbeat[2601]: 2014/04/18_18:41:13 info: Status update for node 172.16.0.1: status ping
heartbeat[2601]: 2014/04/18_18:41:31 WARN: node node2.nyist.com: is dead
heartbeat[2601]: 2014/04/18_18:41:31 info: Comm_now_up(): updating status to active
heartbeat[2601]: 2014/04/18_18:41:31 info: Local status now set to: 'active'
heartbeat[2601]: 2014/04/18_18:41:31 WARN: No STONITH device configured.
heartbeat[2601]: 2014/04/18_18:41:31 WARN: Shared disks are not protected.
heartbeat[2601]: 2014/04/18_18:41:31 info: Resources being acquired from node2.nyist.com.
harc[2613]:2014/04/18_18:41:31 info: Running /etc/ha.d/rc.d/status status
mach_down[2644]:2014/04/18_18:41:31 info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
mach_down[2644]:2014/04/18_18:41:31 info: mach_down takeover complete for node node2.nyist.com.
heartbeat[2601]: 2014/04/18_18:41:31 info: mach_down takeover complete.
heartbeat[2601]: 2014/04/18_18:41:31 info: Initial resource acquisition complete (mach_down)
IPaddr[2687]:2014/04/18_18:41:31 INFO: Resource is stopped
heartbeat[2614]: 2014/04/18_18:41:31 info: Local Resource acquisition completed.
harc[2747]:2014/04/18_18:41:31 info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
ip-request-resp[2747]:2014/04/18_18:41:31 received ip-request-resp 172.16.20.100/16/eth0 OK yes
ResourceManager[2766]:2014/04/18_18:41:31 info: Acquiring resource group: node1.nyist.com 172.16.20.100/16/eth0 httpd
IPaddr[2792]:2014/04/18_18:41:31 INFO: Resource is stopped
ResourceManager[2766]:2014/04/18_18:41:31 info: Running /etc/ha.d/resource.d/IPaddr 172.16.20.100/16/eth0 start
IPaddr[2889]:2014/04/18_18:41:31 INFO: Using calculated netmask for 172.16.20.100: 255.255.0.0
IPaddr[2889]:2014/04/18_18:41:31 INFO: eval ifconfig eth0:0 172.16.20.100 netmask 255.255.0.0 broadcast 172.16.255.255
IPaddr[2860]:2014/04/18_18:41:31 INFO: Success
ResourceManager[2766]:2014/04/18_18:41:31 info: Running /etc/init.d/httpd start
heartbeat[2601]: 2014/04/18_18:41:41 info: Local Resource acquisition completed. (none)
heartbeat[2601]: 2014/04/18_18:41:41 info: local resource transition completed.
heartbeat[2601]: 2014/04/18_18:43:42 info: Link node2.nyist.com:eth0 up.
heartbeat[2601]: 2014/04/18_18:43:42 info: Status update for node node2.nyist.com: status init
heartbeat[2601]: 2014/04/18_18:43:42 info: Status update for node node2.nyist.com: status up
harc[3034]:2014/04/18_18:43:42 info: Running /etc/ha.d/rc.d/status status
harc[3050]:2014/04/18_18:43:42 info: Running /etc/ha.d/rc.d/status status
heartbeat[2601]: 2014/04/18_18:43:43 info: Status update for node node2.nyist.com: status active
harc[3065]:2014/04/18_18:43:43 info: Running /etc/ha.d/rc.d/status status
heartbeat[2601]: 2014/04/18_18:43:44 info: remote resource transition completed.
heartbeat[2601]: 2014/04/18_18:43:44 info: node1.nyist.com wants to go standby [foreign]
heartbeat[2601]: 2014/04/18_18:43:44 info: standby: node2.nyist.com can take our foreign resources
heartbeat[3080]: 2014/04/18_18:43:44 info: give up foreign HA resources (standby).
heartbeat[3080]: 2014/04/18_18:43:45 info: foreign HA resource release completed (standby).
heartbeat[2601]: 2014/04/18_18:43:45 info: Local standby process completed [foreign].
heartbeat[2601]: 2014/04/18_18:43:45 WARN: 1 lost packet(s) for [node2.nyist.com] [11:13]
heartbeat[2601]: 2014/04/18_18:43:45 info: remote resource transition completed.
heartbeat[2601]: 2014/04/18_18:43:45 info: No pkts missing from node2.nyist.com!
heartbeat[2601]: 2014/04/18_18:43:45 info: Other node completed standby takeover of foreign resources. Ourselves
[root@node2 ha.d]# tail /var/log/ha-log
heartbeat[17427]: 2014/04/18_18:43:41 info: **************************
heartbeat[17427]: 2014/04/18_18:43:41 info: Configuration validated. Starting heartbeat 2.1.4
heartbeat[17428]: 2014/04/18_18:43:41 info: heartbeat: version 2.1.4
heartbeat[17428]: 2014/04/18_18:43:41 info: Heartbeat generation: 1397749464
heartbeat[17428]: 2014/04/18_18:43:41 info: glib: UDP multicast heartbeat started for group 228.15.100.1 port 694 interface eth0 (ttl=1 loop=0)
heartbeat[17428]: 2014/04/18_18:43:41 info: glib: ping heartbeat started.
heartbeat[17428]: 2014/04/18_18:43:41 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[17428]: 2014/04/18_18:43:41 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[17428]: 2014/04/18_18:43:41 info: G_main_add_SignalHandler: Added signal handler for signal 17
heartbeat[17428]: 2014/04/18_18:43:42 info: Local status now set to: 'up'
heartbeat[17428]: 2014/04/18_18:43:42 info: Link 172.16.0.1:172.16.0.1 up.
heartbeat[17428]: 2014/04/18_18:43:42 info: Status update for node 172.16.0.1: status ping
heartbeat[17428]: 2014/04/18_18:43:43 info: Link node1.nyist.com:eth0 up.
heartbeat[17428]: 2014/04/18_18:43:43 info: Status update for node node1.nyist.com: status active
harc[17440]:2014/04/18_18:43:43 info: Running /etc/ha.d/rc.d/status status
heartbeat[17428]: 2014/04/18_18:43:43 info: Comm_now_up(): updating status to active
heartbeat[17428]: 2014/04/18_18:43:43 info: Local status now set to: 'active'
heartbeat[17428]: 2014/04/18_18:43:44 info: remote resource transition completed.
heartbeat[17428]: 2014/04/18_18:43:44 info: remote resource transition completed.
heartbeat[17428]: 2014/04/18_18:43:44 info: Local Resource acquisition completed. (none)
heartbeat[17428]: 2014/04/18_18:43:44 info: node1.nyist.com wants to go standby [foreign]
heartbeat[17428]: 2014/04/18_18:43:45 info: standby: acquire [foreign] resources from node1.nyist.com
heartbeat[17458]: 2014/04/18_18:43:45 info: acquire local HA resources (standby).
heartbeat[17458]: 2014/04/18_18:43:45 info: local HA resource acquisition completed (standby).
heartbeat[17428]: 2014/04/18_18:43:45 info: Standby resource acquisition done [foreign].
heartbeat[17428]: 2014/04/18_18:43:45 info: Initial resource acquisition complete (auto_failback)
heartbeat[17428]: 2014/04/18_18:43:45 info: remote resource transition completed.
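9. Verify the failover behaviour in a browser.
When node1 is shut down, the resources move to node2. When node1 is brought back up, they move back to node1, because node1 is the preferred node for these resources and auto_failback is on.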
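Instead of reloading the browser by hand, the virtual IP can be polled from a third machine while a node is stopped. This is only a sketch: the function name watch_vip and the FETCH_CMD variable are hypothetical, and the default fetch command assumes curl is installed.

```shell
#!/bin/bash
# Poll a URL a fixed number of times and print what each request returned.
# FETCH_CMD is injectable so the loop can be tried without a web server.
FETCH_CMD="${FETCH_CMD:-curl -s --max-time 2}"

# Usage: watch_vip URL [COUNT] [INTERVAL_SECONDS]
watch_vip() {
    local url="$1" count="${2:-10}" interval="${3:-1}" i=1 body
    while [ "$i" -le "$count" ]; do
        if body="$($FETCH_CMD "$url")"; then
            echo "$i: $body"
        else
            echo "$i: no answer"
        fi
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}
```

Running `watch_vip http://172.16.20.100/ 30 1` during a failover should show the page of one node, possibly a few lines of "no answer", and then the page of the other node.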
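Using NFS as shared storage for the web service (assuming it must not be mounted by both nodes at the same time)
Set up the server that provides NFS: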
[root@My2 ~]# mkdir /www/htdoc -p
[root@My2 ~]# setfacl -m u:apache:rwx /www/htdoc/
[root@My2 ~]# vim /etc/exports
/www/htdoc 172.16.0.0/16(rw)
[root@My2 ~]# vim /www/htdoc/index.html
<h1>from NFS server</h1>
[root@My2 ~]# service nfs start
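One detail in /etc/exports is easy to get wrong: a space between the client and its option list changes the meaning, because the options then apply to all hosts rather than to the named client. The check below is a minimal sketch that flags such lines; the function name find_detached_options is hypothetical and the pattern is deliberately simple.

```shell
#!/bin/bash
# Print (with line numbers) every line in an exports-format file where an
# option list "(...)" is separated from the preceding word by whitespace.
# Returns non-zero when no such line is found.
find_detached_options() {
    grep -nE '[^[:space:]]+[[:space:]]+\(' "$1"
}
```

Running `find_detached_options /etc/exports` on the NFS server prints nothing when every option list is attached to its client.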
Edit the resources on each node:
[root@node1 ha.d]# vim haresources
node1.nyist.com 172.16.20.100/16/eth0 Filesystem::172.16.20.32:/www/htdoc::/var/www/html::nfs httpd
This defines three resources, which are started in the order listed:
172.16.20.100/16/eth0
172.16.20.32:/www/htdoc
httpd
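In a haresources entry the script name and its arguments are joined with "::", and a bare IP address is shorthand for the IPaddr script. The sketch below shows how such an entry maps to a command line. It is a simplification for illustration: the function name resource_to_command is hypothetical, and it always prefixes /etc/ha.d/resource.d, whereas a service such as httpd is actually run from its init script.

```shell
#!/bin/bash
# Turn one haresources resource entry into the command line that runs it.
# Usage: resource_to_command SPEC ACTION
resource_to_command() {
    local spec="$1" action="$2" script args
    case "$spec" in
        [0-9]*) spec="IPaddr::$spec" ;;   # bare IP address: IPaddr shorthand
    esac
    script="${spec%%::*}"
    if [ "$script" = "$spec" ]; then
        args=""
    else
        # Drop the script name, then turn the remaining "::" into spaces.
        args="$(printf '%s' "${spec#*::}" | sed 's/::/ /g')"
    fi
    echo "/etc/ha.d/resource.d/$script${args:+ $args} $action"
}
```

For the Filesystem entry above, the result has the same shape as the "Running /etc/ha.d/resource.d/Filesystem ..." lines that heartbeat writes to its log.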
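Restart the heartbeat service to see the result. In testing, after one node is stopped, not only is the VIP released but the mounted NFS share is also unmounted automatically, so the stopped node no longer holds on to the data resource.
Log messages during failover: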
IPaddr[22187]:2014/04/18_22:10:40 INFO: Resource is stopped
ResourceManager[22161]:2014/04/18_22:10:40 info: Running /etc/ha.d/resource.d/IPaddr 172.16.20.100/16/eth0 start
IPaddr[22284]:2014/04/18_22:10:40 INFO: Using calculated netmask for 172.16.20.100: 255.255.0.0
IPaddr[22284]:2014/04/18_22:10:40 INFO: eval ifconfig eth0:0 172.16.20.100 netmask 255.255.0.0 broadcast 172.16.255.255
IPaddr[22255]:2014/04/18_22:10:40 INFO: Success
Filesystem[22385]:2014/04/18_22:10:40 INFO: Resource is stopped
ResourceManager[22161]:2014/04/18_22:10:40 info: Running /etc/ha.d/resource.d/Filesystem 172.16.20.62:/www/htdoc /var/www/html nfs start
Filesystem[22463]:2014/04/18_22:10:41 INFO: Running start for 172.16.20.62:/www/htdoc on /var/www/html
Filesystem[22452]:2014/04/18_22:10:41 INFO: Success
ResourceManager[22161]:2014/04/18_22:10:41 info: Running /etc/init.d/httpd start
mach_down[22136]:2014/04/18_22:10:41 info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
mach_down[22136]:2014/04/18_22:10:41 info: mach_down takeover complete for node node1.nyist.com.
heartbeat[22097]: 2014/04/18_22:10:41 info: mach_down takeover complete.
heartbeat[22097]: 2014/04/18_22:10:41 info: Initial resource acquisition complete (mach_down)
heartbeat[22097]: 2014/04/18_22:10:51 info: Local Resource acquisition completed. (none)
heartbeat[22097]: 2014/04/18_22:10:51 info: local resource transition completed.
Test page