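1. Prerequisites for configuring an RHCS cluster:
Time synchronization
Name resolution; here it is done by editing the /etc/hosts file
A working yum repository; the CentOS 6 defaults are fine
The firewall turned off (or the ports the cluster communicates on opened), and SELinux turned off
The NetworkManager service turned off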
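2. The main packages RHCS needs are cman and rgmanager
cman: the cluster's base messaging layer; on CentOS 6 it depends on corosync
rgmanager: the cluster resource manager, similar in function to pacemaker
luci: provides the web interface for managing an RHCS cluster; luci manages the cluster mainly by communicating with ricci.
ricci: the agent installed on each cluster node that receives management requests from luci.
The relationship between luci and ricci is much like that between ambari-server and ambari-agent.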
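3. Environment:
luci : 192.168.6.31 cent1.test.com
ricci: 192.168.6.32 cent2.test.com
ricci: 192.168.6.33 cent3.test.com
ricci: 192.168.6.34 cent4.test.com
The hostnames are already set, but the rest, such as time synchronization and /etc/hosts, has not been done yet, so for convenience I wrote a playbook to handle the initialization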
---
- hosts: hdpservers
  remote_user: root
  vars:
  tasks:
    # Sync the clock against 192.168.6.31 every five minutes
    - name: add synctime cron
      cron: name='sync time' minute='*/5' job='/usr/sbin/ntpdate 192.168.6.31'
    # Stop iptables and NetworkManager and disable them at boot
    - name: shutdown iptables
      service: name={{ item.name }} state={{ item.state }} enabled={{ item.enabled }}
      with_items:
        - { name: iptables, state: stopped, enabled: no }
        - { name: NetworkManager, state: stopped, enabled: no }
      tags: stop service
    # Push the control host's SELinux config and hosts file to every node
    - name: copy selinux conf file
      copy: src={{ item.src }} dest={{ item.dest }} owner={{ item.owner }} group={{ item.group }} mode={{ item.mode }}
      with_items:
        - { src: '/etc/selinux/config', dest: /etc/selinux/config, owner: root, group: root, mode: '0644' }
        - { src: '/etc/hosts', dest: /etc/hosts, owner: root, group: root, mode: '0644' }
    # Switch SELinux off for the running system as well
    - name: cmd off selinux
      shell: setenforce 0
Run the playbook to perform the initialization
[root@cent1 yaml]# ansible-playbook base.yml
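Since name resolution here depends entirely on /etc/hosts, it can be worth confirming that every node name actually has an entry once the playbook has run. The following is a minimal sketch; the function name `missing_hosts` is hypothetical and not part of RHCS or Ansible.

```shell
# Print every name from the argument list that has no entry in the hosts file.
# Usage: missing_hosts FILE NAME...
missing_hosts() {
  local hosts_file=$1 name
  shift
  for name in "$@"; do
    # A name counts as present when it appears as a whole field
    # (column 2 or later) on a line that is not a comment.
    awk -v n="$name" '
      /^[[:space:]]*#/ { next }
      { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
      END { exit !found }
    ' "$hosts_file" || echo "$name"
  done
}
```

For the layout above it could be called as `missing_hosts /etc/hosts cent1.test.com cent2.test.com cent3.test.com cent4.test.com`; no output means every name was found.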
4. Install luci on cent1. luci is a Python program and depends on many Python packages
[root@cent1 ~]# yum install luci
Start luci
[root@cent3 ~]# /etc/init.d/luci start
Adding following auto-detected host IDs (IP addresses/domain names), corresponding to `cent3' address, to the configuration of self-managed certificate `/var/lib/luci/etc/cacert.config' (you can change them by editing `/var/lib/luci/etc/cacert.config', removing the generated certificate `/var/lib/luci/certs/host.pem' and restarting luci):
        (none suitable found, you can still do it manually as mentioned above)
Generating a 2048 bit RSA private key
writing new private key to '/var/lib/luci/certs/host.pem'
Starting saslauthd:                                        [  OK  ]
Start luci...                                              [  OK  ]
Point your web browser to https://cent1.hfln.com:8084 (or equivalent) to access luci
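You can now log in to luci from a browser; note that the URL is https
The username and password are those of an account on this host
Login succeeded. The next step is to configure the RHCS cluster; luci is only the management tool, and the actual cluster has not been installed yet.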
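5. Install ricci on cent2, cent3, and cent4. ricci also depends on many packages, so Ansible is used here to install it on all three nodes at once; passwordless SSH from cent1 to the other nodes is already set up
[root@cent1 ~]# ansible rhcs -m yum -a "name=ricci"
After ricci is installed, a password must be set for the ricci user on each node. The ricci user is the account the ricci process runs as, and this password will be needed shortly. The approach here is quick and crude; the password can also be set with the ccs command
[root@cent1 ~]# ansible rhcs -m shell -a "echo '123456' | passwd --stdin ricci"
Start ricci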
[root@cent1 ~]# ansible rhcs -m service -a "name=ricci state=started enabled=yes"
[root@cent2 ~]# ss -tunlp | grep ricci
tcp    LISTEN     0      5      :::11111      :::*      users:(("ricci",3237,3))
ricci listens on port 11111. Steps like these can of course also be written into a playbook
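The listening check above can be turned into a small reusable test. This is only a sketch: `has_listener` is a hypothetical helper name, and it assumes the column layout of `ss -tunlp` shown above, where the state is in the second column and the local address in the fifth.

```shell
# Succeed if the ss output on stdin contains a LISTEN socket on the given port.
# Usage: ss -tunlp | has_listener PORT
has_listener() {
  awk -v port="$1" '
    # Column 2 is the socket state, column 5 the local address ending in :PORT
    $2 == "LISTEN" && $5 ~ (":" port "$") { found = 1 }
    END { exit !found }
  '
}
```

On a node it might be used as `ss -tunlp | has_listener 11111 && echo "ricci port is open"`.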
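6. The cluster can now be configured in the web interface. Creating, adding, or removing a cluster and managing nodes, resources, fence devices, service groups, failover domains, and so on: the whole life cycle of a cluster can be handled here.
This walkthrough sets up a highly available web service
Manage Clusters --> Create creates a cluster
The form is fairly simple;
After Create Cluster is clicked, luci starts installing the cluster software.
On any node you can see the ricci worker processes: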
[root@cent2 ~]# ps aux | grep ricci
ricci     3453  0.1  0.4 213664  4400 ?        S<s  17:18   0:00 ricci -u ricci
ricci     3489  0.0  0.1  54912  1908 ?        S<s  17:22   0:00 /usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1500004777
root      3490  0.2  0.5  48552  5136 ?        S    17:22   0:00 ricci-modrpm
root      3567  0.0  0.0 103252   880 pts/0    S+   17:24   0:00 grep ricci
The /var/lib/ricci/queue/ directory holds the task files that luci sends to ricci; they are in XML format
[root@cent2 ~]# file /var/lib/ricci/queue/1500004777
/var/lib/ricci/queue/1500004777: XML document text
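7. Installation succeeded
Click any node to look at its details
If the services listed at the bottom of the page are not running, try starting them by hand; that usually works.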
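8. Add resources
There is no fence device in this setup, so fencing is ignored here. Add two global resources and one service, then start the service
Resources --> Add : add a resource
Add a virtual IP. The netmask must be entered as shown above, as a prefix length (24 here); it must not be written in dotted form such as 255.255.255.0, otherwise the IP cannot be added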
rgmanager Starting stopped service service:web1
rgmanager start on ip "192.168.6.100/255.255.255.0" returned 1 (generic error)
rgmanager #68: Failed to start service:web1; return value: 1
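The failure above comes from the dotted netmask, so it can help to convert a dotted mask into a prefix length before filling in the form. The helper below is a sketch with a hypothetical name, `mask_to_prefix`; it is plain bash and does not call any RHCS tool.

```shell
# Convert a dotted IPv4 netmask into a prefix length, e.g. 255.255.255.0 -> 24.
# Returns non-zero for anything that is not a valid contiguous netmask.
mask_to_prefix() {
  local mask=$1 prefix=0 count=0 ones_ended=0 octet bits
  local IFS=.
  for octet in $mask; do
    count=$((count + 1))
    case $octet in
      255) bits=8 ;;
      254) bits=7 ;;
      252) bits=6 ;;
      248) bits=5 ;;
      240) bits=4 ;;
      224) bits=3 ;;
      192) bits=2 ;;
      128) bits=1 ;;
      0)   bits=0 ;;
      *)   return 1 ;;
    esac
    # Once an octet is not fully set, every later octet must be 0.
    if [ "$ones_ended" -eq 1 ] && [ "$bits" -ne 0 ]; then
      return 1
    fi
    if [ "$bits" -ne 8 ]; then
      ones_ended=1
    fi
    prefix=$((prefix + bits))
  done
  [ "$count" -eq 4 ] || return 1
  echo "$prefix"
}
```

For the address used here, `echo "192.168.6.100/$(mask_to_prefix 255.255.255.0)"` gives the form that rgmanager accepted in the logs later in this post.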
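Then add a script resource
9. Add a Service
The resources created here are global: if the cluster hosts several services, all of them can use these resources. A private resource can also be added inside Service Groups.
Now add a Service:
Service Groups --> Add : add a Service,
then use Add Resource to attach the two resources just created;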
Now check the result from the command line on a cluster node; any node in the cluster will do
[root@cent3 ~]# clustat
Cluster Status for ha1 @ Sun Jan 8 17:47:40 2017
Member Status: Quorate

 Member Name                        ID   Status
 ------ ----                        ---- ------
 cent2.test.com                        1 Online, rgmanager
 cent3.test.com                        2 Online, Local, rgmanager
 cent4.test.com                        3 Online, rgmanager

 Service Name             Owner (Last)             State
 ------- ----             ----- ------             -----
 service:web1             cent2.test.com           started
On cent2 both the IP address and the httpd service are up
[root@cent2 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:91:b3:11 brd ff:ff:ff:ff:ff:ff
    inet 192.168.6.32/24 brd 192.168.6.255 scope global eth0
    inet 192.168.6.100/24 scope global secondary eth0
    inet6 fe80::20c:29ff:fe91:b311/64 scope link
       valid_lft forever preferred_lft forever
[root@cent2 ~]# netstat -tunlp | grep 80
tcp        0      0 :::80        :::*        LISTEN      34901/httpd
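When the same check has to be repeated after each failover, the owner of a service can be pulled out of the clustat output instead of being read by eye. This is a sketch only: `service_owner` is a hypothetical helper, and it assumes the service line layout shown in the clustat output above (service name, owner, state).

```shell
# Print the owner column for a service from clustat output on stdin.
# Usage: clustat | service_owner SERVICE_NAME
service_owner() {
  awk -v svc="service:$1" '
    # Service lines look like: service:web1  cent2.test.com  started
    $1 == svc { print $2; found = 1 }
    END { exit !found }
  '
}
```

For example, `clustat | service_owner web1` would be expected to print the node that currently runs web1, and to fail if no such service line exists.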
10. Test failover:
The health checks that RHCS runs against a service can be followed in the /var/log/cluster/rgmanager.log log
Jan 08 18:56:59 rgmanager [ip] Checking 192.168.6.100/24, Level 10
Jan 08 18:56:59 rgmanager [ip] 192.168.6.100/24 present on eth0
Jan 08 18:56:59 rgmanager [ip] Link for eth0: Detected
Jan 08 18:56:59 rgmanager [ip] Link detected on eth0
Jan 08 18:56:59 rgmanager [ip] Local ping to 192.168.6.100 succeeded
Here you can see that it verifies that 192.168.6.100 is present on the interface and then pings it; this is how an IP resource is checked
Jan 08 18:55:49 rgmanager [script] Executing /etc/rc.d/init.d/httpd status
The line above shows how a script resource is checked: rgmanager simply runs the script with the status argument.
After I ran /etc/init.d/httpd stop, the following appeared in the log:
Jan 08 18:56:59 rgmanager [script] Executing /etc/rc.d/init.d/httpd status
Jan 08 18:56:59 rgmanager [script] script:http1: status of /etc/rc.d/init.d/httpd failed (returned 3)
# The status check failed here
Jan 08 18:56:59 rgmanager status on script "http1" returned 1 (generic error)
Jan 08 18:56:59 rgmanager Stopping service service:web1
Jan 08 18:56:59 rgmanager [script] Executing /etc/rc.d/init.d/httpd stop
Jan 08 18:56:59 rgmanager [ip] Removing IPv4 address 192.168.6.100/24 from eth0
# The steps above stopped the web1 service on this node
Jan 08 18:57:09 rgmanager Service service:web1 is recovering
Jan 08 18:57:14 rgmanager Service service:web1 is now running on member 2
# The web1 service was recovered on member 2, which is cent3.test.com
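To find out where a service ended up without reading the whole log, the recovery line can be filtered out directly. The sketch below assumes the message wording shown in the log excerpt above ("Service service:NAME is now running on member N"); `last_member` is a hypothetical helper name.

```shell
# Print the member ID from the most recent "is now running on member N"
# line for the given service; fail if the log on stdin has no such line.
# Usage: last_member SERVICE_NAME < /var/log/cluster/rgmanager.log
last_member() {
  awk -v pat="Service service:$1 is now running on member " '
    index($0, pat) { member = $NF }   # the member ID is the last field
    END { if (member == "") exit 1; print member }
  '
}
```

The member ID it prints corresponds to the ID column of clustat, so member 2 is cent3.test.com in this cluster.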
Check the cluster status after the failover:
[root@cent3 ~]# clustat
Cluster Status for ha1 @ Sun Jan 8 20:25:26 2017
Member Status: Quorate

 Member Name                        ID   Status
 ------ ----                        ---- ------
 cent2.test.com                        1 Online, rgmanager
 cent3.test.com                        2 Online, Local, rgmanager
 cent4.test.com                        3 Online, rgmanager

 Service Name             Owner (Last)             State
 ------- ----             ----- ------             -----
 service:web1             cent3.test.com           started
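If this kind of script resource does not meet your needs, you can try the apache resource instead. And if the only objection is that the script resource's check is too simplistic, you can add logic to the script itself to get the behavior you want.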
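As an illustration of adding logic to the script, the sketch below wraps the httpd init script so that status also requires an HTTP request to succeed. Everything named here (`web_ctl`, `http_ok`, `INIT_SCRIPT`, `CHECK_URL`) is hypothetical, and the URL is an assumption; the only behavior taken from the logs above is that a non-zero status result is treated as a failure.

```shell
# Hypothetical wrapper around the httpd init script with a stricter status check.
INIT_SCRIPT=${INIT_SCRIPT:-/etc/rc.d/init.d/httpd}
CHECK_URL=${CHECK_URL:-http://127.0.0.1/}   # assumed health-check URL

# Succeed only if the URL answers without an HTTP error within 5 seconds.
http_ok() {
  curl -fsS -o /dev/null --max-time 5 "$CHECK_URL"
}

web_ctl() {
  case $1 in
    status)
      # First the original check, keeping its exit code on failure...
      "$INIT_SCRIPT" status || return $?
      # ...then the extra check that the server really answers requests.
      http_ok || return 1
      ;;
    start|stop)
      "$INIT_SCRIPT" "$1"
      ;;
    *)
      echo "usage: web_ctl {start|stop|status}" >&2
      return 2
      ;;
  esac
}
```

To use something like this as the script resource, the file would end with a call such as `web_ctl "$1"` and the resource would point at this file instead of /etc/rc.d/init.d/httpd.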
11. Shut down nodes and watch how the Service moves:
After cent3 was shut down, the service moved to cent4
[root@cent2 ~]# clustat
Cluster Status for ha1 @ Sun Jan 8 20:35:42 2017
Member Status: Quorate

 Member Name                        ID   Status
 ------ ----                        ---- ------
 cent2.test.com                        1 Online, Local, rgmanager
 cent3.test.com                        2 Offline
 cent4.test.com                        3 Online, rgmanager

 Service Name             Owner (Last)             State
 ------- ----             ----- ------             -----
 service:web1             cent4.test.com           started
Next cent4 was shut down, and the Service moved again, this time to cent2
[root@cent2 ~]# clustat
Cluster Status for ha1 @ Sun Jan 8 20:36:27 2017
Member Status: Quorate

 Member Name                        ID   Status
 ------ ----                        ---- ------
 cent2.test.com                        1 Online, Local, rgmanager
 cent3.test.com                        2 Offline
 cent4.test.com                        3 Online

 Service Name             Owner (Last)             State
 ------- ----             ----- ------             -----
 service:web1             cent2.test.com           started
cent4.test.com still shows Online here because it was in the middle of shutting down and had not actually gone down yet.
A few seconds later the following message popped up:
[root@cent2 ~]#
Message from syslogd@cent2 at Jan 8 20:36:42 ...
 rgmanager[5685]: #1: Quorum Dissolved
The log shows:
Jan 08 20:35:01 rgmanager Member 2 shutting down
Jan 08 20:36:18 rgmanager Member 3 shutting down
Jan 08 20:36:18 rgmanager Starting stopped service service:web1
Jan 08 20:36:18 rgmanager [ip] Link for eth0: Detected
Jan 08 20:36:19 rgmanager [ip] Adding IPv4 address 192.168.6.100/24 to eth0
Jan 08 20:36:19 rgmanager [ip] Pinging addr 192.168.6.100 from dev eth0
Jan 08 20:36:21 rgmanager [ip] Sending gratuitous ARP: 192.168.6.100 00:0c:29:91:b3:11 brd ff:ff:ff:ff:ff:ff
Jan 08 20:36:22 rgmanager [script] Executing /etc/rc.d/init.d/httpd start
Jan 08 20:36:22 rgmanager Service service:web1 started
Jan 08 20:36:42 rgmanager #1: Quorum Dissolved
Message from syslogd@cent2 at Jan 8 20:36:42 ...
 rgmanager[5685]: #1: Quorum Dissolved
Jan 08 20:36:42 rgmanager [script] Executing /etc/rc.d/init.d/httpd stop
Jan 08 20:36:42 rgmanager [ip] Removing IPv4 address 192.168.6.100/24 from eth0
The service was stopped because the cluster no longer had quorum: only one of the three nodes was still online
[root@cent2 ~]# clustat
Service states unavailable: Operation requires quorum
Cluster Status for ha1 @ Sun Jan 8 20:37:00 2017
Member Status: Inquorate

 Member Name                        ID   Status
 ------ ----                        ---- ------
 cent2.test.com                        1 Online, Local
 cent3.test.com                        2 Offline
 cent4.test.com                        3 Offline
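The arithmetic behind this result can be sketched as follows, assuming the usual majority rule (more than half of the expected votes), one vote per node, and no quorum disk. The function names `quorum_needed` and `is_quorate` are hypothetical helpers, not cman commands.

```shell
# Minimum number of votes for quorum under a simple majority rule,
# assuming every node carries exactly one vote.
quorum_needed() {
  echo $(( $1 / 2 + 1 ))
}

# Succeed if ALIVE nodes out of TOTAL still form a majority.
# Usage: is_quorate TOTAL ALIVE
is_quorate() {
  [ "$2" -ge "$(quorum_needed "$1")" ]
}
```

With three nodes the threshold is two, which matches what happened here: the cluster stayed Quorate with cent3 down, and became Inquorate once cent4 was gone as well.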