[Solved] cman startup reports "Waiting for quorum... Timed-out waiting for cluster"

CentOS 6.5 + rgmanager + cman + gfs2, with the goal of shared access to FC SAN storage.
The cluster currently consists of two servers; the startup output is as follows:
[root@VM6 ~]# service cman start
Starting cluster:
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Starting qdiskd... [  OK  ]
   Waiting for quorum... Timed-out waiting for cluster
[FAILED]
Stopping cluster:
   Leaving fence domain... [  OK  ]
   Stopping gfs_controld... [  OK  ]
   Stopping dlm_controld... [  OK  ]
   Stopping fenced... [  OK  ]
   Stopping qdiskd... [  OK  ]
   Stopping cman... [  OK  ]
   Waiting for corosync to shutdown: [  OK  ]
   Unloading kernel modules... [  OK  ]
   Unmounting configfs... [  OK  ]
 
The messages log is attached. cluster.conf is as follows:
<?xml version="1.0"?>
<cluster config_version="6" name="clvmcluster">
        <clusternodes>
                <clusternode name="VM5.test.cn" nodeid="1"/>
                <clusternode name="VM6.test.cn" nodeid="2"/>
        </clusternodes>
        <cman expected_votes="3" quorum_dev_poll="45000"/>
        <totem token="135000" token_retransmits_before_loss_const="10"/>
        <fencedevices><fencedevice name="myfence" agent="fence_manual"/></fencedevices>
        <quorumd label="myqdisk" min_score="1">
                <heuristic interval="3" program="ping -c1 -t1 192.168.0.254" tko="10"/>
        </quorumd>
</cluster>
 
Could anyone help me figure out what the problem is, and where else I can find useful logs? Thanks.
One more thing: the cluster was fine right after I created it through luci; after two or three hours of use, node2 dropped out, and now it will not start no matter what I do, always failing with the error above.

Firxiao

Upvoted by: a_lzf

To enable debug logging, add the following to cluster.conf:
<logging to_syslog="yes" to_logfile="yes" syslog_facility="local4" syslog_priority="info" logfile_priority="debug">
        <logging_daemon name="qdiskd" logfile="/var/log/cluster/qdiskd.log"/>
        <logging_daemon name="fenced" logfile="/var/log/cluster/fenced.log"/>
        <logging_daemon name="dlm_controld" logfile="/var/log/cluster/dlm_controld.log"/>
        <logging_daemon name="gfs_controld" logfile="/var/log/cluster/gfs_controld.log"/>
        <logging_daemon name="rgmanager" logfile="/var/log/cluster/rgmanager.log"/>
        <logging_daemon name="corosync" logfile="/var/log/cluster/corosync.log"/>
</logging>
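After editing, a quick way to validate and apply the change (standard RHEL 6 cluster tooling; a sketch, assuming you also bump config_version in cluster.conf first):
# ccs_config_validate            <- checks the edited cluster.conf against the schema
# cman_tool version -r           <- propagates the new config_version to the running cluster
The per-daemon debug logs then appear under /var/log/cluster/, e.g. /var/log/cluster/qdiskd.log.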

a_lzf

Upvoted by:

The firewalls on both servers are already disabled.

Firxiao

Upvoted by:

Try changing cluster.conf so that
cman expected_votes="1"
and then restart the cman service on node1 and node2 (a sketch of the change follows below).

Reference: https://access.redhat.com/docu ... .html
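For a two-node cluster without a quorum disk, the documented form of that change is roughly the following (a sketch, not the poster's actual file; with a qdisk in play the expected_votes value differs, as discussed below):
<cman two_node="1" expected_votes="1"/>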

a_lzf

Upvoted by:

The change did not help. I have a quorum disk configured, which is why expected_votes="3"; with a qdisk, two_node="1" gets dropped.
I just tried again: if I power off both machines, boot them, and then join both nodes to the cluster in luci and start them at the same time, it works again.
But that is not the result I want; that way the whole cluster dies as soon as one machine fails.

Firxiao

Upvoted by:

The cluster hangs because the qdisk is not contributing a vote.
Add votes="1" to <quorumd label="myqdisk" min_score="1">.
If you configure it through luci, turn on Enable "expert" mode
and you will be able to set the qdisk votes (a config sketch follows the checks below).

Once that is in place, check the qdisk status:

# clustat
/dev/block/8:17                             0 Online, Quorum Disk
Is it listed as Online?

# cman_tool status | grep "Quorum device votes"
Quorum device votes: 1
This confirms whether the qdisk is participating in the vote.
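For reference, the suggested change applied to the quorumd block from the original cluster.conf would look roughly like this (a sketch based on the config posted above, with only votes="1" added):
<quorumd label="myqdisk" min_score="1" votes="1">
        <heuristic interval="3" program="ping -c1 -t1 192.168.0.254" tko="10"/>
</quorumd>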
 
 

a_lzf

Upvoted by:

Sorry for the silence, I have been testing fencing for the past two days. The qdisk seems to be fine; I did not set votes explicitly, but the default appears to be 1:
[root@VM5 cluster]# clustat
Cluster Status for clvmcluster @ Tue Jun 30 11:48:25 2015
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 VM5.test.cn                                                     1 Online, Local
 VM6.test.cn                                                     2 Online
 /dev/block/253:4                                                0 Online, Quorum Disk

[root@VM5 cluster]# cman_tool status|grep "Quorum device votes"
Quorum device votes: 1
 
Right now the situation is: after rebooting both machines and enabling them at the same time through luci, the cluster works and stays up for an hour or two; then one node drops out and the cluster becomes unusable again. I have now added fence devices and will keep watching how it behaves.

However, I am not sure fencing will work on one of the servers, for the following reason:
ipmitool -I open -H 192.168.0.6 -U root -P 123456 power status
That server has a fairly old IPMI version and only returns its status when the -I open interface option is added.
For fencing I have to test with fence_ipmilan, but how do I express the -I open option with that command?


When I run fence_ipmilan, I cannot get the status, because there is no open interface option:

Spawning: '/usr/bin/ipmitool -I lan -H '192.168.0.6' -U 'root' -P '[set]' -v chassis power status'
Judging from this output, the translated command uses -I lan.

 
 
 

Firxiao

Upvoted by:

Please post the cman_tool status output from the healthy node after one node has gone down, and the healthy node's log from when the downed node rejoins the cluster.

For the fence test, please run fence_node node_name.
 

a_lzf

Upvoted by:

# cman_tool status
Version: 6.2.0
Config Version: 13
Cluster Name: clvmcluster
Cluster Id: 15636
Cluster Member: Yes
Cluster Generation: 6396
Membership state: Cluster-Member
Nodes: 1
Expected votes: 3
Quorum device votes: 1
Total votes: 2
Node votes: 1
Quorum: 2 
Active subsystems: 11
Flags:
Ports Bound: 0 11 177 178 
Node name: VM5.test.cn
Node ID: 1
Multicast addresses: 239.192.61.81
Node addresses: 192.168.0.135





# fence_node -S VM5.test.cn
status VM5.test.cn dev 0.0 agent fence_ipmilan result: status error
status VM5.test.cn failed -1








Log from the healthy node:


Jun 30 15:21:25 VM5 qdiskd[15440]: Node 2 shutdown
Jun 30 15:21:32 VM5 ricci[32797]: Executing '/usr/bin/virsh nodeinfo'
Jun 30 15:21:32 VM5 libvirtd: Could not find keytab file: /etc/libvirt/krb5.tab: Permission denied
Jun 30 15:21:33 VM5 ricci[32851]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1389444572'
Jun 30 15:21:33 VM5 ricci[32855]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/960304385'
Jun 30 15:21:36 VM5 ricci[32884]: Executing '/usr/bin/virsh nodeinfo'
Jun 30 15:21:36 VM5 ricci[32886]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1313678498'
Jun 30 15:21:36 VM5 ricci[32889]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/331549025'
Jun 30 15:21:39 VM5 ricci[32895]: Executing '/usr/bin/virsh nodeinfo'
Jun 30 15:21:39 VM5 ricci[32897]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1706900687'
Jun 30 15:21:39 VM5 ricci[32902]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1422487015'
Jun 30 15:22:10 VM5 ricci[32969]: Executing '/usr/bin/virsh nodeinfo'
Jun 30 15:22:10 VM5 libvirtd: Could not find keytab file: /etc/libvirt/krb5.tab: Permission denied
Jun 30 15:22:11 VM5 ricci[33021]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/2037869509'
Jun 30 15:22:11 VM5 ricci[33024]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/2098344773'
Jun 30 15:22:14 VM5 ricci[33030]: Executing '/usr/bin/virsh nodeinfo'
Jun 30 15:22:14 VM5 ricci[33033]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1882238927'
Jun 30 15:22:14 VM5 ricci[33036]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/340213007'
Jun 30 15:22:27 VM5 qdiskd[15440]: Node 2 shutdown
Jun 30 15:22:31 VM5 ricci[33082]: Executing '/usr/bin/virsh nodeinfo'
Jun 30 15:22:31 VM5 ricci[33084]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/2094857633'
Jun 30 15:22:32 VM5 ricci[33087]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1686102141'
Jun 30 15:22:41 VM5 ricci[33109]: Executing '/usr/bin/virsh nodeinfo'
Jun 30 15:22:41 VM5 ricci[33111]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/248063161'
Jun 30 15:22:41 VM5 ricci[33114]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/914261937'
Jun 30 15:22:44 VM5 ricci[33121]: Executing '/usr/bin/virsh nodeinfo'
Jun 30 15:22:44 VM5 ricci[33123]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/655660045'
Jun 30 15:22:44 VM5 ricci[33129]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/109282885'
Jun 30 15:22:48 VM5 ricci[33158]: Executing '/usr/bin/virsh nodeinfo'
Jun 30 15:22:48 VM5 ricci[33160]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/592593966'
Jun 30 15:22:49 VM5 ricci[33165]: Executing '/usr/bin/virsh nodeinfo'
Jun 30 15:22:49 VM5 ricci[33167]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1041757480'
Jun 30 15:22:49 VM5 ricci[33170]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1316377082'
Jun 30 15:22:58 VM5 qdiskd[15440]: Node 2 shutdown
Jun 30 15:23:17 VM5 ricci[33227]: Executing '/usr/bin/virsh nodeinfo'
Jun 30 15:23:17 VM5 ricci[33229]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1425210145'
Jun 30 15:23:17 VM5 ricci[33234]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/57680535'
Jun 30 15:23:30 VM5 ricci[33253]: Executing '/usr/bin/virsh nodeinfo'
Jun 30 15:23:30 VM5 ricci[33255]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/965078932'
Jun 30 15:23:30 VM5 ricci[33258]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1625011633'
Jun 30 15:23:40 VM5 ricci[33301]: Executing '/usr/bin/virsh nodeinfo'
Jun 30 15:23:40 VM5 ricci[33303]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1633631745'
Jun 30 15:23:41 VM5 ricci[33306]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/871886494'
Jun 30 15:24:00 VM5 qdiskd[15440]: Node 2 shutdown
Jun 30 15:24:05 VM5 ricci[33361]: Executing '/usr/bin/virsh nodeinfo'
Jun 30 15:24:05 VM5 ricci[33363]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1800602587'
Jun 30 15:24:05 VM5 ricci[33366]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1660821938'
Jun 30 15:24:38 VM5 ricci[33428]: Executing '/usr/bin/virsh nodeinfo'
Jun 30 15:24:38 VM5 libvirtd: Could not find keytab file: /etc/libvirt/krb5.tab: Permission denied
Jun 30 15:24:38 VM5 ricci[33480]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1579850728'
Jun 30 15:24:38 VM5 ricci[33484]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1349498553'
Jun 30 15:24:43 VM5 ricci[33520]: Executing '/usr/bin/virsh nodeinfo'
Jun 30 15:24:43 VM5 ricci[33522]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/124296493'
Jun 30 15:24:43 VM5 ricci[33525]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1923425233'
Jun 30 15:25:14 VM5 qdiskd[15440]: Node 2 shutdown
Jun 30 15:25:30 VM5 ricci[33626]: Executing '/usr/bin/virsh nodeinfo'
Jun 30 15:25:30 VM5 libvirtd: Could not find keytab file: /etc/libvirt/krb5.tab: Permission denied
Jun 30 15:25:30 VM5 ricci[33680]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/733442949'
Jun 30 15:25:30 VM5 ricci[33683]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1013733291'
Jun 30 15:30:53 VM5 qdiskd[15440]: Node 2 shutdown

a_lzf

Upvoted by:

The fence_node test fails, but ipmitool can query the remote server successfully, and fence_ipmilan works as well. Is something configured incorrectly?
[root@VM6]# fence_node -S VM5.test.cn
status VM5.test.cn dev 0.0 agent fence_ipmilan result: status error
status VM5.test.cn failed -1
[root@VM6]# ipmitool -H 192.168.0.5 -U root -P 123456 power status
Chassis Power is on
[root@VM6]# fence_ipmilan -v -a 192.168.0.5 -l root -p 123456 -o status
Getting status of IPMI:192.168.0.5...Spawning: '/usr/bin/ipmitool -I lan -H '192.168.0.5' -U 'root' -P '[set]' -v chassis power status'...
Chassis power = On
Done
 
[root@VM5]# ipmitool -I open -H 192.168.0.6 -U root -P 123456 power status
Chassis Power is on
 
# cat cluster.conf
<?xml version="1.0"?>
<cluster config_version="14" name="testcluster">
        <clusternodes>
                <clusternode name="VM5.test.cn" nodeid="1">
                        <fence>
                                <method name="VM5">
                                        <device name="VM5"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="VM6.test.cn" nodeid="2">
                        <fence>
                                <method name="VM6">
                                        <device name="VM6"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="3"/>
        <quorumd label="myqdisk" min_score="1">
                <heuristic interval="3" program="ping -c1 -t1 172.22.22.254" tko="10"/>
        </quorumd>
        <fencedevices>
                <fencedevice agent="fence_ipmilan" auth="password" ipaddr="192.168.0.5" login="root" name="VM5" passwd="123456"/>
                <fencedevice agent="fence_ipmilan" auth="password" ipaddr="192.168.0.6" login="root" name="VM6" passwd="123456"/>
        </fencedevices>
</cluster>

Firxiao

Upvoted by:

Judging from the cman_tool status you posted, the cluster does not hang when one node goes down.
The log does not show anything useful. Please try watching the healthy node's log while the downed node is starting up, and also the healthy node's log right after one server is suddenly shut down while the cluster is healthy.
Also check SELinux; an SELinux-related cause cannot be ruled out (a quick check is sketched below).
As for fencing, contact the hardware vendor or search online for how to configure the fence agent for your device model.
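A quick SELinux check on each node (standard RHEL commands; the example output assumes SELinux is disabled, which the next reply confirms):
# getenforce
Disabled
# sestatus
SELinux status:                 disabled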
 

a_lzf

Upvoted by:

SELinux is disabled. The fence configuration is fine now. With the qdisk configured, the cluster survives the loss of one node.

The remaining problem is that I cannot get a disconnected node back into the cluster no matter what I try; the only thing that works is rebooting all the cluster servers and joining them to the cluster together through luci. I also tested stopping and starting cman, rgmanager, clvmd and the other services in various orders, and it never works the way online guides say it should.

A couple of days ago I ticked a few options in luci's advanced configuration, and the GFS2 shared storage works fine now; only this rejoin problem is left.

I have gone through the logs again and again without spotting anything unusual, pasted them into web searches, and tried every method suggested. It feels like I am just one small step away, but I cannot tell where the problem is, and it is driving me crazy.

Firxiao

Upvoted by:

[b]Leaving the cluster[/b]
[root@ ~]# /etc/init.d/clvmd stop
Deactivating clustered VG(s): 0 logical volume(s) in volume group "gfs2" now active
[  OK  ]
Signaling clvmd to exit [  OK  ]
clvmd terminated [  OK  ]
[root@ ~]# /etc/init.d/rgmanager stop
Stopping Cluster Service Manager: [  OK  ]
[root@ ~]# /etc/init.d/cman stop
Stopping cluster:
Leaving fence domain... [  OK  ]
Stopping gfs_controld... [  OK  ]
Stopping dlm_controld... [  OK  ]
Stopping fenced... [  OK  ]
Stopping qdiskd... [  OK  ]
Stopping cman... [  OK  ]
Waiting for corosync to shutdown: [  OK  ]
Unloading kernel modules... [  OK  ]
Unmounting configfs... [  OK  ]


[b]Joining the cluster[/b]
[root@ ~]# /etc/init.d/cman start
Starting cluster:
Checking if cluster has been disabled at boot... [  OK  ]
Checking Network Manager... [  OK  ]
Global setup... [  OK  ]
Loading kernel modules... [  OK  ]
Mounting configfs... [  OK  ]
Starting cman... [  OK  ]
Starting qdiskd... [  OK  ]
Waiting for quorum... [  OK  ]
Starting fenced... [  OK  ]
Starting dlm_controld... [  OK  ]
Tuning DLM kernel config... [  OK  ]
Starting gfs_controld... [  OK  ]
Unfencing self... [  OK  ]
Joining fence domain... [  OK  ]

[root@ ~]# /etc/init.d/clvmd start
Starting clvmd:
Activating VG(s): 1 logical volume(s) in volume group "gfs2" now active
2 logical volume(s) in volume group "VolGroup" now active
[  OK  ]

 
 

a_lzf

Upvoted by:

Many thanks to Firxiao for the patient help. The problem is solved: the likely cause is that my switch does not support UDP multicast. After I changed the Network Transport Type to UDP Broadcast, everything works correctly.
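For reference, the equivalent change made directly in cluster.conf is a transport attribute on the cman element; a sketch, assuming RHEL 6 cman semantics (luci writes the attribute for you when you change the Network Transport Type, and cman(5) on your release documents the exact form):
<cman expected_votes="3" broadcast="yes"/>
Multicast reachability between the nodes can also be verified up front with the omping utility before falling back to broadcast.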
 

arthurpk1209

Upvoted by:

a_lzf:

Could you share your configuration file with me? How did you end up solving the fence_device issue? I am using iDRAC 7 and have tried many times without success. Thanks!
