ext Brett Cave wrote:
> On Wed, Feb 25, 2009 at 11:45 AM, Mockey Chen <mockey.chen@xxxxxxx> wrote:
>
>> ext Kein He wrote:
>>
>>> I think there is a problem. "cman_tool status" shows:
>>>
>>> Nodes: 2
>>> Expected votes: 3
>>> Total votes: 2
>>>
>>> According to your cluster.conf, if all nodes and the qdisk are online,
>>> "Total votes" must be "3". Probably qdiskd is not running; you can use
>>> "cman_tool nodes" to check whether the qdisk is working.
>>>
>> Yes, here is the "cman_tool nodes" output:
>>
>> Node  Sts   Inc   Joined               Name
>>    1   M    112   2009-02-25 03:05:19  as-1.localdomain
>>    2   M    104   2009-02-25 03:05:19  as-2.localdomain
>>
>> A question: how do I check whether qdisk is running, and how do I start it?
>>
> [root@blade3 ~]# service qdiskd status
> qdiskd (pid 2832) is running...
> [root@blade3 ~]# pgrep qdisk -l
> 2832 qdiskd
> [root@blade3 ~]# cman_tool nodes
> Node  Sts   Inc   Joined               Name
>    0   M      0   2009-02-19 16:11:55  /dev/sda5  ## This is the qdisk.
>    1   M   1524   2009-02-20 22:27:32  blade1
>    2   M   1552   2009-02-24 04:39:24  blade2
>    3   M   1500   2009-02-19 16:11:03  blade3
>    4   M   1516   2009-02-19 16:11:22  blade4
>
> You can use "service qdiskd start" to start it, or run it with
> /usr/sbin/qdiskd -Q if you don't have the init script. If you installed
> from an RPM on a Red Hat-type distro, the script should be there.
>
> Regards,
> brett
>
I tried "service qdiskd start", but it failed:

[root@as-2 ~]# service qdiskd start
Starting the Quorum Disk Daemon:                           [FAILED]
[root@as-2 ~]# tail /var/log/messages
...
Feb 26 09:19:40 as-2 qdiskd[14707]: <crit> Unable to match label 'testing' to any device
Feb 26 09:19:46 as-2 clurgmgrd[4032]: <notice> Reconfiguring

Here is my qdisk configuration; I copied it from "man qdisk":

<quorumd interval="1" tko="10" votes="1" label="testing">
    <heuristic program="ping 10.56.150.1 -c1 -t1" score="1" interval="2" tko="3"/>
</quorumd>

How do I map the label to a device? Note: I do not have any shared storage.

Thanks.
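[Editor's note on the label question: qdiskd matches label="..." against a header that mkqdisk writes onto a dedicated partition on shared storage, so with no shared block device visible to both nodes there is nothing for the label to map to, and qdiskd cannot start. A sketch of the usual setup, assuming a hypothetical shared partition /dev/sdb1 -- the device name here is a placeholder, not taken from the thread:]

```shell
# Write the qdisk header and label onto a shared partition.
# WARNING: this destroys existing data on the partition; it must be a
# small partition on storage that BOTH nodes can see (SAN/iSCSI LUN).
mkqdisk -c /dev/sdb1 -l testing

# List all qdisk partitions and their labels, to verify that qdiskd
# will be able to match label="testing" to a device on this node.
mkqdisk -L
```

Run `mkqdisk -L` on each node; the labeled partition must show up on both, otherwise qdiskd will fail with the same "Unable to match label" error.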
>>>
>>> Mockey Chen wrote:
>>>> ext Mockey Chen wrote:
>>>>> ext Kein He wrote:
>>>>>> Hi Mockey,
>>>>>>
>>>>>> Could you please attach the output from "cman_tool status" and
>>>>>> "cman_tool nodes -f"?
>>>>>>
>>>>> Thanks for your response.
>>>>>
>>>>> I tried to run cman_tool status on as-2, but it hangs with no output,
>>>>> and even Ctrl+C has no effect.
>>>>>
>>>> I manually rebooted as-1, and the problem was solved.
>>>>
>>>> Here is the output of cman_tool:
>>>>
>>>> [root@as-1 ~]# cman_tool status
>>>> Version: 6.1.0
>>>> Config Version: 19
>>>> Cluster Name: azerothcluster
>>>> Cluster Id: 20148
>>>> Cluster Member: Yes
>>>> Cluster Generation: 76
>>>> Membership state: Cluster-Member
>>>> Nodes: 2
>>>> Expected votes: 3
>>>> Total votes: 2
>>>> Quorum: 2
>>>> Active subsystems: 8
>>>> Flags: Dirty
>>>> Ports Bound: 0 177
>>>> Node name: as-1.localdomain
>>>> Node ID: 1
>>>> Multicast addresses: 239.192.78.3
>>>> Node addresses: 10.56.150.3
>>>> [root@as-1 ~]# cman_tool status -f
>>>> Version: 6.1.0
>>>> Config Version: 19
>>>> Cluster Name: azerothcluster
>>>> Cluster Id: 20148
>>>> Cluster Member: Yes
>>>> Cluster Generation: 76
>>>> Membership state: Cluster-Member
>>>> Nodes: 2
>>>> Expected votes: 3
>>>> Total votes: 2
>>>> Quorum: 2
>>>> Active subsystems: 8
>>>> Flags: Dirty
>>>> Ports Bound: 0 177
>>>> Node name: as-1.localdomain
>>>> Node ID: 1
>>>> Multicast addresses: 239.192.78.3
>>>> Node addresses: 10.56.150.3
>>>>
>>>> It seems the cluster cannot fence one of the nodes. How can I solve this?
>>>>
>>>>> I opened a new window and could ssh to as-2, but after login I could
>>>>> not do anything; even a simple 'ls' command hung.
>>>>>
>>>>> It seems the system stays alive but does not provide any service.
>>>>> Really bad.
>>>>>
>>>>> Any way to debug this issue?
>>>>>
>>>>>> Mockey Chen wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have a two-node cluster; to avoid split-brain,
>>>>>>> I use iLO as the fence
>>>>>>> device, with an IP tiebreaker. Here is my /etc/cluster/cluster.conf:
>>>>>>>
>>>>>>> <?xml version="1.0"?>
>>>>>>> <cluster alias="azerothcluster" config_version="19" name="azerothcluster">
>>>>>>>     <cman expected_votes="3" two_node="0"/>
>>>>>>>     <clusternodes>
>>>>>>>         <clusternode name="as-1.localdomain" nodeid="1" votes="1">
>>>>>>>             <fence>
>>>>>>>                 <method name="1">
>>>>>>>                     <device name="ilo1"/>
>>>>>>>                 </method>
>>>>>>>             </fence>
>>>>>>>         </clusternode>
>>>>>>>         <clusternode name="as-2.localdomain" nodeid="2" votes="1">
>>>>>>>             <fence>
>>>>>>>                 <method name="1">
>>>>>>>                     <device name="ilo2"/>
>>>>>>>                 </method>
>>>>>>>             </fence>
>>>>>>>         </clusternode>
>>>>>>>     </clusternodes>
>>>>>>>     <quorumd interval="1" tko="10" votes="1" label="pingtest">
>>>>>>>         <heuristic program="ping 10.56.150.1 -c1 -t1" score="1"
>>>>>>>             interval="2" tko="3"/>
>>>>>>>     </quorumd>
>>>>>>>     <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>>>>>>>     <fencedevices>
>>>>>>>         <fencedevice agent="fence_ilo" hostname="10.56.154.18"
>>>>>>>             login="power" name="ilo1" passwd="pass"/>
>>>>>>>         <fencedevice agent="fence_ilo" hostname="10.56.154.19"
>>>>>>>             login="power" name="ilo2" passwd="pass"/>
>>>>>>>     </fencedevices>
>>>>>>>     ...
>>>>>>>     ...
>>>>>>>
>>>>>>> To test the "one node loses heartbeat" case, I disabled the ethernet
>>>>>>> card (eth0) on as-1. I expected as-2 to take over the services on as-1
>>>>>>> and as-1 to reboot. What actually happened: as-1 lost its connection
>>>>>>> to as-2; as-2 detected this and tried to re-form the cluster, but
>>>>>>> failed. Here is the syslog from as-2:
>>>>>>>
>>>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] The token was lost in the
>>>>>>> OPERATIONAL state.
>>>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] Receive multicast socket
>>>>>>> recv buffer size (288000 bytes).
>>>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] Transmit multicast socket
>>>>>>> send buffer size (262142 bytes).
>>>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] entering GATHER state
>>>>>>> from 2.
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering GATHER state
>>>>>>> from 0.
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Creating commit token
>>>>>>> because I am the rep.
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Saving state aru 1f4 high
>>>>>>> seq received 1f4
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Storing new sequence id
>>>>>>> for ring 2c
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering COMMIT state.
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering RECOVERY state.
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] position [0] member
>>>>>>> 10.56.150.4:
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] previous ring seq 40 rep
>>>>>>> 10.56.150.3
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] aru 1f4 high delivered
>>>>>>> 1f4 received flag 1
>>>>>>>
>>>>>>> Message from syslogd@ at Tue Feb 24 21:25:40 2009 ...
>>>>>>> as-2 clurgmgrd[4194]: <emerg> #1: Quorum Dissolved
>>>>>>>
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Did not need to originate
>>>>>>> any messages in recovery.
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Sending initial ORF token
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] CLM CONFIGURATION CHANGE
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] New Configuration:
>>>>>>> Feb 24 21:25:40 as-2 clurgmgrd[4194]: <emerg> #1: Quorum Dissolved
>>>>>>> Feb 24 21:25:40 as-2 kernel: dlm: closing connection to node 1
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.4)
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Left:
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.3)
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Joined:
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CMAN ] quorum lost, blocking
>>>>>>> activity
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] CLM CONFIGURATION CHANGE
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] New Configuration:
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.4)
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Left:
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Joined:
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [SYNC ] This node is within the
>>>>>>> primary component and will provide service.
>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Cluster is not quorate. Refusing
>>>>>>> connection.
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering OPERATIONAL
>>>>>>> state.
>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Error while processing connect:
>>>>>>> Connection refused
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] got nodejoin message
>>>>>>> 10.56.150.4
>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Invalid descriptor specified (-111).
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CPG  ] got joinlist message from
>>>>>>> node 2
>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Someone may be attempting something
>>>>>>> evil.
>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Error while processing get: Invalid
>>>>>>> request descriptor
>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Invalid descriptor specified (-111).
>>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Someone may be attempting something
>>>>>>> evil.
>>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Error while processing get: Invalid
>>>>>>> request descriptor
>>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Invalid descriptor specified (-21).
>>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Someone may be attempting something
>>>>>>> evil.
>>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Error while processing disconnect:
>>>>>>> Invalid request descriptor
>>>>>>> Feb 24 21:25:41 as-2 avahi-daemon[3756]: Withdrawing address record
>>>>>>> for 10.56.150.144 on eth0.
>>>>>>> Feb 24 21:25:41 as-2 in.rdiscd[8641]: setsockopt (IP_ADD_MEMBERSHIP):
>>>>>>> Address already in use
>>>>>>> Feb 24 21:25:41 as-2 in.rdiscd[8641]: Failed joining addresse
>>>>>>>
>>>>>>> I also found some errors in as-1's syslog:
>>>>>>>
>>>>>>> Feb 25 11:27:09 as-1 clurgmgrd[4332]: <err> #52: Failed changing RG
>>>>>>> status
>>>>>>> Feb 25 11:27:09 as-1 clurgmgrd: [4332]: <warning> Link for eth0: Not
>>>>>>> detected
>>>>>>> Feb 25 11:27:09 as-1 clurgmgrd: [4332]: <warning> No link on eth0...
>>>>>>> ...
>>>>>>> Feb 25 11:27:36 as-1 ccsd[4268]: Unable to connect to cluster
>>>>>>> infrastructure after 30 seconds.
>>>>>>> ...
>>>>>>> Feb 25 11:28:06 as-1 ccsd[4268]: Unable to connect to cluster
>>>>>>> infrastructure after 60 seconds.
>>>>>>> ...
>>>>>>> Feb 25 11:28:06 as-1 ccsd[4268]: Unable to connect to cluster
>>>>>>> infrastructure after 90 seconds.
>>>>>>>
>>>>>>> Any comment is appreciated!
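[Editor's note on the "quorum lost, blocking activity" message above: with expected_votes="3" (two 1-vote nodes plus a 1-vote qdisk) and two_node="0", cman requires expected_votes/2 + 1 = 2 votes for quorum, which matches the "Quorum: 2" line in the cman_tool status output earlier in the thread. When as-1 drops out and qdiskd is not running, as-2 holds only its own single vote and goes inquorate. A minimal sketch of that arithmetic, using the values from this thread:]

```shell
# Vote arithmetic for the cluster.conf above.
expected_votes=3   # two 1-vote nodes + 1-vote qdisk
node_votes=1       # what as-2 holds alone after as-1 drops out
qdisk_votes=0      # qdiskd never started, so its vote is missing

# cman's quorum threshold: expected_votes/2 + 1 (integer division)
quorum=$(( expected_votes / 2 + 1 ))
total=$(( node_votes + qdisk_votes ))

echo "quorum needed: $quorum, votes held: $total"
# prints: quorum needed: 2, votes held: 1
if [ "$total" -lt "$quorum" ]; then
    echo "inquorate: cman blocks activity"
fi
```

With a working qdisk the surviving node would hold 1 + 1 = 2 votes and stay quorate; the alternative for a qdisk-less two-node setup is two_node="1" with expected_votes="1", relying on fencing alone.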

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster