ext Mockey Chen wrote:
> ext Kein He wrote:
>
>> Hi Mockey,
>>
>> Could you please attach the output from "cman_tool status" and
>> "cman_tool nodes -f"?
>>
>>
> Thanks for your response.
>
> I tried to run cman_tool status on as-2, but it hung without any output,
> and even Ctrl+C had no effect.
>
I manually rebooted as-1, and that solved the problem. Here is the output of
cman_tool:

[root@as-1 ~]# cman_tool status
Version: 6.1.0
Config Version: 19
Cluster Name: azerothcluster
Cluster Id: 20148
Cluster Member: Yes
Cluster Generation: 76
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Total votes: 2
Quorum: 2
Active subsystems: 8
Flags: Dirty
Ports Bound: 0 177
Node name: as-1.localdomain
Node ID: 1
Multicast addresses: 239.192.78.3
Node addresses: 10.56.150.3

[root@as-1 ~]# cman_tool status -f
Version: 6.1.0
Config Version: 19
Cluster Name: azerothcluster
Cluster Id: 20148
Cluster Member: Yes
Cluster Generation: 76
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Total votes: 2
Quorum: 2
Active subsystems: 8
Flags: Dirty
Ports Bound: 0 177
Node name: as-1.localdomain
Node ID: 1
Multicast addresses: 239.192.78.3
Node addresses: 10.56.150.3

It seems the cluster cannot fence one of the nodes. How can I solve this?

> I opened a new window and could ssh to as-2, but after logging in I could
> not do anything; even a simple 'ls' command hung.
>
> It seems the system stays alive but does not provide any service. Really bad.
>
> Is there any way to debug this issue?
>
>> Mockey Chen wrote:
>>
>>> Hi,
>>>
>>> I have a two-node cluster. To avoid split-brain, I use iLO as the fence
>>> device and an IP tiebreaker.
>>> Here is my /etc/cluster/cluster.conf:
>>> <?xml version="1.0"?>
>>> <cluster alias="azerothcluster" config_version="19"
>>>          name="azerothcluster">
>>>   <cman expected_votes="3" two_node="0"/>
>>>   <clusternodes>
>>>     <clusternode name="as-1.localdomain" nodeid="1" votes="1">
>>>       <fence>
>>>         <method name="1">
>>>           <device name="ilo1"/>
>>>         </method>
>>>       </fence>
>>>     </clusternode>
>>>     <clusternode name="as-2.localdomain" nodeid="2" votes="1">
>>>       <fence>
>>>         <method name="1">
>>>           <device name="ilo2"/>
>>>         </method>
>>>       </fence>
>>>     </clusternode>
>>>   </clusternodes>
>>>   <quorumd interval="1" tko="10" votes="1" label="pingtest">
>>>     <heuristic program="ping 10.56.150.1 -c1 -t1" score="1"
>>>                interval="2" tko="3"/>
>>>   </quorumd>
>>>   <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>>>   <fencedevices>
>>>     <fencedevice agent="fence_ilo" hostname="10.56.154.18"
>>>                  login="power" name="ilo1" passwd="pass"/>
>>>     <fencedevice agent="fence_ilo" hostname="10.56.154.19"
>>>                  login="power" name="ilo2" passwd="pass"/>
>>>   </fencedevices>
>>> ...
>>> ...
>>>
>>> To test the case where one node loses its heartbeat, I disabled the
>>> Ethernet card (eth0) on as-1. I expected as-2 to take over the services
>>> from as-1 and as-1 to be rebooted. What actually happened is that as-1
>>> lost its connection to as-2; as-2 detected this and tried to re-form
>>> the cluster, but failed. Here is the syslog from as-2:
>>>
>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] The token was lost in the
>>> OPERATIONAL state.
>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] Receive multicast socket
>>> recv buffer size (288000 bytes).
>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] Transmit multicast socket
>>> send buffer size (262142 bytes).
>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] entering GATHER state
>>> from 2.
>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering GATHER state
>>> from 0.
>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Creating commit token
>>> because I am the rep.
>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Saving state aru 1f4 high
>>> seq received 1f4
>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Storing new sequence id for
>>> ring 2c
>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering COMMIT state.
>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering RECOVERY state.
>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] position [0] member
>>> 10.56.150.4:
>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] previous ring seq 40 rep
>>> 10.56.150.3
>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] aru 1f4 high delivered 1f4
>>> received flag 1
>>>
>>> Message from syslogd@ at Tue Feb 24 21:25:40 2009 ...
>>> as-2 clurgmgrd[4194]: <emerg> #1: Quorum Dissolved
>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Did not need to originate
>>> any messages in recovery.
>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Sending initial ORF token
>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM ] CLM CONFIGURATION CHANGE
>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM ] New Configuration:
>>> Feb 24 21:25:40 as-2 clurgmgrd[4194]: <emerg> #1: Quorum Dissolved
>>> Feb 24 21:25:40 as-2 kernel: dlm: closing connection to node 1
>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM ] r(0) ip(10.56.150.4)
>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM ] Members Left:
>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM ] r(0) ip(10.56.150.3)
>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM ] Members Joined:
>>> Feb 24 21:25:40 as-2 openais[4139]: [CMAN ] quorum lost, blocking
>>> activity
>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM ] CLM CONFIGURATION CHANGE
>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM ] New Configuration:
>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM ] r(0) ip(10.56.150.4)
>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM ] Members Left:
>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM ] Members Joined:
>>> Feb 24 21:25:40 as-2 openais[4139]: [SYNC ] This node is within the
>>> primary component and will provide service.
>>> Feb 24 21:25:40 as-2 ccsd[4130]: Cluster is not quorate. Refusing
>>> connection.
>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering OPERATIONAL state.
>>> Feb 24 21:25:40 as-2 ccsd[4130]: Error while processing connect:
>>> Connection refused
>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM ] got nodejoin message
>>> 10.56.150.4
>>> Feb 24 21:25:40 as-2 ccsd[4130]: Invalid descriptor specified (-111).
>>> Feb 24 21:25:40 as-2 openais[4139]: [CPG ] got joinlist message from
>>> node 2
>>> Feb 24 21:25:40 as-2 ccsd[4130]: Someone may be attempting something
>>> evil.
>>> Feb 24 21:25:40 as-2 ccsd[4130]: Error while processing get: Invalid
>>> request descriptor
>>> Feb 24 21:25:40 as-2 ccsd[4130]: Invalid descriptor specified (-111).
>>> Feb 24 21:25:41 as-2 ccsd[4130]: Someone may be attempting something
>>> evil.
>>> Feb 24 21:25:41 as-2 ccsd[4130]: Error while processing get: Invalid
>>> request descriptor
>>> Feb 24 21:25:41 as-2 ccsd[4130]: Invalid descriptor specified (-21).
>>> Feb 24 21:25:41 as-2 ccsd[4130]: Someone may be attempting something
>>> evil.
>>> Feb 24 21:25:41 as-2 ccsd[4130]: Error while processing disconnect:
>>> Invalid request descriptor
>>> Feb 24 21:25:41 as-2 avahi-daemon[3756]: Withdrawing address record for
>>> 10.56.150.144 on eth0.
>>> Feb 24 21:25:41 as-2 in.rdiscd[8641]: setsockopt (IP_ADD_MEMBERSHIP):
>>> Address already in use
>>> Feb 24 21:25:41 as-2 in.rdiscd[8641]: Failed joining addresse
>>>
>>> I also found some errors in as-1's syslog:
>>> Feb 25 11:27:09 as-1 clurgmgrd[4332]: <err> #52: Failed changing RG
>>> status
>>> Feb 25 11:27:09 as-1 clurgmgrd: [4332]: <warning> Link for eth0: Not
>>> detected
>>> Feb 25 11:27:09 as-1 clurgmgrd: [4332]: <warning> No link on eth0...
>>> ...
>>> Feb 25 11:27:36 as-1 ccsd[4268]: Unable to connect to cluster
>>> infrastructure after 30 seconds.
>>> ...
>>> Feb 25 11:28:06 as-1 ccsd[4268]: Unable to connect to cluster
>>> infrastructure after 60 seconds.
>>> ...
>>> Feb 25 11:28:06 as-1 ccsd[4268]: Unable to connect to cluster
>>> infrastructure after 90 seconds.
>>>
>>> Any comments are appreciated!
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster@xxxxxxxxxx
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
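
As a side note on the vote counts in the cman_tool output above, the quorum arithmetic can be checked with a few lines of Python. This is only a minimal sketch: it assumes quorum is a simple majority of expected_votes (expected_votes // 2 + 1, which agrees with the "Quorum: 2" reported for expected_votes=3). The function names and the "qdisk" key are illustrative and are not part of cman or any other cluster tool.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for quorum, assuming a simple-majority rule."""
    return expected_votes // 2 + 1


def is_quorate(votes_present: dict, expected_votes: int) -> bool:
    """True if the votes currently present reach the quorum threshold."""
    return sum(votes_present.values()) >= quorum_threshold(expected_votes)


EXPECTED_VOTES = 3  # from <cman expected_votes="3" .../> in cluster.conf

# Each node has votes="1"; the quorum disk is configured with votes="1".
both_nodes = {"as-1": 1, "as-2": 1}
as2_alone = {"as-2": 1}
as2_with_qdisk = {"as-2": 1, "qdisk": 1}  # "qdisk" is a hypothetical label

print(quorum_threshold(EXPECTED_VOTES))            # 2
print(is_quorate(both_nodes, EXPECTED_VOTES))      # True
print(is_quorate(as2_alone, EXPECTED_VOTES))       # False
print(is_quorate(as2_with_qdisk, EXPECTED_VOTES))  # True
```

Under these assumptions, as-2 on its own holds 1 vote and falls below the threshold of 2, which is consistent with the "quorum lost, blocking activity" message in the as-2 syslog. It would stay quorate only if the quorum disk's vote were counted. The status output shows "Total votes: 2" with two nodes present, which may indicate that the quorum disk vote was not being counted at that time.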