ext Brett Cave wrote:
> On Wed, Feb 25, 2009 at 11:45 AM, Mockey Chen <mockey.chen@xxxxxxx> wrote:
>
>> ext Kein He wrote:
>>
>>> I think there is a problem. "cman_tool status" shows:
>>>
>>> Nodes: 2
>>> Expected votes: 3
>>> Total votes: 2
>>>
>>> According to your cluster.conf, if all nodes and the qdisk are online,
>>> "Total votes" must be "3". Probably qdiskd is not running; you can use
>>> "cman_tool nodes" to check whether the qdisk is working.
>>>
>> Yes, here is the "cman_tool nodes" output:
>>
>> Node  Sts   Inc   Joined               Name
>>    1   M    112   2009-02-25 03:05:19  as-1.localdomain
>>    2   M    104   2009-02-25 03:05:19  as-2.localdomain
>>
>> A question: how do I check whether qdisk is running, and how do I start it?
>>
> [root@blade3 ~]# service qdiskd status
> qdiskd (pid 2832) is running...
> [root@blade3 ~]# pgrep qdisk -l
> 2832 qdiskd
> [root@blade3 ~]# cman_tool nodes
> Node  Sts   Inc   Joined               Name
>    0   M      0   2009-02-19 16:11:55  /dev/sda5  ## This is the qdisk.
>    1   M   1524   2009-02-20 22:27:32  blade1
>    2   M   1552   2009-02-24 04:39:24  blade2
>    3   M   1500   2009-02-19 16:11:03  blade3
>    4   M   1516   2009-02-19 16:11:22  blade4
>
> You can use "service qdiskd start" to start it, or run it with
> /usr/sbin/qdiskd -Q if you don't have the init script. If you installed
> from an RPM on a Red Hat-type distro, the script should be there.
>
> Regards,
> brett
>
I tried "service qdiskd start", but it failed:

[root@as-2 ~]# service qdiskd start
Starting the Quorum Disk Daemon:                           [FAILED]
[root@as-2 ~]# tail /var/log/messages
...
Feb 26 09:19:40 as-2 qdiskd[14707]: <crit> Unable to match label 'testing' to any device
Feb 26 09:19:46 as-2 clurgmgrd[4032]: <notice> Reconfiguring

Here is my qdisk configuration; I copied it from "man qdisk":

<quorumd interval="1" tko="10" votes="1" label="testing">
    <heuristic program="ping 10.56.150.1 -c1 -t1" score="1" interval="2" tko="3"/>
</quorumd>

How do I map the label to a device? Note: I do not have any shared storage.

Thanks.
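[Editor's note on the label question: qdiskd matches label="..." against a header that mkqdisk writes onto a dedicated partition on shared storage, so with no shared block device visible to both nodes there is nothing for the label to map to, and qdiskd cannot start. A sketch of the usual setup, assuming a hypothetical shared partition /dev/sdb1 -- the device name here is a placeholder, not taken from the thread:]

```shell
# Write the qdisk header and label onto a shared partition.
# WARNING: this destroys existing data on the partition; it must be a
# small partition on storage that BOTH nodes can see (SAN/iSCSI LUN).
mkqdisk -c /dev/sdb1 -l testing

# List all qdisk partitions and their labels, to verify that qdiskd
# will be able to match label="testing" to a device on this node.
mkqdisk -L
```

Run `mkqdisk -L` on each node; the labeled partition must show up on both, otherwise qdiskd will fail with the same "Unable to match label" error.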
>>>
>>> Mockey Chen wrote:
>>>> ext Mockey Chen wrote:
>>>>> ext Kein He wrote:
>>>>>> Hi Mockey,
>>>>>>
>>>>>> Could you please attach the output from "cman_tool status" and
>>>>>> "cman_tool nodes -f"?
>>>>>>
>>>>> Thanks for your response.
>>>>>
>>>>> I tried to run cman_tool status on as-2, but it hangs with no output,
>>>>> and even Ctrl+C has no effect.
>>>>>
>>>> I manually rebooted as-1, and the problem was solved.
>>>>
>>>> Here is the output of cman_tool:
>>>>
>>>> [root@as-1 ~]# cman_tool status
>>>> Version: 6.1.0
>>>> Config Version: 19
>>>> Cluster Name: azerothcluster
>>>> Cluster Id: 20148
>>>> Cluster Member: Yes
>>>> Cluster Generation: 76
>>>> Membership state: Cluster-Member
>>>> Nodes: 2
>>>> Expected votes: 3
>>>> Total votes: 2
>>>> Quorum: 2
>>>> Active subsystems: 8
>>>> Flags: Dirty
>>>> Ports Bound: 0 177
>>>> Node name: as-1.localdomain
>>>> Node ID: 1
>>>> Multicast addresses: 239.192.78.3
>>>> Node addresses: 10.56.150.3
>>>> [root@as-1 ~]# cman_tool status -f
>>>> Version: 6.1.0
>>>> Config Version: 19
>>>> Cluster Name: azerothcluster
>>>> Cluster Id: 20148
>>>> Cluster Member: Yes
>>>> Cluster Generation: 76
>>>> Membership state: Cluster-Member
>>>> Nodes: 2
>>>> Expected votes: 3
>>>> Total votes: 2
>>>> Quorum: 2
>>>> Active subsystems: 8
>>>> Flags: Dirty
>>>> Ports Bound: 0 177
>>>> Node name: as-1.localdomain
>>>> Node ID: 1
>>>> Multicast addresses: 239.192.78.3
>>>> Node addresses: 10.56.150.3
>>>>
>>>> It seems the cluster cannot fence one of the nodes. How can I solve this?
>>>>
>>>>> I opened a new window and could ssh to as-2, but after login I could
>>>>> not do anything; even a simple 'ls' command hung.
>>>>>
>>>>> It seems the system stays alive but does not provide any service.
>>>>> Really bad.
>>>>>
>>>>> Any way to debug this issue?
>>>>>
>>>>>> Mockey Chen wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have a two-node cluster; to avoid split-brain,
>>>>>>> I use iLO as the fence
>>>>>>> device, with an IP tiebreaker. Here is my /etc/cluster/cluster.conf:
>>>>>>>
>>>>>>> <?xml version="1.0"?>
>>>>>>> <cluster alias="azerothcluster" config_version="19" name="azerothcluster">
>>>>>>>     <cman expected_votes="3" two_node="0"/>
>>>>>>>     <clusternodes>
>>>>>>>         <clusternode name="as-1.localdomain" nodeid="1" votes="1">
>>>>>>>             <fence>
>>>>>>>                 <method name="1">
>>>>>>>                     <device name="ilo1"/>
>>>>>>>                 </method>
>>>>>>>             </fence>
>>>>>>>         </clusternode>
>>>>>>>         <clusternode name="as-2.localdomain" nodeid="2" votes="1">
>>>>>>>             <fence>
>>>>>>>                 <method name="1">
>>>>>>>                     <device name="ilo2"/>
>>>>>>>                 </method>
>>>>>>>             </fence>
>>>>>>>         </clusternode>
>>>>>>>     </clusternodes>
>>>>>>>     <quorumd interval="1" tko="10" votes="1" label="pingtest">
>>>>>>>         <heuristic program="ping 10.56.150.1 -c1 -t1" score="1"
>>>>>>>             interval="2" tko="3"/>
>>>>>>>     </quorumd>
>>>>>>>     <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>>>>>>>     <fencedevices>
>>>>>>>         <fencedevice agent="fence_ilo" hostname="10.56.154.18"
>>>>>>>             login="power" name="ilo1" passwd="pass"/>
>>>>>>>         <fencedevice agent="fence_ilo" hostname="10.56.154.19"
>>>>>>>             login="power" name="ilo2" passwd="pass"/>
>>>>>>>     </fencedevices>
>>>>>>>     ...
>>>>>>>     ...
>>>>>>>
>>>>>>> To test the "one node loses heartbeat" case, I disabled the ethernet
>>>>>>> card (eth0) on as-1. I expected as-2 to take over the services on as-1
>>>>>>> and as-1 to reboot. What actually happened: as-1 lost its connection
>>>>>>> to as-2; as-2 detected this and tried to re-form the cluster, but
>>>>>>> failed. Here is the syslog from as-2:
>>>>>>>
>>>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] The token was lost in the
>>>>>>> OPERATIONAL state.
>>>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] Receive multicast socket
>>>>>>> recv buffer size (288000 bytes).
>>>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] Transmit multicast socket
>>>>>>> send buffer size (262142 bytes).
>>>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] entering GATHER state
>>>>>>> from 2.
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering GATHER state
>>>>>>> from 0.
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Creating commit token
>>>>>>> because I am the rep.
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Saving state aru 1f4 high
>>>>>>> seq received 1f4
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Storing new sequence id
>>>>>>> for ring 2c
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering COMMIT state.
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering RECOVERY state.
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] position [0] member
>>>>>>> 10.56.150.4:
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] previous ring seq 40 rep
>>>>>>> 10.56.150.3
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] aru 1f4 high delivered
>>>>>>> 1f4 received flag 1
>>>>>>>
>>>>>>> Message from syslogd@ at Tue Feb 24 21:25:40 2009 ...
>>>>>>> as-2 clurgmgrd[4194]: <emerg> #1: Quorum Dissolved
>>>>>>>
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Did not need to originate
>>>>>>> any messages in recovery.
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Sending initial ORF token
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] CLM CONFIGURATION CHANGE
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] New Configuration:
>>>>>>> Feb 24 21:25:40 as-2 clurgmgrd[4194]: <emerg> #1: Quorum Dissolved
>>>>>>> Feb 24 21:25:40 as-2 kernel: dlm: closing connection to node 1
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.4)
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Left:
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.3)
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Joined:
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CMAN ] quorum lost, blocking
>>>>>>> activity
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] CLM CONFIGURATION CHANGE
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] New Configuration:
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.4)
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Left:
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Joined:
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [SYNC ] This node is within the
>>>>>>> primary component and will provide service.
>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Cluster is not quorate. Refusing
>>>>>>> connection.
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering OPERATIONAL
>>>>>>> state.
>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Error while processing connect:
>>>>>>> Connection refused
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] got nodejoin message
>>>>>>> 10.56.150.4
>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Invalid descriptor specified (-111).
>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CPG  ] got joinlist message from
>>>>>>> node 2
>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Someone may be attempting something
>>>>>>> evil.
>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Error while processing get: Invalid
>>>>>>> request descriptor
>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Invalid descriptor specified (-111).
>>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Someone may be attempting something
>>>>>>> evil.
>>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Error while processing get: Invalid
>>>>>>> request descriptor
>>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Invalid descriptor specified (-21).
>>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Someone may be attempting something
>>>>>>> evil.
>>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Error while processing disconnect:
>>>>>>> Invalid request descriptor
>>>>>>> Feb 24 21:25:41 as-2 avahi-daemon[3756]: Withdrawing address record
>>>>>>> for 10.56.150.144 on eth0.
>>>>>>> Feb 24 21:25:41 as-2 in.rdiscd[8641]: setsockopt (IP_ADD_MEMBERSHIP):
>>>>>>> Address already in use
>>>>>>> Feb 24 21:25:41 as-2 in.rdiscd[8641]: Failed joining addresse
>>>>>>>
>>>>>>> I also found some errors in as-1's syslog:
>>>>>>>
>>>>>>> Feb 25 11:27:09 as-1 clurgmgrd[4332]: <err> #52: Failed changing RG
>>>>>>> status
>>>>>>> Feb 25 11:27:09 as-1 clurgmgrd: [4332]: <warning> Link for eth0: Not
>>>>>>> detected
>>>>>>> Feb 25 11:27:09 as-1 clurgmgrd: [4332]: <warning> No link on eth0...
>>>>>>> ...
>>>>>>> Feb 25 11:27:36 as-1 ccsd[4268]: Unable to connect to cluster
>>>>>>> infrastructure after 30 seconds.
>>>>>>> ...
>>>>>>> Feb 25 11:28:06 as-1 ccsd[4268]: Unable to connect to cluster
>>>>>>> infrastructure after 60 seconds.
>>>>>>> ...
>>>>>>> Feb 25 11:28:06 as-1 ccsd[4268]: Unable to connect to cluster
>>>>>>> infrastructure after 90 seconds.
>>>>>>>
>>>>>>> Any comment is appreciated!
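[Editor's note on the "quorum lost, blocking activity" message above: with expected_votes="3" (two 1-vote nodes plus a 1-vote qdisk) and two_node="0", cman requires expected_votes/2 + 1 = 2 votes for quorum, which matches the "Quorum: 2" line in the cman_tool status output earlier in the thread. When as-1 drops out and qdiskd is not running, as-2 holds only its own single vote and goes inquorate. A minimal sketch of that arithmetic, using the values from this thread:]

```shell
# Vote arithmetic for the cluster.conf above.
expected_votes=3   # two 1-vote nodes + 1-vote qdisk
node_votes=1       # what as-2 holds alone after as-1 drops out
qdisk_votes=0      # qdiskd never started, so its vote is missing

# cman's quorum threshold: expected_votes/2 + 1 (integer division)
quorum=$(( expected_votes / 2 + 1 ))
total=$(( node_votes + qdisk_votes ))

echo "quorum needed: $quorum, votes held: $total"
# prints: quorum needed: 2, votes held: 1
if [ "$total" -lt "$quorum" ]; then
    echo "inquorate: cman blocks activity"
fi
```

With a working qdisk the surviving node would hold 1 + 1 = 2 votes and stay quorate; the alternative for a qdisk-less two-node setup is two_node="1" with expected_votes="1", relying on fencing alone.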

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster