Hi all,

I corrected my cluster (I built a new one with luci). Here is my new cluster.conf (and my questions after ... :))

<?xml version="1.0"?>
<cluster alias="TEST" config_version="85" name="TEST">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="node1" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rsa_node1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="node2" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rsa_node2"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="5"/>
        <fencedevices>
                <fencedevice agent="fence_rsa" ipaddr="rsa_node1" login="ADMIN" name="rsa_node1" passwd="password"/>
                <fencedevice agent="fence_rsa" ipaddr="rsa_node2" login="ADMIN" name="rsa_node2" passwd="password"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="test" nofailback="0" ordered="1" restricted="1">
                                <failoverdomainnode name="node1" priority="1"/>
                                <failoverdomainnode name="node2" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="192.168.10.20" monitor_link="1"/>
                        <clusterfs device="/dev/vg1data/lv1data" force_unmount="1" fsid="47478" fstype="gfs2" mountpoint="/data" name="lv1data" self_fence="0"/>
                        <clusterfs device="/dev/vg1app/lv1app" force_unmount="1" fsid="11699" fstype="gfs2" mountpoint="/app" name="lv1app" self_fence="0"/>
                </resources>
                <service autostart="1" domain="TEST" exclusive="1" name="TEST" recovery="disable">
                        <ip ref="172.28.104.80">
                                <clusterfs fstype="gfs" ref="lv1data"/>
                                <clusterfs fstype="gfs" ref="lv1app"/>
                        </ip>
                </service>
        </rm>
        <totem consensus="4800" join="60" token="10000" token_retransmits_before_loss_const="20"/>
        <quorumd device="/dev/vg1quorum/lv1quorum" interval="1" min_score="1" tko="10" votes="3">
                <heuristic interval="10" program="/usr/sbin/qdiskd" score="1"/>
        </quorumd>
</cluster>

clustat gives:

Cluster Status for TEST @ Tue Mar  9 18:12:25 2010
Member Status: Quorate

 Member Name                              ID   Status
 ------ ----                              ---- ------
 node1                                       1 Online, Local, rgmanager
 node2                                       2 Online, rgmanager
 /dev/vg1quorum/lv1quorum                    0 Online, Quorum Disk

 Service Name                             Owner (Last)           State
 ------- ----                             ----- ------           -----
 service:TEST                             node1                  started

Now my questions (sorry, I have read several websites in English, but my English is a bit poor ...):

- <cman expected_votes="5"/> => what is this number 5? Where does it come from?
- I don't understand why the quorum disk is visible now. I did the same thing as before (mkqdisk etc.).
- <quorumd device="/dev/vg1quorum/lv1quorum" interval="1" min_score="1" tko="10" votes="3"> => why 3? Is it for node1, node2 and the quorum disk?
- Luci proposes to start cman and rgmanager automatically; is that a good idea? qdiskd is started by the system in runlevels 2345; same question, is that a good thing?
- I still have a dedicated network, with node1 at 10.0.0.10 and node2 at 10.0.0.20; can I declare it in cluster.conf as a heartbeat network?
- Last one, about fencing (:)). Node1 will use rsa_node2 to kill node2, and node2 will use rsa_node1 to kill node1? =>

        <clusternodes>
                <clusternode name="node1" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rsa_node1"/>
                                </method>
                        </fence>
                </clusternode>

  Is that right?

With this configuration I shut down my first node (violently) and my service came up on the second one, so it seems to work fine, but I don't really understand why ...

Thanks for your help

mog
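
(A note on the vote counts, since two of the questions touch on them: as far as I understand it, cman expects expected_votes to be the sum of all node votes plus the quorumd votes, so with this configuration the tally would be:)

        node1 votes            1
        node2 votes            1
        quorumd votes        + 3
        ------------------------
        cman expected_votes    5    (quorum threshold = 5/2 + 1 = 3)

With that threshold, a single surviving node plus the quorum disk (1 + 3 = 4) stays quorate, while a node on its own (1 vote) does not; presumably that is why luci chose 3 for the quorumd votes.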
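
(On the dedicated 10.0.0.x network: cman/openais sends its cluster traffic over whichever interface the cluster node names resolve to, so a common way to pin the heartbeat to that bond is to make node1 and node2 resolve to the 10.0.0.x addresses on both machines. A minimal sketch of /etc/hosts, assuming those addresses really sit on the bond you want cluster traffic on:)

        # /etc/hosts (same on both nodes) -- the names used as clusternode
        # names in cluster.conf resolve to the dedicated interconnect, so
        # cman/openais traffic stays on that network
        10.0.0.10   node1
        10.0.0.20   node2

(The application IP and the RSA adapters keep their own names and addresses; only the names used as clusternode names matter here.)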
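
(On the fencing question: the <device> listed under a <clusternode> is the device used to fence that node, so with the configuration above node2 would indeed use rsa_node1 to kill node1, and vice versa. A quick way to sanity-check that each node can actually log in to the other's RSA adapter, outside of the cluster stack; this is only a sketch, check fence_rsa -h for the exact options on your version:)

        # run on node2: can it reach node1's RSA adapter, the device
        # fenced will use when node1 has to be killed?
        fence_rsa -a rsa_node1 -l ADMIN -p password -o status

        # and the mirror check, run on node1 against node2's adapter
        fence_rsa -a rsa_node2 -l ADMIN -p password -o status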
On Monday, March 8, 2010 at 13:37 +0200, שלום קלמר wrote:
> Hi.
>
> You have some errors in your cluster.conf file.
>
> 1. You must check that fence_rsa works before starting the cluster.
> 2. If you are using quorumd, change cman to: <cman expected_votes="3" two_node="0"/>
> 3. Put quorumd votes=1, min_score=1.
> 4. Change your heuristic program to something like a ping to your router (it's better to add more heuristics).
> 5. Install the most up-to-date rpms of cman, openais & rgmanager.
> 6. clustat should show the qdisk as online. cman should start qdiskd.
>
> For more information you can read:
>
> http://sources.redhat.com/cluster/wiki/FAQ/CMAN (it helped me.)
>
> Regards
>
> Shalom.
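
(Regarding point 4 above: a sketch of what a ping-based heuristic could look like in the <quorumd> block, also with votes="1" as suggested in point 3. The 192.168.10.1 address is only a placeholder; it should be something both nodes must always be able to reach, e.g. the default router.)

        <quorumd device="/dev/vg1quorum/lv1quorum" interval="1" min_score="1" tko="10" votes="1">
                <heuristic program="ping -c1 -w1 192.168.10.1" interval="2" score="1"/>
        </quorumd>

qdiskd treats an exit status of 0 as a passing heuristic, which is what ping returns when the target answers; the current program="/usr/sbin/qdiskd" just points at the daemon itself and does not test anything useful.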
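
(Regarding point 6, and the question about verifying that the quorum disk is really in use: something like the following should confirm it from either node; the exact output wording varies between versions.)

        # the label written by mkqdisk must be visible from both nodes
        mkqdisk -L

        # the vote totals reported here should include the quorum disk
        # (Expected votes / Total votes, plus a quorum device entry)
        cman_tool status

        # clustat should list the device itself as "Online, Quorum Disk"
        clustat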
>
> On Mon, Mar 8, 2010 at 12:44 PM, <mogruith@xxxxxxx> wrote:
> > Hi all
> >
> > Here is my cluster.conf:
> >
> > <?xml version="1.0"?>
> > <cluster config_version="6" name="TEST">
> >         <quorumd device="/dev/vg1quorom/lv1quorom" interval="1" label="quorum" min_score="3" tko="10" votes="3">
> >                 <heuristic interval="2" program="/usr/sbin/qdiskd" score="1"/>
> >         </quorumd>
> >         <fence_daemon post_fail_delay="0" post_join_delay="3"/>
> >         <cman expected_votes="1" two_node="1"/>
> >         <clusternodes>
> >                 <clusternode name="node1" nodeid="1" votes="1">
> >                         <fence>
> >                                 <method name="1">
> >                                         <device name="RSA_node1"/>
> >                                 </method>
> >                         </fence>
> >                 </clusternode>
> >                 <clusternode name="node2" nodeid="2" votes="1">
> >                         <fence>
> >                                 <method name="1">
> >                                         <device name="RSA_node2"/>
> >                                 </method>
> >                         </fence>
> >                 </clusternode>
> >         </clusternodes>
> >         <cman/>
> >         <fencedevices>
> >                 <fencedevice agent="fence_rsa" ipaddr="RSA_node1" login="USER" name="RSA_node1" passwd="PASSWORD"/>
> >                 <fencedevice agent="fence_rsa" ipaddr="RSA_node2" login="USER" name="RSA_node2" passwd="PASSWORD"/>
> >         </fencedevices>
> >         <rm>
> >                 <failoverdomains>
> >                         <failoverdomain name="TEST" ordered="1" restricted="1">
> >                                 <failoverdomainnode name="node1" priority="1"/>
> >                                 <failoverdomainnode name="node2" priority="2"/>
> >                         </failoverdomain>
> >                 </failoverdomains>
> >                 <resources>
> >                         <ip address="172.28.104.80" monitor_link="1"/>
> >                         <clusterfs device="/dev/vg1data/lv1data" force_unmount="0" fsid="30516" fstype="gfs2" mountpoint="/data" name="DATA" options=""/>
> >                 </resources>
> >                 <service autostart="1" domain="TEST" exclusive="1" name="TEST">
> >                         <ip ref="172.28.104.80">
> >                                 <clusterfs ref="DATA"/>
> >                         </ip>
> >                 </service>
> >         </rm>
> > </cluster>
> >
> > N.B. node1, node2, RSA_node1 and RSA_node2 are set in /etc/hosts.
> >
> > When I move the service from node1 to node2 (by forcing a reboot of node1), it fails (probably because of a network problem), but is there a timeout? If node2 can't connect to node1's RSA, why doesn't it consider node1 "dead", and why doesn't the service go to node2?
> >
> > Here is the clustat:
> >
> > [root@node2 ~]# clustat
> > Cluster Status for TEST @ Mon Mar  8 11:33:32 2010
> > Member Status: Quorate
> >
> >  Member Name                              ID   Status
> >  ------ ----                              ---- ------
> >  node1                                       1 Offline
> >  node2                                       2 Online, Local, rgmanager
> >
> >  Service Name                             Owner (Last)           State
> >  ------- ----                             ----- ------           -----
> >  service:TEST                             node1                  stopping
> >
> > It has been "stopping" like that for 30 minutes!
> >
> > Here is the log:
> >
> > Mar  8 11:35:45 node2 fenced[7038]: agent "fence_rsa" reports: Unable to connect/login to fencing device
> > Mar  8 11:35:45 node2 fenced[7038]: fence "node1" failed
> > Mar  8 11:35:50 node2 fenced[7038]: fencing node "node1"
> > Mar  8 11:35:56 node2 fenced[7038]: agent "fence_rsa" reports: Unable to connect/login to fencing device
> > Mar  8 11:35:56 node2 fenced[7038]: fence "node1" failed
> >
> > Why is node2 still trying to fence node1?
> >
> > Here is something else:
> >
> > [root@node2 ~]# cman_tool services
> > type             level name       id       state
> > fence            0     default    00010001 FAIL_START_WAIT
> > [2]
> > dlm              1     rgmanager  00020001 FAIL_ALL_STOPPED
> > [1 2]
> >
> > How can I verify that quorum is used?
> >
> > Last question: I have 3 networks (6 NICs, 3 bondings), one of which is dedicated to the heartbeat. Where do I have to set it in cluster.conf? I would like node1 and node2 to communicate over their own bond3.
> >
> > Thanks for your help.
> >
> > mog

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster