Hi,

I found a workaround for this: I removed cman, clvmd and gfs from the
startup directories, rebooted all the nodes, and then started cman by
hand on all nodes simultaneously to avoid fencing. I then started clvmd
the same way.

I think this problem may have been caused by our Cisco 2960G gigabit
switches, which might be blocking multicast traffic. How can I check
that the multicasts used by clvmd are being received by all nodes?

We bond eth0 and eth1 across two Cisco 2960G switches so that we have a
redundant data path for all the CMAN traffic. The nodes are connected
to a SAN via fibre HBAs. Are these switches suitable for this
application?

TIA for any help, and many thanks for the help already given.

Shaun

-----Original Message-----
From: Shaun Mccullagh
Sent: Wednesday, December 03, 2008 2:57 PM
To: linux clustering
Subject: RE: Unexpected problems with clvmd

Hi Chrissie,

Fence status is 'none' on all nodes.

Shaun

for i in 10.0.154.10 pan5.tmf pan6.tmf pan4.tmf ; do ssh $i /sbin/group_tool | grep fence; done
fence            0     default  00000000 none
fence            0     default  00000000 none
fence            0     default  00000000 none
fence            0     default  00000000 none

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx
[mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Christine Caulfield
Sent: Wednesday, December 03, 2008 2:53 PM
To: linux clustering
Subject: Re: Unexpected problems with clvmd

Shaun Mccullagh wrote:
> Hi,
>
> I tried to add another node to our 3 node cluster this morning.
>
> Initially things went well, but I wanted to check the new node booted
> correctly.
>
> After the second reboot clvmd failed to start up on the new node
> (called pan4):
>
> [root@pan4 ~]# clvmd -d1 -T20
> CLVMD[8e1e8300]: Dec  3 14:24:09 CLVMD started
> CLVMD[8e1e8300]: Dec  3 14:24:09 Connected to CMAN
> CLVMD[8e1e8300]: Dec  3 14:24:12 CMAN initialisation complete
>
> group_tool reports this output for clvmd on all four nodes in the
> cluster:
>
> dlm              1     clvmd    00010005 FAIL_START_WAIT
> dlm              1     clvmd    00010005 FAIL_ALL_STOPPED
> dlm              1     clvmd    00010005 FAIL_ALL_STOPPED
> dlm              1     clvmd    00000000 JOIN_STOP_WAIT
>
> Otherwise the cluster is OK:
>
> [root@brik3 ~]# clustat
> Cluster Status for mtv_gfs @ Wed Dec  3 14:38:26 2008
> Member Status: Quorate
>
>  Member Name                        ID   Status
>  ------ ----                        ---- ------
>  pan4                                  4 Online
>  pan5                                  5 Online
>  nfs-pan                               6 Online
>  brik3-gfs                             7 Online, Local
>
> [root@brik3 ~]# cman_tool status
> Version: 6.1.0
> Config Version: 4
> Cluster Name: mtv_gfs
> Cluster Id: 14067
> Cluster Member: Yes
> Cluster Generation: 172
> Membership state: Cluster-Member
> Nodes: 4
> Expected votes: 4
> Total votes: 4
> Quorum: 3
> Active subsystems: 8
> Flags: Dirty
> Ports Bound: 0 11
> Node name: brik3-gfs
> Node ID: 7
> Multicast addresses: 239.192.54.42
> Node addresses: 172.16.1.60
>
> It seems I have created a deadlock. What is the best way to fix this?
>
> TIA
>

The first thing is to check the fencing status, via group_tool and
syslog. If fencing hasn't completed then the DLM can't recover.

--
Chrissie

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
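[Editor's note on the multicast question at the top of the thread: CMAN's totem traffic uses the multicast address shown by `cman_tool status` (239.192.54.42 here), so the most direct check is to run tcpdump on each node's cluster interface while the cluster is up, e.g. `tcpdump -i bond0 host 239.192.54.42` (the interface name bond0 is an assumption based on the bonded eth0/eth1 setup). As a switch-independent sanity check, the sketch below sends and receives a probe datagram on that multicast group from user space. It is illustrative only: the port number is arbitrary, not the port cman uses, and it only proves generic multicast delivery, not cluster health.]

```python
import socket
import struct

# Multicast group reported by `cman_tool status` on this cluster.
GROUP = "239.192.54.42"
# Arbitrary probe port for this test -- NOT a port cman itself uses.
PORT = 16405

def open_receiver(group=GROUP, port=PORT, timeout=5.0):
    """Bind a UDP socket and join the multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # Join the group; 0.0.0.0 (INADDR_ANY) lets the kernel pick the interface.
    mreq = struct.pack("4s4s",
                       socket.inet_aton(group),
                       socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    sock.settimeout(timeout)
    return sock

def send_probe(payload=b"clvmd-mcast-probe", group=GROUP, port=PORT):
    """Send one multicast datagram with TTL 1 (stays on the local segment)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    sock.sendto(payload, (group, port))
    sock.close()

if __name__ == "__main__":
    receiver = open_receiver()
    send_probe()  # in practice, run the sender on a *different* node
    data, (addr, _) = receiver.recvfrom(1024)
    print("received %r from %s" % (data, addr))
```

In practice you would run the receiver on every node and the sender on one of them; any node whose receiver times out is not seeing the group. If probes are lost only across the switches, a common culprit on Catalyst 2960-class hardware is IGMP snooping pruning the group when no IGMP querier is present on the VLAN.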