On 11/28/05, David Teigland <teigland@xxxxxxxxxx> wrote:
> On Mon, Nov 28, 2005 at 02:07:52PM -0700, busy admin wrote:
> > I'm doing some testing with manual fencing and here's what I've found:
> >
> > Using the RHEL4 branch of code and running on a RHEL4 U1 system, manual
> > fencing doesn't seem to work. If I have a simple two-node cluster and
> > force a reboot of the primary node (node1 - running the service), the
> > service fails over to the secondary node (node2) and starts running
> > without me having to execute 'fence_ack_manual -n node1'. In fact, if I
> > look at the /tmp filesystem I don't see the fifo file ever being
> > created, so if I try to execute 'fence_ack_manual' it complains about
> > the fifo file not existing. It's as if fenced calling fence_manual
> > isn't creating the fifo file to begin with.
> >
> > Using the STABLE branch and building against a 2.6.12 kernel, manual
> > fencing works as expected. When I force a reboot of the system running
> > the service, the service doesn't fail over until I manually execute
> > 'fence_ack_manual'; then the service starts successfully on the
> > remaining node.
> >
> > Any comments? Anyone else observe this same behavior? Is this just
> > broken in the RHEL4 branch?
>
> I can't recall or see any changes since RHEL4U1 that would explain this.
> Could you run fenced -D and send the output?

Here's a quick summary of what I've done and the results. To simplify the config I've just been running ccsd and cman via init scripts during boot and then manually executing 'fenced', 'fence_tool', or the fenced init script. The results I see are random successes and failures!

Initial test - rebooted both systems and then, on both, executed 'fenced -D'. Both systems joined the cluster and it was quorate. Rebooted one node and, to my surprise, manual fencing worked, meaning /tmp/fence_manual.fifo was created and I had to run 'fence_ack_manual' on the other node. Tried again when the first node came back up and again everything worked as expected.

Additional testing - rebooted both systems and then, on both, executed 'fence_tool join -w'. Both systems joined the cluster and it was quorate. Rebooted one node and no fencing was done (nothing logged in /var/log/messages).

Rebooted both systems again and this time executed 'fenced -D' on both nodes. Rebooted a node and fencing worked: it was logged in /var/log/messages and I had to manually run 'fence_ack_manual -n x64-5'. When that node came back up I again manually executed 'fenced -D' on it and the cluster was quorate. I then rebooted the other node and again fencing worked!

So again I rebooted both nodes and executed 'fence_tool join -w' on each. I rebooted a node and fencing worked this time: fenced msgs were logged to /var/log/messages, /tmp/fence_manual.fifo was created, and I had to execute 'fence_ack_manual -n x64-4' to recover.

... more testing w/mixed results ...

Modified the fenced init script to execute 'fenced -D &' instead of 'fence_tool join -w' and used chkconfig to turn it on on both systems, then rebooted them. Both systems restarted and joined the cluster. Once again I rebooted one node (x64-4) and fencing didn't work... nothing was logged in /var/log/messages from fenced.

See the corresponding /var/log/messages, fenced -D output and cluster.conf below.
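Before those, for anyone following along, here is a minimal sketch (not taken from the failing runs above, and assuming fence_manual's default /tmp/fence_manual.fifo path) of what the acknowledgement flow looks like on the surviving node when manual fencing is behaving:

---8<--- manual fencing acknowledgement (sketch)
# fenced has decided the dead node must be fenced and has run the
# fence_manual agent, which creates the fifo and blocks waiting on it.
ls -l /tmp/fence_manual.fifo      # should exist while the fence is pending

# After verifying the dead node really is down (powered off / rebooted),
# acknowledge the fence by hand; fence_manual then exits and recovery
# (and service failover) can continue.
fence_ack_manual -n x64-4
--->8---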
---8<--- /var/log/messages
Nov 29 16:45:56 x64-5 kernel: CMAN: removing node x64-4 from the cluster : Shutdown
Nov 29 16:46:34 x64-5 kernel: e1000: eth1: e1000_watchdog: NIC Link is Down
Nov 29 16:49:22 x64-5 kernel: e1000: eth1: e1000_watchdog: NIC Link is Down
Nov 29 16:49:24 x64-5 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
Nov 29 16:50:08 x64-5 kernel: CMAN: node x64-4 rejoining
--->8---

---8<--- fenced -D output
fenced: 1133307956 start:
fenced: 1133307956 event_id = 2
fenced: 1133307956 last_stop = 1
fenced: 1133307956 last_start = 2
fenced: 1133307956 last_finish = 1
fenced: 1133307956 node_count = 1
fenced: 1133307956 start_type = leave
fenced: 1133307956 members:
fenced: 1133307956 nodeid = 2 "x64-5"
fenced: 1133307956 do_recovery stop 1 start 2 finish 1
fenced: 1133307956 add node 1 to list 2
fenced: 1133307956 finish:
fenced: 1133307956 event_id = 2
fenced: 1133307956 last_stop = 1
fenced: 1133307956 last_start = 2
fenced: 1133307956 last_finish = 2
fenced: 1133307956 node_count = 0
fenced: 1133308214 stop:
fenced: 1133308214 event_id = 0
fenced: 1133308214 last_stop = 2
fenced: 1133308214 last_start = 2
fenced: 1133308214 last_finish = 2
fenced: 1133308214 node_count = 0
fenced: 1133308214 start:
fenced: 1133308214 event_id = 3
fenced: 1133308214 last_stop = 2
fenced: 1133308214 last_start = 3
fenced: 1133308214 last_finish = 2
fenced: 1133308214 node_count = 2
fenced: 1133308214 start_type = join
fenced: 1133308214 members:
fenced: 1133308214 nodeid = 2 "x64-5"
fenced: 1133308214 nodeid = 1 "x64-4"
fenced: 1133308214 do_recovery stop 2 start 3 finish 2
fenced: 1133308214 finish:
fenced: 1133308214 event_id = 3
fenced: 1133308214 last_stop = 2
fenced: 1133308214 last_start = 3
fenced: 1133308214 last_finish = 3
fenced: 1133308214 node_count = 0
--->8---

---8<--- cluster.conf
<?xml version="1.0"?>
<cluster config_version="22" name="testcluster">
  <cman expected_votes="1" two_node="1"/>
  <clusternodes>
    <clusternode name="x64-5">
      <fence>
        <method name="single">
          <device name="manual" ipaddr="x64-5"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="x64-4">
      <fence>
        <method name="single">
          <device name="manual" ipaddr="x64-4"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_manual" name="manual"/>
  </fencedevices>
  <rm>
    <failoverdomains/>
    <resources>
      <ip address="10.0.0.120" monitor_link="1"/>
    </resources>
    <service name="ipservice">
      <ip ref="10.0.0.120"/>
    </service>
  </rm>
</cluster>
--->8---

--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster