David,

I have tried the same init scripts with both IPMI and DRAC fencing, with no
problems. With manual fencing, it seems that fence_manual introduces some
strangeness that triggers my problem.

What is the problem: when running manual fencing and doing failover testing,
my secondary node takes over the service without waiting for a
fence_ack_manual. This all works perfectly with automatic fencing (IPMI,
DRAC).

I have the same problem (most of the time) when I run this whole thing by
hand:

1. nodeA: ccsd
2. nodeB: ccsd
3. nodeA: cman_tool join -w
4. nodeB: cman_tool join -w
5. nodeA: fence_tool join -w
6. nodeB: fence_tool join -w

When I start to see the problem, on the next reboot of both systems I can
replace steps 5 & 6 with 'fenced -D'. Now if I try to fail over a machine,
manual fencing works perfectly (meaning it forces me to run fence_ack_manual
before a service fails over). Next, I can go in and change 'fenced -D' back
to 'fence_tool join -w' and things still work (it forces me to run
fence_ack_manual). Next, if I replace the manual steps above with the init
scripts, manual fencing breaks all over again until I repeat the above steps.

Sounds like a timing issue around fence_manual?

Let me know if you want me to try anything different.

Thanks for all your help.

On 11/30/05, David Teigland <teigland@xxxxxxxxxx> wrote:
> On Tue, Nov 29, 2005 at 07:53:09PM -0700, busy admin wrote:
> > Here's a quick summary of what I've done and the results... to
> > simplify the config I've just been running ccsd and cman via init
> > scripts during boot and then manually executing 'fenced' or 'fence_tool'
> > or the fenced init script. The results I see are random successes and
> > failures!
> >
> > Initial test - rebooted both systems and then, on both, executed 'fenced
> > -D'. Both systems joined the cluster and it was quorate.
> > Rebooted one node and to my surprise manual fencing worked, meaning
> > /tmp/fence_manual.fifo was created and I had to run 'fence_ack_manual'
> > on the other node. Tried again when the first node came back up and
> > again everything worked as expected.
> >
> > Additional testing - rebooted both systems and then, on both, executed
> > 'fence_tool join -w'. Both systems joined the cluster and it was
> > quorate. Rebooted one node and no fencing was done (nothing logged in
> > /var/log/messages).
> >
> > Rebooted both systems again and this time executed 'fenced -D' on both
> > nodes... rebooted a node and fencing worked, was logged in
> > /var/log/messages and I had to manually run 'fence_ack_manual -n x64-5'.
> > When that node came back up I again manually executed 'fenced
> > -D' on it and the cluster was quorate. I then rebooted the other node
> > and again fencing worked!
> >
> > So again I rebooted both nodes and executed 'fence_tool join -w' on
> > each... I again rebooted a node and fencing worked this time. fenced
> > msgs were logged to /var/log/messages, /tmp/fence_manual.fifo was
> > created and I had to execute 'fence_ack_manual -n x64-4' to recover.
> >
> > ... more testing w/mixed results ...
> >
> > Modified the fenced init script to execute 'fenced -D &' instead of
> > 'fence_tool join -w' and used chkconfig to turn it on on both systems
> > and rebooted them. Both systems restarted and joined the cluster. Once
> > again I rebooted one node (x64-4) and fencing didn't work... nothing
> > was logged in /var/log/messages from fenced. See corresponding
> > /var/log/messages, fenced -D output and cluster.conf below.
>
> It's not clear what you're trying to test or what you expect to happen.
> Here's the optimal way to start up a cluster from a newly rebooted state:
>
> 1. nodeA: ccsd
> 2. nodeB: ccsd
> 3. nodeA: cman_tool join -w
> 4. nodeB: cman_tool join -w
> 5. nodeA: fence_tool join
> 6. nodeB: fence_tool join
>
> It's best if steps 5 & 6 only happen after both nodes are members of
> the cluster (see 'cman_tool nodes'). If this is the case, then no
> nodes should be fenced when starting up.
>
> If you use the init scripts you may lose a little control and certainty
> about what happens when, so I'd suggest using the commands directly until
> you know that things are running correctly, then try the init scripts.
>
> If, from the state above, nodeB fails, then nodeA should always fence
> nodeB. With manual fencing, this means that a message should appear in
> nodeA's /var/log/messages telling you to reboot nodeB and run
> fence_ack_manual. If, by chance, nodeB reboots and rejoins the cluster
> before you get to running fence_ack_manual, the fencing system on nodeA
> will just complete the fencing operation itself and you don't need to run
> fence_ack_manual (and if you try, the fence_ack_manual command will report
> an error.)
>
> Dave
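
For reference, the startup order Dave describes (ccsd, then cman_tool join -w, then fence_tool join only once both nodes are cluster members) could be scripted per node roughly as below. This is only a sketch under stated assumptions, not a tested init script: the helper names (run, is_member, start_node) and the DRY_RUN switch are made up for illustration, and the is_member check assumes that 'cman_tool nodes' prints the status ("M" for a member) in the fourth column and the node name in the fifth, which should be verified against the real output before relying on it. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Per-node startup sketch; run it on each node (here x64-4 and x64-5).
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper. ASSUMPTION: 'cman_tool nodes' lists the status in
# column 4 ("M" = member) and the node name in column 5.
is_member() {
    cman_tool nodes | awk -v n="$1" '$4 == "M" && $5 == n { found = 1 } END { exit !found }'
}

# Start the cluster pieces in order; the arguments are the node names that
# must all be members before this node joins the fence domain.
start_node() {
    run ccsd
    run cman_tool join -w
    for node in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "+ wait until $node is a member (cman_tool nodes)"
        else
            until is_member "$node"; do sleep 1; done
        fi
    done
    run fence_tool join
}

start_node x64-4 x64-5
```

The point of the wait loop is the one Dave makes: if fence_tool join only runs after both nodes show up as members, neither node should be fenced during startup, which takes the boot-time race out of the picture when testing whether fence_manual waits for fence_ack_manual.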