I think I understand how it works now. It's good to know that the loser of the first race doesn't immediately try fence device 2. If it really is a race, then the delay before node 2's retry is what gives the winner time to kill it before it tries again. The ssh handshake when logging into the APC does take a few seconds, so if I set the delay long enough to span the necessary logins, that should take care of it.
If logging into all fence devices before any of them is turned off can't easily be done, then the other way to make it safe would be to delay all the logoffs until the end of the process.
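To make the ordering concrete, here is a rough Python sketch of what I have in mind. It is only a simulation: FakeFenceDevice and fence_all are names I made up for illustration, not part of any real fence agent, and the "devices" just record what was called instead of talking to an APC.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeFenceDevice:
    """Hypothetical stand-in for a power switch session (no real I/O)."""
    name: str
    log: List[str]  # shared event list so the overall ordering is visible

    def login(self) -> None:
        self.log.append(f"login {self.name}")

    def power_off(self) -> None:
        self.log.append(f"off {self.name}")

    def logout(self) -> None:
        self.log.append(f"logout {self.name}")


def fence_all(devices: List[FakeFenceDevice]) -> None:
    """Log into every device first, then cut power, and log off only at the end."""
    opened: List[FakeFenceDevice] = []
    try:
        # Phase 1: pay the slow login cost on every device up front.
        for dev in devices:
            dev.login()
            opened.append(dev)
        # Phase 2: with all sessions open, the power-off commands run back to back.
        for dev in devices:
            dev.power_off()
    finally:
        # Phase 3: logoffs are deferred until everything else is done.
        for dev in opened:
            dev.logout()


if __name__ == "__main__":
    events: List[str] = []
    fence_all([FakeFenceDevice("apc1", events), FakeFenceDevice("apc2", events)])
    print(events)
```

The point is just that no slow login or logoff sits between the first power-off and the last one, so the window in which the other node could still fight back is as short as possible.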
Thanks for your help. I need to make sure the boss is getting her money's worth from this effort.
scottb

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster