On 11/09/13 7:31 PM, Digimer wrote:
> That log message does show the node joining. Can you reliably
> reproduce this? If so, can you please 'tail -f -n 0 /var/log/messages'
> on both nodes, break the cluster and wait for the node to restart,
> 'tail' the rebooted node's /var/log/messages, wait the six minutes and
> then, after the second fence occurs, post both node's logs?
>

I was indeed able to reproduce this reliably, and that is where my
confusion came from. I don't understand why the node appears to join
(and, per the log, leave again immediately afterwards), all within the
360-second post-join fence delay, and still gets fenced.

As this is a semi-production system (we had to move quickly), I have now
gone with a qdisk-based approach, using a small iSCSI disk at a remote
site. This works well and reliably as far as I can tell from the testing
I have done so far.

I would still be interested to hear why the initial approach failed. How
would manually starting the cluster services have made a difference
anyway? Does that mean one should join the cluster and the fence domain
first, to ensure a stateless join, and only then start rgmanager? Isn't
that something that could also be achieved with some delays in the
startup scripts?

Either way, thank you all for helping out so quickly!

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster