Ben J wrote:
> Hi Christine,
>
> Thanks for the reply.
>
> I've been able to today replicate the cluster failing again by rebooting
> one of the standby nodes. I captured tcpdump data from 2 of the active
> nodes (store01 and store02) and from the 2 standby nodes (ha01 and
> ha02). Ha01 is the node that we rebooted, so it will only show cluster
> communication that occurred up until it rebooted. See attached zip file.
>
> Note, I've sent this off-list as I didn't want to send this to the list
> for obvious reasons. :)
>
> Let me know if you need any further information. I've had the cluster
> running with debug level 7 logging, so I've got that information as well
> if you'd like me to shoot through that as well.
>

Thanks for the tcpdumps. They were very helpful in eliminating several
possible causes I had considered. Unfortunately, I still don't quite know
what IS happening!

It seems that when one node leaves the cluster, the others go into
transition MASTER state (because they all saw the node go down at the
same time) and they never resolve this state. What normally happens is
that one node nominates itself master and takes over the transition, but
for some reason that is not happening here.

I did manage to reproduce it (or something very similar) on a three-node
cluster yesterday. Unfortunately, I didn't have debugging enabled in the
modules, so it told me only a little more than I already knew. I have
restarted some tests and I hope they will yield some results soon(ish).

-- 
Chrissie

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster