Doug Tucker wrote:
Thanks to you and Carlos. I understand a bit better now what you are
referring to; however, I don't believe that is the issue. The reason we
went to the crossover cable was to avoid this issue, as we had a switch
die once, and both then thought they were master and tried to fence the
other. In my situation, there is no reason for the missed heartbeat
that I can find. The interfaces have not gone down. We ran a test
where I started a ping between the 2 that wrote out to a file until a
"heartbeat" missed and a reboot occurred. There was not a single missed
ping between the 2 nodes prior to the event. Also in a split brain,
both machines should recognize the other one "gone" and try to become
master. In this case, only 1 of the nodes at a time is seeing a "missed
heartbeat" and then attempting to fence the other. We have replaced all
hardware, cables included, to ensure it wasn't that. This appears
to be a software bug of some sort. Again, we have another 2 node cluster
that this doesn't occur on, but it is running a different kernel and
gfs module.
ping is icmp, not udp or tcp, so a clean ping run doesn't prove the
heartbeat packets are getting through. is the heartbeat udp or tcp?
perhaps you could ensure both servers have their clocks sync'ed and then
run wireshark on each server capturing the crossover cable ethernet port
and see which one is failing to signal the other...
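
once you have the two captures, something like the following might help
compare them. it's only a rough sketch: it assumes you've already exported
the heartbeat packet timestamps (in seconds) from each capture into plain
lists, and the function names, the tolerance, and the interval values are
made up for illustration, not taken from any cluster tool.

```python
import bisect
from typing import List, Tuple


def find_gaps(timestamps: List[float], max_interval: float) -> List[Tuple[float, float]]:
    """Return (previous, next) timestamp pairs spaced more than max_interval apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_interval]


def unmatched(sent: List[float], received: List[float], tolerance: float) -> List[float]:
    """Return send times with no receive time within +/- tolerance seconds.

    Assumes both nodes' clocks are synced to well within `tolerance`.
    """
    received_sorted = sorted(received)
    missing = []
    for t in sent:
        # First receive time that is not earlier than (t - tolerance).
        i = bisect.bisect_left(received_sorted, t - tolerance)
        if i == len(received_sorted) or received_sorted[i] > t + tolerance:
            missing.append(t)
    return missing


# Hypothetical data: node A sends once per second, node B never sees t=2.
sent_by_a = [0.0, 1.0, 2.0, 3.0, 4.0]
seen_by_b = [0.001, 1.002, 3.001, 4.0]
print(unmatched(sent_by_a, seen_by_b, tolerance=0.05))
print(find_gaps(seen_by_b, max_interval=1.5))
```

running it in both directions (a's sends against b's receives, then the
reverse) should show whether the loss is one-sided, which would match only
one node at a time seeing a missed heartbeat.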
hth
yvette hirth
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster