Hi all,
I'm having some major stability problems with my three-node CS/GFS cluster.
Every two or three days, one of the nodes fences another, and I have to
hard-reboot the entire cluster to recover. I have had this happen twice
today. I don't know what's triggering the fencing, since as far as I can
tell all the nodes are up and running when it happens. In fact, I was
logged in to node3 just now, running 'top', when node2 fenced it.
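For what it's worth, after each incident I've been grepping the logs on
the surviving nodes along these lines (the daemon names are just the ones
I believe are involved on CentOS 5; nothing obvious jumps out at me):

    # what did the cluster daemons think happened around the fence?
    grep -E 'openais|fenced|dlm_controld|gfs_controld' /var/log/messages | tail -60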
When they come back up, they don't automatically mount their GFS
filesystems, even with "_netdev" specified as a mount option; however, the
node that comes up first mounts them all as part of bringing up the services.
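For reference, my fstab entries look roughly like this (the device path
and mount point here are placeholders, not my real ones):

    # /etc/fstab -- one of the GFS filesystems
    /dev/clustervg/gfs01  /mnt/gfs01  gfs  defaults,_netdev  0 0

I had assumed "_netdev" would hold the mount back until the cluster stack
was up; if it's actually the "gfs" init script that's supposed to mount
these at boot, maybe I have something mis-enabled there.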
I did notice a couple of disconcerting things earlier today. First, I was
running "watch clustat" (I prefer it to "clustat -i" because I can watch
the timestamp updating). At one point, clustat crashed as follows:
    Jan 2 15:19:54 node2 kernel: clustat[17720]: segfault at 0000000000000024 rip 0000003629e75bc0 rsp 00007fff18827178 error 4
Fairly shortly thereafter, clustat reported node3 as "Online, Estranged,
rgmanager". Can anyone shed light on what that means?
Google's not telling me much.
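In case it helps, this is the state I plan to capture the next time it
happens (commands as I understand them on CentOS 5; corrections welcome):

    # membership and quorum as cman sees it
    cman_tool status
    cman_tool nodes
    # state of the fence/dlm/gfs groups
    group_tool ls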
At the moment, all three nodes are running CentOS 5.1, with kernel
2.6.18-53.1.4.el5. Can anyone point me in the right direction to resolve
these problems? I wasn't having trouble like this when I was running a
CentOS 4 CS/GFS cluster. Is it possible to downgrade from CentOS 5 CS/GFS
to 4, presumably via a full rebuild of all the nodes? Or should I instead
consider setting up a single node to mount the GFS filesystems and serve
them out, just to get around these fencing issues?
Thanks,
James