On Sat, 9 Dec 2006, Robert Peterson wrote:
Barry Brimer wrote:
Everything was working fine for several months on this cluster. The
cluster software is the latest provided by Red Hat for RHEL4, on the
latest kernel. I am using fence_ilo, and the working node fenced the
problem node. Both nodes run the same versions - the latest Red Hat
provides for RHEL 4.
I've since discovered that another GFS cluster (non-production) had a
similar issue, and rebooting both nodes solved it. With the original
(production) cluster, I am trying to figure out how to get the problem
node back into the cluster without having to unmount the GFS volume
from the remaining working node.
Thank you so much for your input; it is greatly appreciated.
If you have any more suggestions, particularly on how to get my problem
node back into the cluster without unmounting the GFS volume from the
working node, please let me know.
Thanks,
Barry
Hi Barry,
Hm. Could it be that your other nodes are still running the old cman in
memory? This might happen if you update the kernel and cman code with
up2date to the latest, but haven't rebooted or loaded the new cluster
kernel modules yet on the remaining nodes. That would also explain why
a reboot solved the problem in the other cluster you wrote about.

Perhaps you should do "uname -a" on all nodes and make sure they're all
running the same kernel. If the working node(s) and the rebooted node
are both running the same kernel, then they will also be running the
same cluster modules, i.e., cman, in which case your problem might be a
new bug.
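
For example, something like this on each node should confirm what is
actually running (cman_tool ships with the cluster suite; the exact
output varies by version):

  uname -r                     # kernel actually booted
  lsmod | grep -i cman         # is the cman kernel module loaded?
  cman_tool status             # node/version info from the running cman

If the kernel versions differ between nodes, that's the first thing to
straighten out.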
Even if they're not running the same kernel, the cman modules have a
compatible protocol, unless the U1 version of cman is still running on
the working node(s). If the working node turns out to be running the
old U1 cman, even though the new RPMs are installed, you may want to
reboot it in order to pick up the new kernel and cluster modules.
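
One quick way to spot that situation (package names assumed from the
stock RHEL4 cluster channel; adjust if yours differ):

  rpm -q cman cman-kernel      # versions installed on disk
  uname -r                     # kernel actually booted
  cat /proc/cluster/status     # version info from the cman in memory

If what's installed on disk is newer than what /proc/cluster/status
reports, the old module is still loaded, and a reboot is the cleanest
way to pick up the new one. After that, the node should rejoin through
the normal startup order (ccsd, cman, fenced, clvmd, then the gfs
mounts) without the working node having to unmount anything.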
Bob,
Thank you for your responses; they are greatly appreciated. These
systems never ran 4U1, as they were installed with the then-current
release in June. I ended up having to schedule a maintenance window, as
these are production systems. As expected, once both systems were
rebooted, the cluster regained quorum and all services came back
without issue. I was hoping there might have been a way to regain
cluster services without taking down the service that runs on the GFS
volume, but I couldn't find one. Thanks again for your help.
Barry
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster