Sutton, Harry (MSE) wrote:
> The most recent set of patches for RHCS, comprising:
>
> RHBA-2008:0093 dlm-kernel bug fix update
> RHBA-2008:0092 cman-kernel bug fix update
> RHBA-2008:0060 cman bug fix update
> RHBA-2008:0095 gnbd-kernel bug fix update
> RHBA-2008:0096 GFS-kernel bug fix update
> RHSA-2008:0055 Important: kernel security and bug fix update
>
> has resulted in a problem in my two-node (production) cluster. Let me
> explain ;-)
>
> I have a three-node test cluster where I install all patches before
> rolling them into my (two-node) production cluster; I know, I know,
> they're not the same, and that's the only difference I can see in what
> has happened here (a first in two years). In the three-node cluster
> (which, just to complicate things, only had two active nodes at the
> time), I rolled these patches through the two nodes without taking the
> whole cluster down. That is:
>
> 1. Stop all cluster services on Node A. Disable auto-start using
> chkconfig off <cluster-service-name>. Services stop successfully, Node A
> leaves the cluster, Node B continues running all shared cluster services
> (GFS, Fibre-channel-connected shared storage, HP MSA1000).
> 2. Patch Node A, reboot to new kernel, re-install HP-supplied QLogic
> driver, edit /etc/modprobe.conf for failover settings, rebuild initrd
> file for QLogic drivers, reboot, re-enable auto-start of cluster
> services, reboot once more and the cluster re-forms.
> 3. Repeat Steps 1 and 2 for Node B
> 4. Cluster is restored to normal operation, both nodes fully patched.
>
> On my production cluster, which uses a Quorum Disk in place of the third
> node, I completed steps 1 and 2 on Node A, but the cluster did NOT
> reform. cman sends out its advertisement, and I can see that Node B
> receives it (by looking at the tcpdump traces), but Node B never responds.
>
> So: before I take down Node B (which is currently the only one running
> my production services), can someone either (a) explain why the cluster
> is not re-forming, or (b) assure me that by restoring both systems to
> the same patch level, the cluster WILL reform properly? (Which begs the
> question: why did my test cluster survive the patch process and my
> production cluster didn't? Same versions of everything......)
>
> Thanks in advance, and best regards,

I'm pretty certain that even simply rebooting node B will let the
cluster re-form. I've heard of this problem before but never got to the
bottom of it because it seems to be quite rare.

It is almost certainly some state in node B that is preventing it
replying to node A's join requests - I suspect it's a bug to do with
protocol ACK numbers but can't be sure.

Before you do it, would you be so kind as to send me the tcpdumps of the
(non-)conversation, including the HELLO messages from node B. It might
help in tracking it down.

Thanks,
--
Chrissie

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
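
For anyone collecting the traces requested above, here is a minimal sketch of how the capture command might be assembled. It is not taken from the thread. It assumes that cman on this release talks over UDP port 6809 (worth confirming against your own cluster.conf), and the interface name, output file name and the helper build_tcpdump_args are purely hypothetical. The helper only builds the argument list; it does not run tcpdump, which needs root.

```python
import shlex

# Assumption: cman membership traffic uses UDP port 6809 on this release.
# Check your own configuration before relying on this value.
CMAN_UDP_PORT = 6809


def build_tcpdump_args(interface, outfile, port=CMAN_UDP_PORT):
    """Return an argv list for a full-packet capture of cluster traffic.

    interface and outfile are placeholders chosen by the caller.
    """
    return [
        "tcpdump",
        "-i", interface,   # capture on the cluster interconnect
        "-s", "0",         # full packets, no truncation
        "-w", outfile,     # write raw packets for later analysis
        "udp", "and", "port", str(port),
    ]


if __name__ == "__main__":
    # Hypothetical interface and file name, for illustration only.
    args = build_tcpdump_args("eth0", "cman-nodeB.pcap")
    print(" ".join(shlex.quote(a) for a in args))
```

Filtering on the port alone, rather than on one host, should keep both node A's join requests and node B's HELLO messages in the same file, whichever node the capture is taken on.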