Fencing woes

Jan Bruvoll <jan@xxxxxxxxxxx> · Mon, 22 Aug 2005 21:19:52 +0200

Dear list,

I am having problems with a node that I can't get to rejoin the
fence domain. It has been rebooted before, and until now it has always
rejoined the fence domain automatically so that it could pick up the
rest of the dependent services, but not this time. I upgraded the kernel
and cluster/GFS suite (this is a Gentoo system) to
gentoo-sources-2.6.12-r9 and cluster software v1.00.00.

I guess the biggest problem is that I don't know what to actually do to
unfence the node that has been shut out. Since I have set the cluster up
to use manual fencing, I suppose the un-fence command to use is
fence_ack_manual. However, running it only produces a warning about a
missing /tmp/fence_manual.fifo. If I create this fifo by hand before
running the command, the command removes the fifo and still produces the
same warning.

This is what "cman_tool services" prints on the affected node:

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           0   2 join      S-2,2,1
[]

I can't find any documentation anywhere on the values in the "Code"
column - any pointers there?

The cluster has 6 members: one "file server" and five "clients". Excerpt
from cluster.conf follows:

<?xml version="1.0"?>
<cluster name="nbs-sc-1" config_version="1">

  <cman></cman>

  <dlm></dlm>

  <clusternodes>
    <clusternode name="fs-2" votes="2">
      <fence>
        <method name="single">
          <device name="human" ipaddr="10.42.0.200"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="app-1" votes="1">
      <fence>
        <method name="single">
          <device name="human" ipaddr="10.42.0.202"/>
        </method>
      </fence>
    </clusternode>
    [...]
    <clusternode name="app-5" votes="1">
      <fence>
        <method name="single">
          <device name="human" ipaddr="10.42.0.206"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>

  <fence_devices>
    <device name="human" agent="fence_manual"/>
  </fence_devices>
</cluster>
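
One thing that can be checked mechanically is whether every device named
under a node's <fence> block is actually declared in the device list. A
minimal sketch in Python, assuming the element names exactly as they
appear in my excerpt (fence_devices/device); SAMPLE_CONF is a trimmed
two-node copy of the file, and undeclared_fence_devices is a hypothetical
helper name, not part of the cluster suite. It only checks the file
against itself, not against what the cluster software expects.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the excerpt above: two of the six nodes, element
# names unchanged. Illustrative only, not the full cluster.conf.
SAMPLE_CONF = """<?xml version="1.0"?>
<cluster name="nbs-sc-1" config_version="1">
  <clusternodes>
    <clusternode name="fs-2" votes="2">
      <fence>
        <method name="single">
          <device name="human" ipaddr="10.42.0.200"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="app-1" votes="1">
      <fence>
        <method name="single">
          <device name="human" ipaddr="10.42.0.202"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fence_devices>
    <device name="human" agent="fence_manual"/>
  </fence_devices>
</cluster>
"""


def undeclared_fence_devices(conf_text):
    """Return (node name, device name) pairs whose device is not declared.

    Assumes the layout used in the excerpt: devices are declared under
    <fence_devices> and referenced under clusternode/fence/method.
    """
    root = ET.fromstring(conf_text)
    declared = {d.get("name") for d in root.findall("./fence_devices/device")}
    missing = []
    for node in root.findall("./clusternodes/clusternode"):
        for dev in node.findall("./fence/method/device"):
            if dev.get("name") not in declared:
                missing.append((node.get("name"), dev.get("name")))
    return missing


if __name__ == "__main__":
    # An empty list means every referenced device name is declared.
    print(undeclared_fence_devices(SAMPLE_CONF))
```

For the trimmed sample this reports no dangling references, so at least
the "human" device name is used consistently within the file.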

I also found this in dmesg. Is it important?
SM: process_reply invalid id=0 nodeid=4
SM: process_reply invalid id=0 nodeid=1
SM: process_reply invalid id=0 nodeid=2
SM: process_reply invalid id=0 nodeid=6
SM: process_reply invalid id=0 nodeid=5

Any help or pointers to more information would be most appreciated. I
have read through everything I could find on the Internet without becoming
much wiser, and as things stand I can't upgrade single servers
in my cluster without taking down the whole group - which is hardly useful.

Thanks in advance for any assistance!

Best regards
Jan Bruvoll

--

Linux-cluster@xxxxxxxxxx
http://www.redhat.com/mailman/listinfo/linux-cluster