Re: RHEL 4.7 fenced fails -- stuck join state: S-2,2,1

RHEL support pointed me to a document suggesting this may be an implementation issue:

http://kbase.redhat.com/faq/docs/DOC-5935

"DNS is not a reliable way to get name resolution for the cluster. All cluster nodes must be defined in /etc/hosts with the name that matches cluster.conf and uname -n."



But we have used a local DNS service on all of our hosts for years, and we leave /etc/hosts alone with only the localhost entry in it.  Our servers have multiple bonded NICs, so I put the cluster "hosts" on their own private domain / IP range, and each clusternode entry in cluster.conf is added simply as: acropolis, cerberus, rycon, and solaria.  I let DNS (and reverse DNS) resolve those names, i.e.,

/etc/resolv.conf
search blade  ccc.cluster  bidmc.harvard.edu  bidn.caregroup.org
nameserver 127.0.0.1

$ host acropolis
acropolis.blade has address 192.168.2.1
$ host cerberus
cerberus.blade has address 192.168.2.4
$ host rycon
rycon.blade has address 192.168.2.11
$ host solaria
solaria.blade has address 192.168.2.12
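
For context, the clusternode entries in cluster.conf look roughly like this (a from-memory sketch; fence methods, votes, and the rest of the file are omitted here):

<clusternodes>
        <clusternode name="acropolis"/>
        <clusternode name="cerberus"/>
        <clusternode name="rycon"/>
        <clusternode name="solaria"/>
</clusternodes>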
 
root@acropolis [~]$ netstat -a | grep 6809
udp    0    0 acropolis.blade:6809    *:*
udp    0    0 192.168.255.255:6809    *:*
 
[root@cerberus ~]# netstat -a | grep 6809
udp    0    0 cerberus.blade:6809     *:*
udp    0    0 192.168.255.255:6809    *:*
 
[root@solaria ~]# netstat -a | grep 6809
udp    0    0 solaria.blade:6809      *:*
udp    0    0 192.168.255.255:6809    *:*


... even though each of those servers' $( uname -n ) names has .bidmc.harvard.edu (for the corporate LAN-facing NICs) appended to it.  Is this REALLY a cause for concern?  If so, could it introduce a failure (if not at join time, then during some later event)?  Any feedback is welcome!
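
In case it helps anyone compare, this is roughly the check I run on each node (nothing beyond the commands already shown above; the cluster.conf path is the stock one):

uname -n                                      # the FQDN with .bidmc.harvard.edu appended
grep clusternode /etc/cluster/cluster.conf    # the short names from cluster.conf
host $(uname -n)                              # how the FQDN resolves
host acropolis                                # how the short name resolves (192.168.2.1 above)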


On Tue, 2009-08-11 at 10:55 -0400, Robert Hurst wrote:
Simple 4-node cluster; two nodes have had a GFS shared home directory mounted for over a month.  Today I wanted to mount /home on a 3rd node, so:

# service fenced start                [failed]

Weird.  Checking /var/log/messages shows:

Aug 11 10:19:06 cerberus kernel: Lock_Harness 2.6.9-80.9.el4_7.10 (built Jan 22 2009 18:39:16) installed
Aug 11 10:19:06 cerberus kernel: GFS 2.6.9-80.9.el4_7.10 (built Jan 22 2009 18:39:32) installed
Aug 11 10:19:06 cerberus kernel: GFS: Trying to join cluster "lock_dlm", "ccc_cluster47:home"
Aug 11 10:19:06 cerberus kernel: Lock_DLM (built Jan 22 2009 18:39:18) installed
Aug 11 10:19:06 cerberus kernel: lock_dlm: fence domain not found; check fenced
Aug 11 10:19:06 cerberus kernel: GFS: can't mount proto = lock_dlm, table = ccc_cluster47:home, hostdata =

# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           0   2 join      S-2,2,1
[]

So, a fenced process is now hung:

root     28302  0.0  0.0  3668  192 ?        Ss   10:19   0:00 fenced -t 120 -w

Q: Any idea how to "recover" from this state, without rebooting?

The other two servers are unaffected by this (thankfully) and show normal operations:

$ cman_tool services

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[1 12]

DLM Lock Space:  "home"                              5   5 run       -
[1 12]

GFS Mount Group: "home"                              6   6 run       -
[1 12]

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
