Re: RHEL 4.7 fenced fails -- stuck join state: S-2,2,1

RHEL support pointed me to a document suggesting this may be an implementation issue:

http://kbase.redhat.com/faq/docs/DOC-5935

"DNS is not a reliable way to get name resolution for the cluster. All cluster nodes must be defined in /etc/hosts with the name that matches cluster.conf and uname -n."



But we have used a local DNS service on all of our hosts for years, and we leave /etc/hosts alone with only the localhost entry in it.  Our servers have multiple bonded NICs, so I put the cluster "hosts" on their own private domain / IP range, and each clusternode entry in cluster.conf is added simply as: acropolis, cerberus, rycon, and solaria.  I let DNS (and reverse DNS) resolve those names, i.e.,

/etc/resolv.conf
search blade  ccc.cluster  bidmc.harvard.edu  bidn.caregroup.org
nameserver 127.0.0.1

$ host acropolis
acropolis.blade has address 192.168.2.1
$ host cerberus
cerberus.blade has address 192.168.2.4
$ host rycon
rycon.blade has address 192.168.2.11
$ host solaria
solaria.blade has address 192.168.2.12
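
For context, the clusternode entries in cluster.conf look roughly like this (a from-memory sketch; fence methods, votes, and the rest of the file are omitted here):

<clusternodes>
        <clusternode name="acropolis"/>
        <clusternode name="cerberus"/>
        <clusternode name="rycon"/>
        <clusternode name="solaria"/>
</clusternodes>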
 
root@acropolis [~]$ netstat -a | grep 6809
udp    0    0 acropolis.blade:6809    *:*
udp    0    0 192.168.255.255:6809    *:*
 
[root@cerberus ~]# netstat -a | grep 6809
udp    0    0 cerberus.blade:6809     *:*
udp    0    0 192.168.255.255:6809    *:*
 
[root@solaria ~]# netstat -a | grep 6809
udp    0    0 solaria.blade:6809      *:*
udp    0    0 192.168.255.255:6809    *:*


... even though each of those servers' $( uname -n ) names has .bidmc.harvard.edu (for the corporate LAN-facing NICs) appended to it.  Is this REALLY a cause for concern?  If so, could it introduce a failure (if not at join time, then during some later event)?  Any feedback is welcome!
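
In case it helps anyone compare, this is roughly the check I run on each node (nothing beyond the commands already shown above; the cluster.conf path is the stock one):

uname -n                                      # the FQDN with .bidmc.harvard.edu appended
grep clusternode /etc/cluster/cluster.conf    # the short names from cluster.conf
host $(uname -n)                              # how the FQDN resolves
host acropolis                                # how the short name resolves (192.168.2.1 above)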


On Tue, 2009-08-11 at 10:55 -0400, Robert Hurst wrote:
Simple 4-node cluster; two nodes have had a GFS shared home directory mounted for over a month.  Today I wanted to mount /home on a 3rd node, so:

# service fenced start                [failed]

Weird.  Checking /var/log/messages shows:

Aug 11 10:19:06 cerberus kernel: Lock_Harness 2.6.9-80.9.el4_7.10 (built Jan 22 2009 18:39:16) installed
Aug 11 10:19:06 cerberus kernel: GFS 2.6.9-80.9.el4_7.10 (built Jan 22 2009 18:39:32) installed
Aug 11 10:19:06 cerberus kernel: GFS: Trying to join cluster "lock_dlm", "ccc_cluster47:home"
Aug 11 10:19:06 cerberus kernel: Lock_DLM (built Jan 22 2009 18:39:18) installed
Aug 11 10:19:06 cerberus kernel: lock_dlm: fence domain not found; check fenced
Aug 11 10:19:06 cerberus kernel: GFS: can't mount proto = lock_dlm, table = ccc_cluster47:home, hostdata =

# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           0   2 join      S-2,2,1
[]

So, a fenced process is now hung:

root     28302  0.0  0.0  3668  192 ?        Ss   10:19   0:00 fenced -t 120 -w

Q: Any idea how to "recover" from this state, without rebooting?

The other two servers are unaffected by this (thankfully) and show normal operations:

$ cman_tool services

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[1 12]

DLM Lock Space:  "home"                              5   5 run       -
[1 12]

GFS Mount Group: "home"                              6   6 run       -
[1 12]

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
