http://kbase.redhat.com/faq/docs/DOC-5935
"DNS is not a reliable way to get name resolution for the cluster. All cluster nodes must be defined in /etc/hosts with the name that matches cluster.conf and uname -n."
But we have used a local DNS service on all of our hosts for years and leave /etc/hosts alone, with only the localhost entry in it. Our servers have multiple bonded NICs, so I gave the cluster interfaces their own private domain and IP range, and each clusternode entry in cluster.conf uses just the short name: acropolis, cerberus, rycon, and solaria. I let DNS (and reverse DNS) resolve those names, i.e.,
/etc/resolv.conf:
search blade ccc.cluster bidmc.harvard.edu bidn.caregroup.org
nameserver 127.0.0.1
$ host acropolis
acropolis.blade has address 192.168.2.1
$ host cerberus
cerberus.blade has address 192.168.2.4
$ host rycon
rycon.blade has address 192.168.2.11
$ host solaria
solaria.blade has address 192.168.2.12
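Reverse lookups resolve as well; one example (output approximate, shown for illustration):

$ host 192.168.2.1
1.2.168.192.in-addr.arpa domain name pointer acropolis.blade.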
root@acropolis [~]$ netstat -a | grep 6809
udp 0 0 acropolis.blade:6809 *:*
udp 0 0 192.168.255.255:6809 *:*
[root@cerberus ~]# netstat -a | grep 6809
udp 0 0 cerberus.blade:6809 *:*
udp 0 0 192.168.255.255:6809 *:*
[root@solaria ~]# netstat -a | grep 6809
udp 0 0 solaria.blade:6809 *:*
udp 0 0 192.168.255.255:6809 *:*
... even though each of those servers' $( uname -n ) has .bidmc.harvard.edu (for the corporate LAN-facing NICs) appended to it. Is this REALLY a cause for concern? If so, could it introduce a failure (if not at join time, then during some later event)? Any feedback is welcome!
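For reference, the clusternode entries in cluster.conf really are just the short names, along these lines (votes and fencing sections trimmed to "..."; the exact attributes here are from memory, not a paste):

<clusternodes>
        <clusternode name="acropolis" votes="1">
                <fence>...</fence>
        </clusternode>
        <clusternode name="cerberus" votes="1">
                <fence>...</fence>
        </clusternode>
        <clusternode name="rycon" votes="1">
                <fence>...</fence>
        </clusternode>
        <clusternode name="solaria" votes="1">
                <fence>...</fence>
        </clusternode>
</clusternodes>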
On Tue, 2009-08-11 at 10:55 -0400, Robert Hurst wrote:
Simple 4-node cluster; two nodes have had a GFS-shared home directory mounted for over a month. Today, I wanted to mount /home on a third node, so:
# service fenced start [failed]
Weird. Checking /var/log/messages shows:
Aug 11 10:19:06 cerberus kernel: Lock_Harness 2.6.9-80.9.el4_7.10 (built Jan 22 2009 18:39:16) installed
Aug 11 10:19:06 cerberus kernel: GFS 2.6.9-80.9.el4_7.10 (built Jan 22 2009 18:39:32) installed
Aug 11 10:19:06 cerberus kernel: GFS: Trying to join cluster "lock_dlm", "ccc_cluster47:home"
Aug 11 10:19:06 cerberus kernel: Lock_DLM (built Jan 22 2009 18:39:18) installed
Aug 11 10:19:06 cerberus kernel: lock_dlm: fence domain not found; check fenced
Aug 11 10:19:06 cerberus kernel: GFS: can't mount proto = lock_dlm, table = ccc_cluster47:home, hostdata =
# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           0   2 join      S-2,2,1
[]
So, a fenced process is now hung:
root 28302 0.0 0.0 3668 192 ? Ss 10:19 0:00 fenced -t 120 -w
Q: Any idea how to "recover" from this state, without rebooting?
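(What I am tempted to try, but have not yet, is something like the following; fence_tool is from the fence package, and I am not certain this is safe while the other nodes still have GFS mounted:)

# kill 28302            # the hung 'fenced -t 120 -w' shown above
# fence_tool join       # have fenced (re)join the default fence domain
# cman_tool services    # confirm "Fence Domain" moves from join to run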
The other two servers are unaffected by this (thankfully) and show normal operations:
$ cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[1 12]
DLM Lock Space:  "home"                              5   5 run       -
[1 12]
GFS Mount Group: "home"                              6   6 run       -
[1 12]
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster