How to configure a cluster to remain up in the event of node failure

I have a 3-node GFS cluster; if I hard-reset one node, that node hangs on
startup, although the rest of the cluster seems to return to normal.
Nodes: node2, node3, node4
Each node has 1 vote, and a qdisk has 2 votes.
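
For reference, the vote-related part of my cluster.conf is roughly like this
(a trimmed sketch; fence devices and qdisk heuristics are omitted, and the
interval/tko values here are only illustrative):

<cman expected_votes="5"/>
<clusternodes>
  <clusternode name="node2" nodeid="2" votes="1"/>
  <clusternode name="node3" nodeid="3" votes="1"/>
  <clusternode name="node4" nodeid="4" votes="1"/>
</clusternodes>
<quorumd interval="1" tko="10" votes="2" device="/dev/sda5"/>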

If I reset node3, GFS on node2 and node4 is blocked while node3
restarts. First question: is there a configuration that will allow the
cluster to continue operating while one node is down? My quorum is 3 and
the total votes are 4 while node3 is restarting, but my GFS mount points
are inaccessible until the cman services come back up on node3.
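
(To spell out the arithmetic as I understand it: expected votes = 3 nodes x 1
vote + 2 for the qdisk = 5, so quorum = floor(5/2) + 1 = 3. With node3 down
the remaining votes are 1 + 1 + 2 = 4, which is still >= 3, so the cluster
should stay quorate - yet the GFS mounts block anyway.)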

Secondly, when node3 restarts, it hangs while trying to remount the GFS file systems. The console shows:
Starting cman
Mounting configfs...done
Starting ccsd...done
Starting cman...done
Starting daemons...done
Starting fencing...done
                   OK
qdiskd        OK

"Mounting other file systems..." OK

Mounting GFS filesystems: GFS 0.1.1-7.el5 installed
Trying to join cluster "lock_dlm","jemdevcluster:cache1"
dlm: Using TCP for communications
dlm: connecting to 2
dlm: got connection to 2
dlm: connecting to 2
dlm: got connection from 4

After that, the system just hangs.

From nodes 2 and 4 I can run cman_tool, and everything shows that the
cluster is up, except for some services:
[root@node2 cache1]# cman_tool services
type             level name     id       state
fence            0     default  00010004 none
[2 3 4]
dlm              1     cache1   00010003 none
[2 3 4]
dlm              1     storage  00030003 none
[2 4]
gfs              2     cache1   00000000 none
[2 3 4]
gfs              2     storage  00020003 none
[2 4]

[root@node2 cache1]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   0   M      0   2008-08-12 16:11:46  /dev/sda5
   2   M    336   2008-08-12 16:11:12  node2
   3   M    352   2008-08-12 16:44:31  node3
   4   M    344   2008-08-12 16:11:12  node4

I have two GFS partitions:
[root@node4 CentOS]# grep gfs /etc/fstab
/dev/sda1       /gfs/cache1       gfs     defaults        0 0
/dev/sda2       /gfs/storage      gfs     defaults        0 0
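
(As I understand it, these are mounted at boot by the gfs init script, i.e.
effectively:

  service gfs start

which walks /etc/fstab and mounts each gfs entry - that appears to be the
step where node3 hangs.)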


At this point, I am unable to unmount /gfs/cache1 from either of my remaining
nodes (node2 or node4) - it just hangs. I can unmount /gfs/storage with no
problem.
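
For example, on node2 a plain

  umount /gfs/cache1

just sits there and never returns, while the same command for /gfs/storage
completes immediately.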

Is there something I am overlooking? Any and all advice welcome :)

Regards,
Brett

