I think I found a problem with the way it starts up... See just below
the startup output for more info...

On Tue, Aug 12, 2008 at 4:59 PM, Brett Cave <brettcave@xxxxxxxxx> wrote:
> With a 3-node gfs1 cluster, if I hard reset 1 node, it hangs on
> startup, although the cluster seems to return to normal.
> Nodes: node2, node3, node4
> Each node has 1 vote, and a qdisk has 2 votes.
>
> If I reset node3, gfs on node2 and node4 is blocked while node3
> restarts. First question: is there a config that will allow the
> cluster to continue operating while 1 node is down? My quorum is 3 and
> total votes is 4 while node3 is restarting, but my gfs mountpoints are
> inaccessible until my cman services start up on node3.
>
> Secondly, when node3 restarts, it hangs when trying to remount the gfs
> file systems:
>
> Starting cman
> Mounting configfs...done
> Starting ccsd...done
> Starting cman...done
> Starting daemons...done
> Starting fencing...done
> OK
> qdiskd OK
>
> "Mounting other file systems..." OK
>
> Mounting GFS filesystems: GFS 0.1.1-7.el5 installed
> Trying to join cluster "lock_dlm","jemdevcluster:cache1"
> dlm: Using TCP for communications
> dlm: connecting to 2
> dlm: got connection to 2
> dlm: connecting to 2
> dlm: got connection from 4

Could this be the problem? When GFS is set to auto-start via chkconfig,
it first tries to connect to 2, gets a connection, and then tries to
connect to 2 again. It then gets a connection from 4, and hangs.

However, if I run "chkconfig --level 3 gfs off" and then run "service
gfs start" once the system has booted, I get:

dlm: connecting to 2
dlm: got connection from 2
dlm: connecting to 4
dlm: got connection from 4

and the gfs mountpoints are mounted. This works exactly as expected -
gfs mounts, and the cluster is back to normal. (A rough sketch of this
manual-start sequence is included below.)

This means that, for some reason, when gfs starts as an automatic boot
service it doesn't connect to the nodes properly - it tries to connect
to node2 twice, rather than to node2 and then node4 as it should. Why
would it be doing this, and where would I start troubleshooting
something like this?

> After that, the system just hangs.
>
> From nodes 2 & 4, I can run cman_tool, and everything shows that the
> cluster is up, except for some services:
>
> [root@node2 cache1]# cman_tool services
> type             level name       id       state
> fence            0     default    00010004 none
> [2 3 4]
> dlm              1     cache1     00010003 none
> [2 3 4]
> dlm              1     storage    00030003 none
> [2 4]
> gfs              2     cache1     00000000 none
> [2 3 4]
> gfs              2     storage    00020003 none
> [2 4]
>
> [root@node2 cache1]# cman_tool nodes
> Node  Sts   Inc   Joined               Name
>    0   M      0   2008-08-12 16:11:46  /dev/sda5
>    2   M    336   2008-08-12 16:11:12  node2
>    3   M    352   2008-08-12 16:44:31  node3
>    4   M    344   2008-08-12 16:11:12  node4
>
> I have 2 gfs partitions:
>
> [root@node4 CentOS]# grep gfs /etc/fstab
> /dev/sda1    /gfs/cache1     gfs    defaults    0 0
> /dev/sda2    /gfs/storage    gfs    defaults    0 0
>
> At this point, I am unable to unmount /gfs/cache1 from any of my nodes
> (node2 or node4) - it just hangs. I can unmount storage with no
> problem.
>
> Is there something I am overlooking? Any and all advice welcome :)
>
> Regards,
> Brett

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
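
For reference, a minimal sketch of the manual-start workaround described
above, assuming the stock CentOS 5 cman/gfs init scripts as in the post;
the quorum check via cman_tool status is an added suggestion, not
something taken from the message:

# Keep the gfs init script from mounting GFS at boot (runlevel 3 is the
# one mentioned in the post; adjust to suit).
chkconfig --level 3 gfs off

# After the node has rebooted and rejoined the cluster, confirm it is
# quorate before mounting anything.
cman_tool status | grep -i -e quorum -e votes

# Run the same init script by hand; it mounts the gfs entries listed in
# /etc/fstab.
service gfs start

Note this only works around the boot-ordering symptom; it does not
explain why the automatic start tries to connect to node 2 twice.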