Thought I had it worked out, but things are still not working 100%.

Setup: a 3-node GFS cluster; each node has 1 vote and the quorum disk has 2 votes. The cluster is up and running with no problems. I then reboot one node. For troubleshooting purposes I have turned gfs off in the default startup so I can start it manually (cman and qdiskd are still started automatically). The nodes are 2, 3 and 4; node2 is the one being restarted.

The logs all show the node leaving successfully. Quorum is 3, expected votes 5, total votes 4 (once the node has shut down). Node2 restarts, and cman and qdiskd start up. At this point the cluster services show everything back to normal. The output from cman_tool status is the same on all 3 nodes, with no errors (abbreviated here):

# cman_tool status
Config Version: 5
Nodes: 3
Expected votes: 5
Total votes: 5
Quorum: 3
Active subsystems: 7

However, when I run "service gfs start" (or try to mount my first GFS volume), it just hangs. The logs on node2 show the following:

Aug 13 19:51:42 blade2 gfs_controld[2825]: retrieve_plocks: ckpt open error 12 cache1
Aug 13 19:51:42 blade2 kernel: GFS 0.1.19-7.el5 (built Nov 12 2007 14:43:37) installed
Aug 13 19:51:42 blade2 kernel: Trying to join cluster "lock_dlm", "jemdevcluster:cache1"
Aug 13 19:51:42 blade2 kernel: dlm: Using TCP for communications
Aug 13 19:51:42 blade2 kernel: dlm: got connection from 3
Aug 13 19:51:42 blade2 kernel: dlm: connecting to 4
Aug 13 19:51:42 blade2 kernel: dlm: got connection from 4
Aug 13 19:51:42 blade2 kernel: dlm: connecting to 4

At this point mount.gfs just hangs. Restarting node2 makes the same thing happen over and over, and I am not able to get the two GFS volumes mounted. Nodes 3 & 4 can still access the filesystem, however. After a second reboot, my logs show:

Aug 13 20:13:08 blade2 qdiskd[2873]: <info> Node 3 is the master
Aug 13 20:13:09 blade2 gfs_controld[2834]: retrieve_plocks: ckpt open error 12 cache1
Aug 13 20:13:09 blade2 kernel: GFS 0.1.19-7.el5 (built Nov 12 2007 14:43:37) installed
Aug 13 20:13:09 blade2 kernel: Trying to join cluster "lock_dlm", "jemdevcluster:cache1"
Aug 13 20:13:09 blade2 kernel: dlm: Using TCP for communications
Aug 13 20:13:09 blade2 kernel: dlm: connecting to 3
Aug 13 20:13:09 blade2 kernel: dlm: got connection from 3
Aug 13 20:13:09 blade2 kernel: dlm: connecting to 3
Aug 13 20:13:09 blade2 kernel: dlm: got connection from 4

Both 3 & 4 show the following in their logs:

Aug 13 20:14:17 blade4 openais[2554]: [CLM ] Members Joined:
Aug 13 20:14:17 blade4 openais[2554]: [CLM ] r(0) ip(192.168.70.102)
Aug 13 20:14:17 blade4 openais[2554]: [SYNC ] This node is within the primary component and will provide service.
Aug 13 20:14:17 blade4 openais[2554]: [TOTEM] entering OPERATIONAL state.
Aug 13 20:14:17 blade4 openais[2554]: [CLM ] got nodejoin message 192.168.70.102
Aug 13 20:14:17 blade4 openais[2554]: [CLM ] got nodejoin message 192.168.70.103
Aug 13 20:14:17 blade4 openais[2554]: [CLM ] got nodejoin message 192.168.70.104
Aug 13 20:14:17 blade4 openais[2554]: [CPG ] got joinlist message from node 4
Aug 13 20:14:17 blade4 openais[2554]: [CPG ] got joinlist message from node 3
Aug 13 20:14:33 blade4 kernel: dlm: connecting to 2

Surely node2 should connect to 3, get a connection from 3, then connect to 4 and get a connection from 4? Could this possibly be a GFS bug?

Brett
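
P.S. To spell the reproduction out as commands, this is all I am doing on node2 (gfs is the only init script I have taken out of the default runlevels; cman and qdiskd still start at boot):

# chkconfig gfs off     # done once, so gfs no longer starts at boot
# reboot                # cman and qdiskd come up automatically;
                        # cman_tool status then looks clean on all 3 nodes
# service gfs start     # hangs here; mounting either GFS volume by hand
                        # hangs in mount.gfs the same way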
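For reference, the quorum-related part of my cluster.conf looks roughly like the following. I am reconstructing it from memory, so the fence sections are omitted and the quorumd attributes (interval, tko, label) are placeholders, and I am writing blade3 for node 3 since only blade2 and blade4 appear in the logs above. The votes are as described, though: 1 per node plus 2 for the quorum disk gives expected_votes="5" and a quorum of 3, which is why the cluster stays quorate at 4 votes with one node down.

<cluster name="jemdevcluster" config_version="5">
  <cman expected_votes="5"/>
  <clusternodes>
    <clusternode name="blade2" nodeid="2" votes="1"/>
    <clusternode name="blade3" nodeid="3" votes="1"/>
    <clusternode name="blade4" nodeid="4" votes="1"/>
  </clusternodes>
  <!-- quorum disk worth 2 votes; attributes here are placeholders -->
  <quorumd interval="1" tko="10" votes="2" label="jemdev_qdisk"/>
</cluster>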
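While mount.gfs is hung, these are the commands I am using to inspect the state (suggestions on what to look for in their output are welcome):

# cman_tool services      # state of the fence, dlm and gfs groups
# group_tool -v           # the same groups, with their event/state flags
# group_tool dump gfs     # gfs_controld's internal debug buffer, which shows
                          # the retrieve_plocks / ckpt messages in context
# ps ax | grep mount.gfs  # confirms the mount helper is still stuck

Incidentally, if I am reading the AIS error codes correctly, "ckpt open error 12" would be SA_AIS_ERR_NOT_EXIST, i.e. gfs_controld cannot find the plock checkpoint for cache1 when node2 rejoins, though I may be wrong about that.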