Hi Marc,

thanks for your answer. Last night I set 'gfs_tool settune glock_purge 50'.
After putting some stress on the cluster, one node has now failed; GFS is
still running, but very, very slowly (it may collapse soon).
/proc/cluster/services tells me that the nodes are all still OK, and the
failed node doesn't get fenced (normally, fencing works well in this
environment). But I can't ssh into the node, and I can't log in on the
console (it still lets me type, but it doesn't react). The main jobs of the
cluster went into 'D' state.

This is what 'gfs_tool counters /global/home' gives me (this command also
hung, but after restarting the failed node it worked):

[root@dtm ~]# gfs_tool counters /global/home

                                  locks 23532
                             locks held 10774
                           freeze count 0
                          incore inodes 10764
                       metadata buffers 4
                        unlinked inodes 1
                              quota IDs 2
                     incore log buffers 0
                         log space used 0.10%
              meta header cache entries 0
                     glock dependencies 0
                 glocks on reclaim list 0
                              log wraps 2
                   outstanding LM calls 0
                  outstanding BIO calls 0
                       fh2dentry misses 0
                       glocks reclaimed 421095
                         glock nq calls 2196642
                         glock dq calls 2051456
                   glock prefetch calls 171529
                          lm_lock calls 345420
                        lm_unlock calls 155518
                           lm callbacks 515665
                     address operations 4431
                      dentry operations 377484
                      export operations 0
                        file operations 535012
                       inode operations 892178
                       super operations 1196275
                          vm operations 0
                        block I/O reads 126680
                       block I/O writes 133808

Any suggestions?
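For reference, here is roughly how I am applying and checking the tunable on
each node (the mount point /global/home is from our setup; as far as I can
tell, settune values do not survive a remount, so this has to be re-run after
every mount):

    # apply the tunable (let gfs_scand purge ~50% of unused glocks);
    # not persistent, so re-run after each mount of the filesystem:
    gfs_tool settune /global/home glock_purge 50

    # confirm the value took:
    gfs_tool gettune /global/home | grep glock_purge

    # keep an eye on lock growth under load, as you suggested:
    watch -n 10 'gfs_tool counters /global/home | egrep "locks|glocks reclaimed"'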
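And this is the quick check I am using for the jobs in 'D' state (a one-liner
along the lines of your "ps axf" suggestion; the awk pattern is just my own
shorthand):

    # list processes stuck in uninterruptible sleep (state D, including D+, D< etc.):
    ps -eo state,pid,args | awk '$1 ~ /^D/'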
Regards,
Sebastian

Marc Grimme wrote:
> Hi Sebastian,
> you might also want to have a look here:
> http://www.open-sharedroot.org/Members/marc/blog/blog-on-gfs/
> I collected some information there about the problem you've hit (it must be
> that problem).
> Next time you should also look at the console of every node; you should see
> some interesting messages there.
> Use the glock_purge gfs_tool option, it will help, and always keep an eye on
> the gfs_tool counters, especially on the locks.
>
> BTW: the "unable to obtain lock" is only rgmanager complaining, as a side
> effect, about not being able to obtain a lock. The real problem is that a
> new lock ID cannot be obtained in time.
>
> Regards, Marc.
>
> On Sunday 19 August 2007 11:53:39 you wrote:
>
>> Hi Marc!
>>
>> Thanks for your help. As I have restarted everything now, I can't check
>> this. I will do so when it crashes again (I will run some tests now). I
>> realised that one node did hang with a kernel panic. Attached is the
>> screenshot.
>>
>> regards
>> sebastian
>>
>> Marc Grimme wrote:
>>
>>> Hello Sebastian,
>>> what do gfs_tool counters on the fs tell you?
>>> And ps axf? Do you have a lot of "D" processes?
>>> Regards, Marc.
>>>
>>> On Sunday 19 August 2007 02:06:30 Sebastian Walter wrote:
>>>
>>>> Dear list,
>>>>
>>>> this is the tragic story of my cluster running rhel/csgfs 4u5: in
>>>> general the cluster runs fine, but when I increase the load to a
>>>> certain level (heavy I/O), it collapses. About 20% of the nodes crash
>>>> (not reacting any more, but with no sign of a kernel panic); the
>>>> others can't access the GFS resource.
>>>> GFS is set up as an rgmanager service with a failover domain for each
>>>> node (the same problem also exists when mounting via /etc/fstab).
>>>>
>>>> Who is willing to provide a happy end?
>>>>
>>>> Thanks, Sebastian
>>>>
>>>> This is what /var/log/messages gives me (on nearly all nodes):
>>>>
>>>> Aug 18 04:39:06 compute-0-2 clurgmgrd[4225]: <err> #49: Failed getting
>>>> status for RG gfs-2
>>>>
>>>> and e.g.
>>>>
>>>> Aug 18 04:45:38 compute-0-6 clurgmgrd[9074]: <err> #50: Unable to
>>>> obtain cluster lock: Connection timed out
>>>>
>>>> [root@compute-0-3 ~]# cat /proc/cluster/status
>>>> Protocol version: 5.0.1
>>>> Config version: 53
>>>> Cluster name: dtm
>>>> Cluster ID: 741
>>>> Cluster Member: Yes
>>>> Membership state: Cluster-Member
>>>> Nodes: 10
>>>> Expected_votes: 11
>>>> Total_votes: 10
>>>> Quorum: 6
>>>> Active subsystems: 8
>>>> Node name: compute-0-3
>>>> Node ID: 4
>>>> Node addresses: 10.1.255.252
>>>>
>>>> [root@compute-0-6 ~]# cat /proc/cluster/services
>>>> Service          Name                  GID LID State    Code
>>>> Fence Domain:    "default"               3   2 recover  4 -
>>>> [1 2 6 10 9 8 3 7 4 11]
>>>> DLM Lock Space:  "clvmd"                 7   3 recover  0 -
>>>> [1 2 6 10 9 8 3 7 4 11]
>>>> DLM Lock Space:  "Magma"                12   5 recover  0 -
>>>> [1 2 6 10 9 8 3 7 4 11]
>>>> DLM Lock Space:  "homeneu"              17   6 recover  0 -
>>>> [10 9 8 7 2 3 6 4 1 11]
>>>> GFS Mount Group: "homeneu"              18   7 recover  0 -
>>>> [10 9 8 7 2 3 6 4 1 11]
>>>> User:            "usrm::manager"        11   4 recover  0 -
>>>> [1 2 6 10 9 8 3 7 4 11]
>>>>
>>>> [root@compute-0-10 ~]# cat /proc/cluster/dlm_stats
>>>> DLM stats (HZ=1000)
>>>>
>>>> Lock operations:    4036
>>>> Unlock operations:  2001
>>>> Convert operations: 1862
>>>> Completion ASTs:    7898
>>>> Blocking ASTs:        52
>>>>
>>>> Lockqueue      num  waittime  ave
>>>> WAIT_RSB      3778     28862    7
>>>> WAIT_CONV       75       482    6
>>>> WAIT_GRANT    2171      7235    3
>>>> WAIT_UNLOCK    153      1606   10
>>>> Total         6177     38185    6
>>>>
>>>> [root@compute-0-10 ~]# cat /proc/cluster/sm_debug
>>>> sevent state 7
>>>> 02000012 sevent state 9
>>>> 00000003 remove node 5 count 10
>>>> 01000011 remove node 5 count 10
>>>> 0100000c remove node 5 count 10
>>>> 01000007 remove node 5 count 10
>>>> 02000012 remove node 5 count 10
>>>> 0300000b remove node 5 count 10
>>>> 00000003 recover state 0

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster