Hello Sebastian,

what do the gfs_tool counters on the fs tell you? And what does ps axf show?
Do you have a lot of "D" processes?

Regards
Marc.

On Sunday 19 August 2007 02:06:30 Sebastian Walter wrote:
> Dear list,
>
> this is the tragic story of my cluster running rhel/csgfs 4u5: the
> cluster in general is running fine, but when I increase the load to a
> certain level (heavy I/O), it collapses. About 20% of the nodes crash
> (not reacting any more, but no sign of kernel panic), the others can't
> access the gfs resource.
> Gfs is set up as a rgmanager service with failover domain for each node
> (same problem also exists when mounting via /etc/fstab).
>
> Who is willing to provide a happy end?
>
> Thanks, Sebastian
>
> This is what /var/log/messages gives me (on nearly all nodes):
> Aug 18 04:39:06 compute-0-2 clurgmgrd[4225]: <err> #49: Failed getting
> status for RG gfs-2
> and e.g.
> Aug 18 04:45:38 compute-0-6 clurgmgrd[9074]: <err> #50: Unable to obtain
> cluster lock: Connection timed out
>
> [root@compute-0-3 ~]# cat /proc/cluster/status
> Protocol version: 5.0.1
> Config version: 53
> Cluster name: dtm
> Cluster ID: 741
> Cluster Member: Yes
> Membership state: Cluster-Member
> Nodes: 10
> Expected_votes: 11
> Total_votes: 10
> Quorum: 6
> Active subsystems: 8
> Node name: compute-0-3
> Node ID: 4
> Node addresses: 10.1.255.252
>
> [root@compute-0-6 ~]# cat /proc/cluster/services
> Service          Name                 GID LID State     Code
> Fence Domain:    "default"              3   2 recover 4 -
> [1 2 6 10 9 8 3 7 4 11]
> DLM Lock Space:  "clvmd"                7   3 recover 0 -
> [1 2 6 10 9 8 3 7 4 11]
> DLM Lock Space:  "Magma"               12   5 recover 0 -
> [1 2 6 10 9 8 3 7 4 11]
> DLM Lock Space:  "homeneu"             17   6 recover 0 -
> [10 9 8 7 2 3 6 4 1 11]
> GFS Mount Group: "homeneu"             18   7 recover 0 -
> [10 9 8 7 2 3 6 4 1 11]
> User:            "usrm::manager"       11   4 recover 0 -
> [1 2 6 10 9 8 3 7 4 11]
>
> [root@compute-0-10 ~]# cat /proc/cluster/dlm_stats
> DLM stats (HZ=1000)
>
> Lock operations:    4036
> Unlock operations:  2001
> Convert operations: 1862
> Completion ASTs:    7898
> Blocking ASTs:        52
>
> Lockqueue     num  waittime  ave
> WAIT_RSB     3778     28862    7
> WAIT_CONV      75       482    6
> WAIT_GRANT   2171      7235    3
> WAIT_UNLOCK   153      1606   10
> Total        6177     38185    6
>
> [root@compute-0-10 ~]# cat /proc/cluster/sm_debug
> sevent state 7
> 02000012 sevent state 9
> 00000003 remove node 5 count 10
> 01000011 remove node 5 count 10
> 0100000c remove node 5 count 10
> 01000007 remove node 5 count 10
> 02000012 remove node 5 count 10
> 0300000b remove node 5 count 10
> 00000003 recover state 0
>
> --
> Linux-cluster mailing list
> Linux-cluster@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/
http://www.open-sharedroot.org/

ATIX Informationstechnologie und Consulting AG
Einsteinstr. 10
85716 Unterschleissheim
Germany

Phone: +49-89 452 3538-0
Fax: +49-89 990 1766-0

Register court: Amtsgericht Muenchen
Registration number: HRB 168930
VAT ID: DE209485962
Executive board: Marc Grimme, Mark Hlawatschek, Thomas Merz (chair)
Chairman of the supervisory board: Dr. Martin Buss
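
For the question about "D" processes above, the count can be taken per node from ps output. The following is only a minimal sketch, assuming a procps-style ps whose STAT column marks uninterruptible sleep with a leading "D"; the function names count_d_state and list_d_state are made up for illustration and are not part of any cluster tool.

```shell
#!/bin/sh
# Sketch only: helpers for spotting processes stuck in uninterruptible
# sleep ("D" state). Both read ps-style output on stdin, where the first
# column is STAT and the first line is the header.
# The function names are hypothetical, chosen for this example.

# Print the number of processes whose STAT starts with "D".
count_d_state() {
    awk 'NR > 1 && $1 ~ /^D/ { n++ } END { print n + 0 }'
}

# Print the header plus only the "D" state lines, for closer inspection.
list_d_state() {
    awk 'NR == 1 || $1 ~ /^D/'
}

# On a live node this could be fed like so (not run here):
#   ps axo stat,pid,wchan,args | count_d_state
#   ps axo stat,pid,wchan,args | list_d_state
```

A steadily growing count on the nodes that still respond, with the same wait channel shown for most entries, would suggest processes blocked on I/O or locking rather than a plain CPU overload.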