Hi,

after putting massive load on the cluster, 55% of the nodes died again (after adjusting glock_purge to 50). I don't think (and hope) it's the hardware: normal filesystems cause no problems, and under low load everything runs fine. I will check this, but it will be a more comprehensive task. Maybe I can improve things by tuning the volume better?

Here is what /var/log/messages gives me:

Aug 20 16:24:50 compute-0-10.local clurgmgrd[4283]: <err> #48: Unable to obtain cluster lock: Connection timed out
Aug 20 16:25:04 compute-0-3.local clurgmgrd[4280]: <err> #48: Unable to obtain cluster lock: Connection timed out
Aug 20 16:25:35 compute-0-10.local clurgmgrd[4283]: <err> #50: Unable to obtain cluster lock: Connection timed out
Aug 20 16:25:49 compute-0-3.local clurgmgrd[4280]: <err> #50: Unable to obtain cluster lock: Connection timed out

(These are the errors from the still-running nodes; they are repeated several times.)

"gfs_tool counters /global/home" is blocked and not responding.

Btw, I'm running CentOS 4 Update 5 on all the nodes.

Thanks for any comment.

Regards,
Sebastian

Wendy Cheng wrote:
> Sebastian Walter wrote:
>> This is what /var/log/messages gives me (on nearly all nodes):
>> Aug 18 04:39:06 compute-0-2 clurgmgrd[4225]: <err> #49: Failed getting status for RG gfs-2
>> and e.g.
>> Aug 18 04:45:38 compute-0-6 clurgmgrd[9074]: <err> #50: Unable to obtain cluster lock: Connection timed out
>
> The GFS glock trimming patch *could* help. However, the lock leak *here* is from clurgmgrd (cluster infrastructure), not GFS (filesystem) itself, so these two are different issues. I vaguely recall clurgmgrd did have a bugzilla for this and it was fixed some time ago.
>
> Lon?
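For reference, the glock_purge adjustment mentioned above is done through gfs_tool's settune subcommand. The sketch below only echoes the commands so it is safe to run off-cluster; drop the echoes to apply it on a real node (it assumes the glock-trimming patch is installed, since glock_purge only exists with that patch, and uses /global/home from the post as the mount point):

```shell
#!/bin/sh
# Sketch only: how glock_purge would be inspected and set with gfs_tool.
# Assumes a GFS mount with the glock-trimming patch; /global/home is
# the mount point from the post above.
MNT=/global/home

# Commands are echoed rather than executed so this is safe to run on a
# machine without GFS; remove the echo on a real cluster node.
GETTUNE="gfs_tool gettune $MNT"                  # list current tunables
SETTUNE="gfs_tool settune $MNT glock_purge 50"   # trim 50% of unused glocks per scan
echo "$GETTUNE"
echo "$SETTUNE"
```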
> -- Wendy

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster