On Monday 20 August 2007 18:19:31 Sebastian Walter wrote:
> Hi,
>
> after putting massive load on the cluster, 55% of the nodes died again
> (after adjusting glock_purge to 50). I don't think (and hope) that it's
> the hardware, as normal filesystems cause no problems and running under
> low load also works fine. I will check this, but it will be a more
> comprehensive task. Maybe I can improve things by tuning the volume
> better?
>
> Here is what /var/log/messages gives me:
> Aug 20 16:24:50 compute-0-10.local clurgmgrd[4283]: <err> #48: Unable to
> obtain cluster lock: Connection timed out
> Aug 20 16:25:04 compute-0-3.local clurgmgrd[4280]: <err> #48: Unable to
> obtain cluster lock: Connection timed out
> Aug 20 16:25:35 compute-0-10.local clurgmgrd[4283]: <err> #50: Unable to
> obtain cluster lock: Connection timed out
> Aug 20 16:25:49 compute-0-3.local clurgmgrd[4280]: <err> #50: Unable to
> obtain cluster lock: Connection timed out
> (these are the errors from the still-running nodes; they are repeated
> several times)
>
> gfs_tool counters /global/home is blocked and not responding. Btw, I'm
> running CentOS 4 Update 5 on all the nodes.
>
> Thanks for any comment. Regards,
> Sebastian
>
> Wendy Cheng wrote:
> > Sebastian Walter wrote:
> >>>>>> This is what /var/log/messages gives me (on nearly all nodes):
> >>>>>> Aug 18 04:39:06 compute-0-2 clurgmgrd[4225]: <err> #49: Failed
> >>>>>> getting status for RG gfs-2
> >>>>>> and e.g.
> >>>>>> Aug 18 04:45:38 compute-0-6 clurgmgrd[9074]: <err> #50: Unable to
> >>>>>> obtain cluster lock: Connection timed out
> >
> > The GFS glock trimming patch *could* help. However, the lock leak *here*
> > is from clurgmgrd (the cluster infrastructure), not from GFS (the
> > filesystem) itself, so these are two different issues. I vaguely recall
> > clurgmgrd had a bugzilla for this and it was fixed some time ago.
> >
> > Lon?
> >
> > -- Wendy

Do you also see any messages on the consoles of the nodes? The gfs_tool
counters output would also help from before the problem occurs, so let it
run for a while beforehand to see whether the lock counts increase.

What kind of stress test are you running? I bet it is searching the whole
filesystem. What puzzles me is that glock_purge does not work for you,
whereas it worked for me with exactly the same problems. Did you set it
_AFTER_ the filesystem was mounted?

Regards Marc.

--
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/
http://www.open-sharedroot.org/

**
ATIX Informationstechnologie und Consulting AG
Einsteinstr. 10
85716 Unterschleissheim
Deutschland/Germany

Phone: +49-89 452 3538-0
Fax: +49-89 990 1766-0

Registergericht: Amtsgericht Muenchen
Registernummer: HRB 168930
USt.-Id.: DE209485962
Vorstand: Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.)
Vorsitzender des Aufsichtsrats: Dr. Martin Buss

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
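
For reference, a minimal sketch of what Marc suggests: re-apply the
glock_purge tunable right after the filesystem is mounted (settune values
do not survive a remount, which is why he asks whether it was set after
mounting) and log gfs_tool counters periodically so any steady growth in
lock counts shows up before the nodes hang. The mount point /global/home is
taken from the thread; the log path and 5-minute interval are assumptions,
and the exact counter names printed by gfs_tool differ between GFS versions.

    #!/bin/sh
    # Sketch only: re-apply glock trimming after mount and sample the GFS
    # lock counters periodically.  Log path and interval are assumptions.

    MNT=/global/home               # mount point used in this thread
    LOG=/var/log/gfs-counters.log  # assumed log location
    INTERVAL=300                   # seconds between samples

    # settune values are not persistent, so this must run after every
    # mount (e.g. from an init script that starts after the GFS mounts).
    gfs_tool settune "$MNT" glock_purge 50

    while true; do
        echo "=== `date` ===" >> "$LOG"
        # dump all counters; watch the lock-related lines for steady growth
        gfs_tool counters "$MNT" >> "$LOG" 2>&1
        sleep "$INTERVAL"
    done

Comparing two samples taken a few hours apart under the stress load should
make it obvious whether the lock counts keep climbing, which is the pattern
the glock trimming patch is meant to address.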