On Monday 20 August 2007 18:19:31 Sebastian Walter wrote:
> Hi,
>
> after putting massive load on the cluster, 55% of the nodes died again
> (after adjusting glock_purge to 50). I don't think (and hope) that it's
> the hardware, as normal filesystems cause no problems and running under
> low load also works fine. I will check this, but it will be a more
> comprehensive task. Maybe I can improve things by tuning the volume
> better?
>
> Here is what /var/log/messages gives me:
> Aug 20 16:24:50 compute-0-10.local clurgmgrd[4283]: <err> #48: Unable to
> obtain cluster lock: Connection timed out
> Aug 20 16:25:04 compute-0-3.local clurgmgrd[4280]: <err> #48: Unable to
> obtain cluster lock: Connection timed out
> Aug 20 16:25:35 compute-0-10.local clurgmgrd[4283]: <err> #50: Unable to
> obtain cluster lock: Connection timed out
> Aug 20 16:25:49 compute-0-3.local clurgmgrd[4280]: <err> #50: Unable to
> obtain cluster lock: Connection timed out
> (these are the errors from the still-running nodes; they are repeated
> several times)
>
> gfs_tool counters /global/home is blocked and not responding. Btw, I'm
> running CentOS 4 Update 5 on all the nodes.
>
> Thanks for any comment. Regards,
> Sebastian
>
> Wendy Cheng wrote:
> > Sebastian Walter wrote:
> >>>>>> This is what /var/log/messages gives me (on nearly all nodes):
> >>>>>> Aug 18 04:39:06 compute-0-2 clurgmgrd[4225]: <err> #49: Failed
> >>>>>> getting status for RG gfs-2
> >>>>>> and e.g.
> >>>>>> Aug 18 04:45:38 compute-0-6 clurgmgrd[9074]: <err> #50: Unable to
> >>>>>> obtain cluster lock: Connection timed out
> >
> > The GFS glock trimming patch *could* help. However, the lock leak *here*
> > is from clurgmgrd (the cluster infrastructure), not from GFS (the
> > filesystem) itself, so these are two different issues. I vaguely recall
> > clurgmgrd had a bugzilla for this and it was fixed some time ago.
> >
> > Lon?
> >
> > -- Wendy

Do you also see any messages on the consoles of the nodes? The gfs_tool
counters output would also help from before the problem occurs, so let it
run for a while beforehand to see whether the lock counts increase.

What kind of stress test are you running? I bet it is searching the whole
filesystem. What puzzles me is that glock_purge does not work for you,
whereas it worked for me with exactly the same problems. Did you set it
_AFTER_ the filesystem was mounted?

Regards Marc.

--
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/
http://www.open-sharedroot.org/

**
ATIX Informationstechnologie und Consulting AG
Einsteinstr. 10
85716 Unterschleissheim
Deutschland/Germany

Phone: +49-89 452 3538-0
Fax: +49-89 990 1766-0

Registergericht: Amtsgericht Muenchen
Registernummer: HRB 168930
USt.-Id.: DE209485962
Vorstand: Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.)
Vorsitzender des Aufsichtsrats: Dr. Martin Buss

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
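
For reference, a minimal sketch of what Marc suggests: re-apply the
glock_purge tunable right after the filesystem is mounted (settune values
do not survive a remount, which is why he asks whether it was set after
mounting) and log gfs_tool counters periodically so any steady growth in
lock counts shows up before the nodes hang. The mount point /global/home is
taken from the thread; the log path and 5-minute interval are assumptions,
and the exact counter names printed by gfs_tool differ between GFS versions.

    #!/bin/sh
    # Sketch only: re-apply glock trimming after mount and sample the GFS
    # lock counters periodically.  Log path and interval are assumptions.

    MNT=/global/home               # mount point used in this thread
    LOG=/var/log/gfs-counters.log  # assumed log location
    INTERVAL=300                   # seconds between samples

    # settune values are not persistent, so this must run after every
    # mount (e.g. from an init script that starts after the GFS mounts).
    gfs_tool settune "$MNT" glock_purge 50

    while true; do
        echo "=== `date` ===" >> "$LOG"
        # dump all counters; watch the lock-related lines for steady growth
        gfs_tool counters "$MNT" >> "$LOG" 2>&1
        sleep "$INTERVAL"
    done

Comparing two samples taken a few hours apart under the stress load should
make it obvious whether the lock counts keep climbing, which is the pattern
the glock trimming patch is meant to address.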