Re: GFS filesystem "hang" with cluster-1.03.00

Ramon van Alteren <ramon@xxxxxxxxxxxxx> · Fri, 20 Oct 2006 15:34:33 +0200

Hi Josef,

Josef Whiter wrote:
In your previous message you asked about the latency.  With gfs1, there is a
certain amount of latency involved with stat calls, so ls -al, du, and df all
take a great deal of time comparatively.  With these calls you first have to
traverse the FS in order to cache all the information about the files, so every
lookup requires a lock on each directory leading to the file, and then a lock
on the file itself in order to read its information off the disk.  That's just
the lookup; then we have to grab a shared lock again to get the stat
information from the file.  Each lock, mind you, requires exporting the lock to
all of the other nodes so they know about it, and getting confirmation back on
that lock.  So for every stat lookup you are looking at, at the very least, 2
separate locks: one for the lookup and then one for the stat.  Every subsequent
call is faster because the lookups no longer require the locks to look up the
file, as the inode information is now cached, so we just need the lock for the
file.
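
As a rough illustration of the lock counting described above, here is a toy
model in Python. It only counts lock requests for a cold and a warm inode
cache; it does not talk to GFS or the lock manager, and the function name
locks_for_stat and the example path are made up for this sketch.

```python
# Toy model only: counts cluster-wide lock requests for stat() calls as
# described in the text. Nothing here touches a real filesystem.

def locks_for_stat(path, cached_inodes):
    """Return the number of lock requests needed to stat `path`.

    cached_inodes is a set of paths whose inode information is already
    cached on this node; it is updated in place.
    """
    parts = [p for p in path.split("/") if p]
    locks = 0
    current = ""
    # Lookup phase: one lock per directory on the way to the file, then one
    # for the file itself, unless that component is already cached.
    for part in parts:
        current += "/" + part
        if current not in cached_inodes:
            locks += 1
            cached_inodes.add(current)
    # Stat phase: a shared lock on the file is needed every time.
    locks += 1
    return locks


cache = set()
cold = locks_for_stat("/data/run1/tmpfile", cache)  # nothing cached yet
warm = locks_for_stat("/data/run1/tmpfile", cache)  # lookups now cached
print(cold, warm)
```

In this model a file directly under the root costs 2 locks on the first stat
(one lookup, one stat), matching the "at the very least 2" figure, and every
repeat stat costs 1.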

Yes, this was the previous exchange. In the same exchange I was advised
by Wendy Cheng to switch from bonnie++, which we were previously testing
with, to iozone, because it would avoid such multiple lock calls (fewer
stat calls on files).

My test-run last night started at 4AM with 4 different iozone processes 
using a temp file in different directories on the same filesystem / 
logical volume.
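
For reference, a test of this shape could be launched roughly as in the sketch
below. This is an assumption-laden example, not the exact invocation used
here: the mount point /mnt/gfs, the directory names, and the choice of
iozone's -a (automatic mode) are all hypothetical; only -f (temp file name)
corresponds to the setup described above.

```python
import os
import subprocess


def build_iozone_commands(mount_point, workers=4):
    """Build one iozone command per worker, each with its own directory."""
    commands = []
    for index in range(workers):
        # Hypothetical layout: <mount_point>/iozone-<n>/iozone.tmp
        directory = os.path.join(mount_point, "iozone-%d" % index)
        temp_file = os.path.join(directory, "iozone.tmp")
        # -a: automatic mode (assumed), -f: temp file to use for the test
        commands.append((directory, ["iozone", "-a", "-f", temp_file]))
    return commands


def run_all(mount_point, workers=4):
    """Start all workers concurrently and return their exit codes."""
    processes = []
    for directory, argv in build_iozone_commands(mount_point, workers):
        os.makedirs(directory, exist_ok=True)
        processes.append(subprocess.Popen(argv))
    return [process.wait() for process in processes]
```

Calling run_all("/mnt/gfs") on one node would start four concurrent runs, each
writing to a separate file in a separate directory of the same filesystem.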

AFAIK this would avoid the problem you mention above?

All iozone processes were in D (uninterruptible sleep) state by the time
I woke up and had a look at 8AM this morning.
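
A minimal sketch for listing processes in D state on a node, assuming a Linux
/proc filesystem. It reads the state field from /proc/<pid>/stat; the function
names are made up for this example.

```python
import os


def parse_stat_line(line):
    """Split a /proc/<pid>/stat line into (pid, command name, state)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first "(" and the last ")".
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    pid = int(line[:open_idx].strip())
    comm = line[open_idx + 1:close_idx]
    state = line[close_idx + 1:].split()[0]
    return pid, comm, state


def processes_in_d_state(proc_root="/proc"):
    """Return (pid, command) pairs for processes in uninterruptible sleep."""
    if not os.path.isdir(proc_root):
        return []
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat_line(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited in the meantime, or entry unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in processes_in_d_state():
        print(pid, comm)
```

Running this a few times in a row shows whether the same iozone PIDs stay
stuck in D state or only pass through it briefly.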

I would expect gfs to deal with this gracefully and return a performance 
metric on multiple writes because:

* Not in same directory so no dir-lock to pass around
* Different files

If gfs_tool counters is stuck, you'll want to get a couple of instances of
sysrq-t from all nodes and see if you can tell who is hanging, whether it's in
D state or whether the particular process isn't making progress.
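
A small sketch of how a couple of sysrq-t dumps could be captured on one node,
assuming root access and the standard Linux /proc/sysrq-trigger interface.
The function names, the number of dumps, and the interval are arbitrary
choices for illustration.

```python
import time


def trigger_task_dump(trigger_path="/proc/sysrq-trigger"):
    """Ask the kernel to dump all task states (the sysrq 't' command)."""
    with open(trigger_path, "w") as trigger:
        trigger.write("t")


def capture_dumps(count=3, interval_seconds=10.0,
                  trigger_path="/proc/sysrq-trigger"):
    """Trigger `count` task dumps, pausing between them."""
    for index in range(count):
        trigger_task_dump(trigger_path)
        if index < count - 1:
            time.sleep(interval_seconds)
    return count
```

The dumps end up in the kernel log (readable with dmesg, for example).
Comparing the stack traces of the same process across dumps shows whether it
is making progress or sitting in the same place.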

I have interrupted the processes by now and found one node that was hanging.
I'm still completely clueless as to what is causing this.

Any pointers and/or ideas on where to look, test cases to run, or any info
at all that might be helpful in finding the cause of the problem would
be much appreciated.

I'm also opening a similar case with the Coraid support department to
see if they have something to say about it.

Thanx,

Ramon
