Re: Clearing a glock

 Hi Steve,
More information. The offending file was /usr/local/bin/python2.6, which we use heavily on all nodes. Our general use is through the #! mechanism in .py files. Does this offer any clues as to why we had all of those processes waiting on a lock with no holder?

-- scooter

On 07/27/2010 06:18 AM, Steven Whitehouse wrote:
Hi,

On Tue, 2010-07-27 at 05:57 -0700, Scooter Morris wrote:
On 7/27/10 5:15 AM, Steven Whitehouse wrote:
Hi,

If you translate a5b67f into decimal, then that gives you the number of
the inode which is causing the problem. It looks to me as if you have too
many processes trying to access this one inode from multiple nodes.
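
For reference, something like the rough sketch below would do the
translation and then look the inode up on disk. The mount point is only a
placeholder, and bear in mind that find has to stat everything it visits,
so it will simply join the queue if the cluster really is stuck on that
inode:

    #!/usr/bin/python
    # Rough sketch only: translate the hex number from the glock dump into
    # a decimal inode number and search the mounted filesystem for it.
    # "/mnt/gfs2" is just a placeholder mount point, not a value from this
    # thread.
    import subprocess

    glock_hex = "a5b67f"
    inum = int(glock_hex, 16)          # 0xa5b67f == 10860159
    print("inode number: %d" % inum)

    # find(1) will hang here too if access to that inode is blocked.
    subprocess.call(["find", "/mnt/gfs2", "-xdev", "-inum", str(inum)])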

It's not obvious from the traces that anything is actually stuck, but if
you take two traces, a few seconds or minutes apart, then it should
become more obvious whether the cluster is making progress or whether it
really is stuck,

Steve.


Hi Steve,
      As always, thanks for the reply.  The cluster was, indeed, truly
stuck.  I rebooted it last night to clear everything out.  I never did
figure out which file was the problem.  I did a find -inum, but the find
hung too.  By that point the load average was up to 80 and climbing.
Any ideas on how to avoid this?  Are there tunable values I need to
increase to allow more processes to access any individual inode?

The LA includes processes waiting for glocks, since that is an
uninterruptible wait, so that's where most of the LA came from.
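
For what it's worth, a quick way to confirm that is to count tasks in
uninterruptible sleep; a rough sketch that only parses /proc, nothing
GFS2-specific:

    #!/usr/bin/python
    # Rough sketch: count processes in state 'D' (uninterruptible sleep),
    # which is what inflates the load average while tasks wait on a glock.
    import os

    d_state = 0
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            f = open("/proc/%s/stat" % pid)
            stat = f.read()
            f.close()
            # the state character is the first field after the ')' that
            # closes the command name
            if stat[stat.rfind(")") + 2] == "D":
                d_state += 1
        except (IOError, IndexError):
            pass  # the process exited between listdir() and open()
    print("%d processes in uninterruptible sleep (D state)" % d_state)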

The find is unlikely to work while the cluster is stuck, since if it
does find the culprit inode, that inode is, by definition, already stuck,
so the find process would just join the queue. If a find fails to discover
the inode once the cluster has been rebooted and is back working again,
then it was probably a temporary file of some kind.

There are no tunable values, since the limit on access to the inode is
simply how many times the hardware can sync and invalidate a given inode
and pass its glock on to another node in a given time period. It is a
limitation of the hardware and of the architecture of the filesystem.

There are a few things which can probably be improved in due course, but
in the main the best way to avoid problems with congestion on inodes is
just to be careful about the access pattern across nodes.

That said, if it really was completely stuck, that is a real bug and not
the result of the access pattern, since the code is designed such that
progress should always be made, even if it's painfully slow,

Steve.



--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster

