Re: [Linux-cluster] GFS 2 node hang in rm test

David Teigland <teigland@xxxxxxxxxx> · Sat, 4 Dec 2004 13:56:54 +0800

On Fri, Dec 03, 2004 at 03:08:00PM -0800, Daniel McNeil wrote:
> I ran my test script
> (http://developer.osdl.org/daniel/gfs_tests/test.sh) overnight.
> 
> It ran 17 test runs before hanging in a rm during a 2 node test.
> The /gfs_stripe5 is mounted on cl030 and cl031.
> 
> process 28723 (rm) on cl030 is hung.
> process 29693 (updatedb) is also hung on cl030.
> 
> process 29537 (updatedb) is hung on cl031.
> 
> I have stack traces and lockdump and lock debug output
> from both nodes here:
> 
> http://developer.osdl.org/daniel/GFS/gfs_2node_rm_hang/

There are evidently atime updates happening.  That's not necessarily a
killer, but you might check the system times on your nodes to verify
they're the same (or were the same when the tests were running).  If
they're different enough, then atime updates could take a portion of the
blame.
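
A quick way to check for clock skew is sketched below.  It is only
a sketch: the helper name clock_skew is made up for illustration, and the
commented usage assumes passwordless ssh to the two nodes named in this
thread.

```shell
#!/bin/sh
# Sketch: print the absolute difference, in seconds, between two epoch
# timestamps.  clock_skew is a hypothetical helper, not part of any
# cluster tool.
clock_skew() {
    a=$1
    b=$2
    if [ "$a" -ge "$b" ]; then
        echo $((a - b))
    else
        echo $((b - a))
    fi
}

# Possible usage (assumes passwordless ssh to both nodes):
#   t30=$(ssh cl030 date +%s)
#   t31=$(ssh cl031 date +%s)
#   echo "skew: $(clock_skew "$t30" "$t31") seconds"
```

The two ssh calls run one after the other, so a second or so of apparent
skew is just measurement noise.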

> How does one know which node is the master for a lock?

echo "name of lockspace" >> /proc/cluster/dlm_locks
cat /proc/cluster/dlm_locks > dlm_locks.txt

gives a list of all the dlm locks that node knows about.
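
If you want to collect this from every node, the two commands above can
be wrapped roughly as follows.  This is a sketch: the function name, the
DLM_LOCKS_FILE override (there only so it can be tried against an
ordinary file) and the lockspace name in the usage line are all
assumptions for illustration.

```shell
#!/bin/sh
# Sketch: select a lockspace and save the resulting lock list to a file.
# dump_dlm_locks and DLM_LOCKS_FILE are hypothetical names; the default
# path is the proc file used in the commands above.
dump_dlm_locks() {
    lockspace=$1
    out=$2
    proc=${DLM_LOCKS_FILE:-/proc/cluster/dlm_locks}
    # Same two steps as above: write the lockspace name, then read back.
    echo "$lockspace" >> "$proc" && cat "$proc" > "$out"
}

# Possible usage on each node (the lockspace name here is a placeholder):
#   dump_dlm_locks example_lockspace "dlm_locks.$(uname -n).txt"
```

Comparing the dumps from both nodes for the same resource name is then
the way to see how each node views that lock.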

> I have included the output from /proc/cluster/lock_dlm/debug,
> but I have no idea what that data is.  Any hints?

It's a small circular buffer of lock_dlm activity.  It can be helpful when
debugging some problems, but usually only if you're looking for
something specific and recent.
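
Since the buffer is small and circular, older entries get overwritten, so
it is worth saving a copy as soon as a hang is noticed.  A minimal sketch,
with a made-up function name and a LOCK_DLM_DEBUG override that exists
only so the function can be tried against an ordinary file:

```shell
#!/bin/sh
# Sketch: copy the lock_dlm debug buffer to a file named after the node
# and the current time, then print the file name.
# snapshot_lock_dlm_debug and LOCK_DLM_DEBUG are hypothetical names.
snapshot_lock_dlm_debug() {
    src=${LOCK_DLM_DEBUG:-/proc/cluster/lock_dlm/debug}
    dir=${1:-.}
    out="$dir/lock_dlm_debug.$(uname -n).$(date +%s)"
    cat "$src" > "$out" && echo "$out"
}

# Possible usage:
#   snapshot_lock_dlm_debug /tmp
```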

-- 
Dave Teigland  <teigland@xxxxxxxxxx>