On Fri, Dec 03, 2004 at 03:08:00PM -0800, Daniel McNeil wrote:
> I ran my test script
> (http://developer.osdl.org/daniel/gfs_tests/test.sh) overnight.
>
> It ran 17 test runs before hanging in a rm during a 2 node test.
> The /gfs_stripe5 is mounted on cl030 and cl031.
>
> process 28723 (rm) on cl030 is hung.
> process 29693 (updatedb) is also hung on cl030.
>
> process 29537 (updatedb) is hung on cl031.
>
> I have stack traces and lockdump and lock debug output
> from both nodes here:
>
> http://developer.osdl.org/daniel/GFS/gfs_2node_rm_hang/

There's evidently atime updates happening.  That's not necessarily a
killer, but you might check the system times on your nodes to verify
they're the same (or were the same when the tests were running).  If
they're different enough, then atime updates could take a portion of
the blame.

> How does one know which node is the master for a lock?

  echo "name of lockspace" >> /proc/cluster/dlm_locks
  cat /proc/cluster/dlm_locks > dlm_locks.txt

gives a list of all the dlm locks that node knows about.

> I have include the output from /proc/cluster/lock_dlm/debug,
> but I have no idea what that data is.  Any hints?

It's a small circular buffer of lock_dlm activity.  It can be helpful
debugging some problems, but usually not unless you're looking for
something specific and recent.

-- 
Dave Teigland  <teigland@xxxxxxxxxx>
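
For running the dlm_locks dump on each node, the two commands above could be wrapped in a small shell function. This is only a sketch: the function name dump_dlm_locks, its arguments, and the optional third argument (an alternate path to use in place of /proc/cluster/dlm_locks, handy for trying it out on a machine without the cluster modules loaded) are hypothetical and not part of any cluster tool. The lockspace name in the usage line is a placeholder as well; substitute the real lockspace name.

```shell
# Hypothetical helper: select a lockspace, then save the lock list.
# usage: dump_dlm_locks LOCKSPACE OUTFILE [PROCFILE]
dump_dlm_locks() {
    lockspace=$1
    outfile=$2
    # Assumption: default path is the one given in the mail above.
    procfile=${3:-/proc/cluster/dlm_locks}

    # Refuse to run without a lockspace name and an output file.
    if [ -z "$lockspace" ] || [ -z "$outfile" ]; then
        echo "usage: dump_dlm_locks LOCKSPACE OUTFILE [PROCFILE]" >&2
        return 2
    fi

    # Tell the dlm which lockspace to report on ...
    echo "$lockspace" >> "$procfile" || return 1
    # ... then read back the locks this node knows about.
    cat "$procfile" > "$outfile"
}
```

Run on each node in turn, for example:

  dump_dlm_locks "name of lockspace" dlm_locks.$(hostname).txt

so the per-node dumps can be compared side by side.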