I ran my test script (http://developer.osdl.org/daniel/gfs_tests/test.sh) overnight. It ran 17 test runs before hanging in an rm during a 2-node test. /gfs_stripe5 is mounted on cl030 and cl031.

Process 28723 (rm) on cl030 is hung. Process 29693 (updatedb) is also hung on cl030. Process 29537 (updatedb) is hung on cl031.

I have stack traces, lockdump, and lock debug output from both nodes here:

http://developer.osdl.org/daniel/GFS/gfs_2node_rm_hang/

gfs_tool/decipher_lockstate_dump cl030.lockdump shows:

Glock (inode[2], 39860)
  gl_flags =
  gl_count = 6
  gl_state = shared[3]
  lvb_count = 0
  object = yes
  aspace = 2
  reclaim = no
  Holder
    owner = 28723
    gh_state = shared[3]
    gh_flags = atime[9]
    error = 0
    gh_iflags = promote[1] holder[6] first[7]
  Waiter2
    owner = none[-1]
    gh_state = unlocked[0]
    gh_flags = try[0]
    error = 0
    gh_iflags = demote[2] alloced[4] dealloc[5]
  Waiter3
    owner = 29693
    gh_state = shared[3]
    gh_flags = any[3]
    error = 0
    gh_iflags = promote[1]
  Inode: busy

gfs_tool/decipher_lockstate_dump cl031.lockdump shows:

Glock (inode[2], 39860)
  gl_flags = lock[1]
  gl_count = 5
  gl_state = shared[3]
  lvb_count = 0
  object = yes
  aspace = 1
  reclaim = no
  Request
    owner = 29537
    gh_state = exclusive[1]
    gh_flags = local_excl[5] atime[9]
    error = 0
    gh_iflags = promote[1]
  Waiter3
    owner = 29537
    gh_state = exclusive[1]
    gh_flags = local_excl[5] atime[9]
    error = 0
    gh_iflags = promote[1]
  Inode: busy

Is there any documentation on what these fields are? What is the difference between Waiter2 and Waiter3?

If I understand this correctly, the updatedb (29537) on cl031 is trying to go from shared -> exclusive, while the rm (28723) on cl030 is holding the glock shared and the updatedb (29693) on cl030 is waiting to get the glock shared.

Questions:

How does one know which node is the master for a lock?
Shouldn't cl030 know (bast) that the updatedb on cl031 is trying to go shared -> exclusive?
What does the gfs_tool/parse_lockdump script do?
I have included the output from /proc/cluster/lock_dlm/debug, but I have no idea what that data is. Any hints?

Anything else I can do to debug this further?

Thanks,

Daniel
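
To make the reading of the two dumps above concrete, here is a rough Python sketch that cross-references them. It is only an illustration under stated assumptions: it assumes the deciphered dump prints one field per line under each Holder/Request/WaiterN heading, and it applies a deliberately simplified rule (an exclusive request on one node is blocked by a Holder on another node). The function names are hypothetical and are not part of gfs_tool.

```python
import re

# Trimmed copies of the glock entry from each node, as quoted in this mail.
# One field per line is an assumed layout for decipher_lockstate_dump output.
CL030 = """
Glock (inode[2], 39860)
  gl_flags =
  gl_state = shared[3]
  Holder
    owner = 28723
    gh_state = shared[3]
  Waiter2
    owner = none[-1]
    gh_state = unlocked[0]
  Waiter3
    owner = 29693
    gh_state = shared[3]
  Inode: busy
"""
CL031 = """
Glock (inode[2], 39860)
  gl_flags = lock[1]
  gl_state = shared[3]
  Request
    owner = 29537
    gh_state = exclusive[1]
  Waiter3
    owner = 29537
    gh_state = exclusive[1]
  Inode: busy
"""

def parse_dump(text):
    """Map glock name -> {"fields": {...}, "entries": [(kind, fields), ...]}."""
    glocks, glock, target = {}, None, None
    for raw in text.splitlines():
        line = raw.strip()
        head = re.match(r"Glock \((.+)\)$", line)
        if head:
            glock = {"fields": {}, "entries": []}
            glocks[head.group(1)] = glock
            target = glock["fields"]
        elif glock is None:
            continue  # skip anything before the first Glock header
        elif re.fullmatch(r"Holder|Request|Waiter\d+", line):
            # Assumed section headings; fields that follow belong to this entry.
            target = {}
            glock["entries"].append((line, target))
        elif "=" in line:
            key, _, value = line.partition("=")
            target[key.strip()] = value.strip()
    return glocks

def state(value):
    """'shared[3]' -> 'shared'."""
    return value.split("[")[0]

def cross_node_blockers(dumps):
    """Simplified rule: an exclusive request is blocked by a Holder on another node."""
    found = set()
    for wnode, wglocks in dumps.items():
        for name, glock in wglocks.items():
            for kind, req in glock["entries"]:
                if kind == "Holder" or state(req.get("gh_state", "")) != "exclusive":
                    continue
                for hnode, hglocks in dumps.items():
                    if hnode == wnode or name not in hglocks:
                        continue
                    for hkind, holder in hglocks[name]["entries"]:
                        if hkind == "Holder":
                            found.add((wnode, req.get("owner"), hnode,
                                       holder.get("owner"), name))
    return sorted(found)

if __name__ == "__main__":
    dumps = {"cl030": parse_dump(CL030), "cl031": parse_dump(CL031)}
    for wnode, wpid, hnode, hpid, name in cross_node_blockers(dumps):
        print(f"{wnode} pid {wpid} wants exclusive on ({name}); held by {hnode} pid {hpid}")
```

Run against the two entries quoted here, this pairs the exclusive request from 29537 on cl031 with the shared holder 28723 on cl030, which matches the reading above. It says nothing about why the holder is not being released; that still needs the DLM side.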