On Fri, 2004-12-03 at 15:08, Daniel McNeil wrote:
> I ran my test script
> (http://developer.osdl.org/daniel/gfs_tests/test.sh) overnight.
>
> It ran 17 test runs before hanging in a rm during a 2 node test.
> The /gfs_stripe5 is mounted on cl030 and cl031.
>
> process 28723 (rm) on cl030 is hung.
> process 29693 (updatedb) is also hung on cl030.
>
> process 29537 (updatedb) is hung on cl031.
>
> I have stack traces and lockdump and lock debug output
> from both nodes here:
>
> http://developer.osdl.org/daniel/GFS/gfs_2node_rm_hang/
>
>
> gfs_tool/decipher_lockstate_dump cl030.lockdump shows:
>
> Glock (inode[2], 39860)
>   gl_flags =
>   gl_count = 6
>   gl_state = shared[3]
>   lvb_count = 0
>   object = yes
>   aspace = 2
>   reclaim = no
>   Holder
>     owner = 28723
>     gh_state = shared[3]
>     gh_flags = atime[9]
>     error = 0
>     gh_iflags = promote[1] holder[6] first[7]
>   Waiter2
>     owner = none[-1]
>     gh_state = unlocked[0]
>     gh_flags = try[0]
>     error = 0
>     gh_iflags = demote[2] alloced[4] dealloc[5]
>   Waiter3
>     owner = 29693
>     gh_state = shared[3]
>     gh_flags = any[3]
>     error = 0
>     gh_iflags = promote[1]
>   Inode: busy
>
> gfs_tool/decipher_lockstate_dump cl031.lockdump shows:
>
> Glock (inode[2], 39860)
>   gl_flags = lock[1]
>   gl_count = 5
>   gl_state = shared[3]
>   lvb_count = 0
>   object = yes
>   aspace = 1
>   reclaim = no
>   Request
>     owner = 29537
>     gh_state = exclusive[1]
>     gh_flags = local_excl[5] atime[9]
>     error = 0
>     gh_iflags = promote[1]
>   Waiter3
>     owner = 29537
>     gh_state = exclusive[1]
>     gh_flags = local_excl[5] atime[9]
>     error = 0
>     gh_iflags = promote[1]
>   Inode: busy
>
> Is there any documentation on what these fields are?
>
> What is the difference between Waiter2 and Waiter3?
>
> If I understand this correctly, the updatedb (29537) on
> cl031 is trying to go from shared -> exclusive while the
> rm (28723) on cl030 is holding the glock shared and the
> updatedb (29693) on cl030 is waiting to get the glock shared.
>

Looking at the stack traces, what I said above does not make sense.
So now I am really confused.

updatedb should only need the glock shared, since it is only doing a
readdir. But the stack trace of the rm on cl030 shows that it is in
readdir as well. So what does the Request with gh_state = exclusive
mean? It still looks like it is trying to go exclusive, but I cannot
tell why.

Thanks for any help,

Daniel
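
For lining up the two dumps side by side, a rough Python sketch along the following lines may help. It assumes only the text layout visible in the decipher_lockstate_dump output quoted above: a "Glock (type, number)" header, "key = value" lines, and Holder / Request / WaiterN section names. The names parse_glocks and summarize are hypothetical helpers, not part of gfs_tool, and the sketch says nothing about what the fields mean.

```python
import re

# Section names as they appear in the quoted dumps (Waiter1 is assumed
# by analogy with Waiter2/Waiter3; it does not occur in the output above).
SECTION_NAMES = ("Holder", "Request", "Waiter1", "Waiter2", "Waiter3")

SAMPLE = """\
Glock (inode[2], 39860)
  gl_flags =
  gl_count = 6
  gl_state = shared[3]
  Holder
    owner = 28723
    gh_state = shared[3]
  Waiter2
    owner = none[-1]
    gh_state = unlocked[0]
  Waiter3
    owner = 29693
    gh_state = shared[3]
  Inode: busy
"""


def parse_glocks(text):
    """Parse deciphered lockdump text into a list of glock dicts."""
    glocks = []
    current = None  # glock currently being filled in
    target = None   # dict that receives the next "key = value" line
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = re.match(r"Glock \((.+), (\d+)\)$", line)
        if header:
            current = {
                "type": header.group(1),
                "number": int(header.group(2)),
                "fields": {},
                "entries": [],
            }
            glocks.append(current)
            target = current["fields"]
        elif current is None:
            continue  # ignore anything before the first Glock header
        elif line in SECTION_NAMES:
            entry = {"kind": line}
            current["entries"].append(entry)
            target = entry
        elif "=" in line:
            key, _, value = line.partition("=")
            target[key.strip()] = value.strip()
        # lines such as "Inode: busy" carry no "=" and are skipped
    return glocks


def summarize(glocks):
    """Print one line per glock and one per holder/request/waiter."""
    for g in glocks:
        state = g["fields"].get("gl_state")
        print(f"{g['type']} {g['number']}: gl_state={state}")
        for e in g["entries"]:
            print(f"  {e['kind']:8} owner={e.get('owner')} "
                  f"gh_state={e.get('gh_state')}")


if __name__ == "__main__":
    summarize(parse_glocks(SAMPLE))
```

Running the same two functions over cl030.lockdump and cl031.lockdump (after deciphering) and comparing the entries for glock 39860 would show at a glance which pid holds or requests which state on each node.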