Hi Wendy,

thanks for your answer! The system is still in the deadlock state, so I can
hopefully collect all the information you need :-) (you'll find the crash
output below)

Thanks,

Mark

> > we have the following deadlock situation:
> >
> > 2 node cluster consisting of node1 and node2.
> > /usr/local is placed on a GFS filesystem mounted on both nodes.
> > The lock manager is dlm.
> > We are using RHEL4u4.
> >
> > An strace of "ls -l /usr/local/swadmin/mnx/xml" ends up hanging in
> > lstat("/usr/local/swadmin/mnx/xml",
> >
> > This happens on both cluster nodes.
> >
> > All processes trying to access the directory /usr/local/swadmin/mnx/xml
> > are in "Waiting for IO (D)" state, i.e. the system load is at about 400 ;-)
> >
> > Any ideas ?
>
> Quickly browsing this, it looks to me like the process with pid=5856 got
> stuck. That process held the exclusive lock on the file or directory
> (inode number 627732 - probably /usr/local/swadmin/mnx/xml), so everyone
> else was waiting for it. The faulty process was apparently in the middle
> of obtaining another exclusive lock (and almost got it). We need to know
> where pid=5856 was stuck at that time. If this occurs again, could you
> use "crash" to back trace that process and show us the output ?

Here's the crash output:

crash> bt 5856
PID: 5856   TASK: 10bd26427f0   CPU: 0   COMMAND: "java"
 #0 [10bd20cfbc8] schedule at ffffffff8030a1d1
 #1 [10bd20cfca0] wait_for_completion at ffffffff8030a415
 #2 [10bd20cfd20] glock_wait_internal at ffffffffa018574e
 #3 [10bd20cfd60] gfs_glock_nq_m at ffffffffa01860ce
 #4 [10bd20cfda0] gfs_unlink at ffffffffa019ce41
 #5 [10bd20cfea0] vfs_unlink at ffffffff801889fa
 #6 [10bd20cfed0] sys_unlink at ffffffff80188b19
 #7 [10bd20cff30] filp_close at ffffffff80178e48
 #8 [10bd20cff50] error_exit at ffffffff80110d91
    RIP: 0000002a9593f649  RSP: 0000007fbfffbca0  RFLAGS: 00010206
    RAX: 0000000000000057  RBX: ffffffff8011026a  RCX: 0000002a9cc9c870
    RDX: 0000002ae5989000  RSI: 0000002a962fa3a8  RDI: 0000002ae5989000
    RBP: 0000000000000000  R8:  0000002a9630abb0  R9:  0000000000000ffc
    R10: 0000002a9630abc0  R11: 0000000000000206  R12: 0000000040115700
    R13: 0000002ae23294b0  R14: 0000007fbfffc300  R15: 0000002ae5989000
    ORIG_RAX: 0000000000000057  CS: 0033  SS: 002b
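For reference, this is roughly how the backtrace above and the lock dump
quoted below were collected. The vmlinux path and the lockdump script
arguments are only illustrative and may need adjusting for your installation;
crash needs the kernel-debuginfo package matching the running kernel:

  # attach crash to the live kernel and back trace the stuck pid
  crash /usr/lib/debug/lib/modules/`uname -r`/vmlinux
  crash> bt 5856

  # dump the GFS lock state on each node, then decode and correlate it with
  # the scripts shipped alongside GFS (exact script arguments may differ)
  gfs_tool lockdump /usr/local > lockdump.node1
  decipher_lockstate_dump lockdump.node1 > lockdump.node1.dec
  parse_lockdump lockdump.node1.dec lockdump.node2.dec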
> > a lockdump analysis with the decipher_lockstate_dump and parse_lockdump
> > scripts shows the following output (the whole file is too large for the
> > mailing list):
> >
> > Entries: 101939
> > Glocks:  60112
> > PIDs:    751
> >
> > 4 chain:
> > lockdump.node1.dec Glock (inode[2], 1114343)
> >   gl_flags = lock[1]
> >   gl_count = 5
> >   gl_state = shared[3]
> >   req_gh = yes
> >   req_bh = yes
> >   lvb_count = 0
> >   object = yes
> >   new_le = no
> >   incore_le = no
> >   reclaim = no
> >   aspace = 1
> >   ail_bufs = no
> >   Request
> >     owner = 5856
> >     gh_state = exclusive[1]
> >     gh_flags = try[0] local_excl[5] async[6]
> >     error = 0
> >     gh_iflags = promote[1]
> >   Waiter3
> >     owner = 5856
> >     gh_state = exclusive[1]
> >     gh_flags = try[0] local_excl[5] async[6]
> >     error = 0
> >     gh_iflags = promote[1]
> >   Inode: busy
> > lockdump.node2.dec Glock (inode[2], 1114343)
> >   gl_flags =
> >   gl_count = 2
> >   gl_state = unlocked[0]
> >   req_gh = no
> >   req_bh = no
> >   lvb_count = 0
> >   object = yes
> >   new_le = no
> >   incore_le = no
> >   reclaim = no
> >   aspace = 0
> >   ail_bufs = no
> >   Inode:
> >     num = 1114343/1114343
> >     type = regular[1]
> >     i_count = 1
> >     i_flags =
> >     vnode = yes
> > lockdump.node1.dec Glock (inode[2], 627732)
> >   gl_flags = dirty[5]
> >   gl_count = 379
> >   gl_state = exclusive[1]
> >   req_gh = no
> >   req_bh = no
> >   lvb_count = 0
> >   object = yes
> >   new_le = no
> >   incore_le = no
> >   reclaim = no
> >   aspace = 58
> >   ail_bufs = no
> >   Holder
> >     owner = 5856
> >     gh_state = exclusive[1]
> >     gh_flags = try[0] local_excl[5] async[6]
> >     error = 0
> >     gh_iflags = promote[1] holder[6] first[7]
> >   Waiter2
> >     owner = none[-1]
> >     gh_state = shared[3]
> >     gh_flags = try[0]
> >     error = 0
> >     gh_iflags = demote[2] alloced[4] dealloc[5]
> >   Waiter3
> >     owner = 32753
> >     gh_state = shared[3]
> >     gh_flags = any[3]
> >     error = 0
> >     gh_iflags = promote[1]
> >   [...loads of Waiter3 entries...]
> >   Waiter3
> >     owner = 4566
> >     gh_state = shared[3]
> >     gh_flags = any[3]
> >     error = 0
> >     gh_iflags = promote[1]
> >   Inode: busy
> > lockdump.node2.dec Glock (inode[2], 627732)
> >   gl_flags = lock[1]
> >   gl_count = 375
> >   gl_state = unlocked[0]
> >   req_gh = yes
> >   req_bh = yes
> >   lvb_count = 0
> >   object = yes
> >   new_le = no
> >   incore_le = no
> >   reclaim = no
> >   aspace = 0
> >   ail_bufs = no
> >   Request
> >     owner = 20187
> >     gh_state = shared[3]
> >     gh_flags = any[3]
> >     error = 0
> >     gh_iflags = promote[1]
> >   Waiter3
> >     owner = 20187
> >     gh_state = shared[3]
> >     gh_flags = any[3]
> >     error = 0
> >     gh_iflags = promote[1]
> >   [...loads of Waiter3 entries...]
> >   Waiter3
> >     owner = 10460
> >     gh_state = shared[3]
> >     gh_flags = any[3]
> >     error = 0
> >     gh_iflags = promote[1]
> >   Inode: busy
> > 2 requests
>
> --
> Linux-cluster mailing list
> Linux-cluster@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Gruss / Regards,

Mark Hlawatschek
http://www.atix.de/
http://www.open-sharedroot.org/

** Visit us at CeBIT 2007 in Hannover/Germany **
** in Hall 5, Booth G48/2 (15.-21. of March)  **

ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster