I reran the test with my dlm patch (up_write() before queue_ast()) and Patrick's patch (up_write() before remote_stage()). I am running with SLAB_DEBUG and I am no longer hitting any references to freed memory. Now it looks like I am hitting a dlm hang. This is a two-node hang between cl030 and cl032.

On cl030:

rm            D C1716F9C     0 31078  31054          (NOTLB)
da393b0c 00000086 f5dae710 c1716f9c 000046f3 c1716f9c 00000020 00000004
f5cf2318 55a020b1 000046f3 a5871cab c188e304 00000286 ba5871ca da393b08
c1716f60 00000001 0000fde0 55a24b8b 000046f3 c1b2a090 c1b2a1f8 30313765
Call Trace:
 [<c03dbac4>] wait_for_completion+0xa4/0xe0
 [<f8af2dab>] glock_wait_internal+0x3b/0x270 [gfs]
 [<f8af3316>] gfs_glock_nq+0x86/0x130 [gfs]
 [<f8af3ef6>] gfs_glock_nq_m+0x166/0x1a0 [gfs]
 [<f8ae33af>] do_strip+0x22f/0x5c0 [gfs]
 [<f8ae2f9e>] recursive_scan+0xbe/0x2a0 [gfs]
 [<f8ae3d65>] gfs_shrink+0x3c5/0x490 [gfs]
 [<f8af762d>] inode_dealloc+0x18d/0x2b0 [gfs]
 [<f8af77eb>] inode_dealloc_init+0x9b/0xe0 [gfs]
 [<f8b1a78d>] gfs_unlinked_limit+0x6d/0xd0 [gfs]
 [<f8b0c779>] gfs_unlink+0x39/0x190 [gfs]
 [<c017265a>] vfs_unlink+0x18a/0x220
 [<c01727e8>] sys_unlink+0xf8/0x160
 [<c010537d>] sysenter_past_esp+0x52/0x71

And cl032:

rm            D 00000008     0 23528  23520          (NOTLB)
e81eeb0c 00000086 e81eeafc 00000008 00000001 00000000 00000008 00000004
ce619378 c024c8e2 0000000d bb775aa9 c1aa3188 00000286 ddbbad54 e81eeb08
c170ef60 00000000 000100cc 5e2b3227 00008c7f e2f85050 e2f851b8 30306539
Call Trace:
 [<c03dbac4>] wait_for_completion+0xa4/0xe0
 [<f8af2dab>] glock_wait_internal+0x3b/0x270 [gfs]
 [<f8af3316>] gfs_glock_nq+0x86/0x130 [gfs]
 [<f8af3ef6>] gfs_glock_nq_m+0x166/0x1a0 [gfs]
 [<f8ae33af>] do_strip+0x22f/0x5c0 [gfs]
 [<f8ae2f9e>] recursive_scan+0xbe/0x2a0 [gfs]
 [<f8ae3d65>] gfs_shrink+0x3c5/0x490 [gfs]
 [<f8af762d>] inode_dealloc+0x18d/0x2b0 [gfs]
 [<f8af77eb>] inode_dealloc_init+0x9b/0xe0 [gfs]
 [<f8b1a78d>] gfs_unlinked_limit+0x6d/0xd0 [gfs]
 [<f8b0c779>] gfs_unlink+0x39/0x190 [gfs]
 [<c017265a>] vfs_unlink+0x18a/0x220
 [<c01727e8>] sys_unlink+0xf8/0x160
 [<c010537d>] sysenter_past_esp+0x52/0x71

I ran the decipher and parse scripts from the gfs_tool/ directory and it looks like the problem is on cl032. Here's the parse output:

cl032.ld.decipher

Glock (rgrp[3], 17)
  gl_flags = lock[1] dirty[5]
  gl_count = 6
  gl_state = exclusive[1]
  lvb_count = 1
  object = yes
  aspace = 5
  reclaim = no
  Request
    owner = none[-1]
    gh_state = unlocked[0]
    gh_flags = try[0]
    error = 0
    gh_iflags = demote[2] alloced[4] dealloc[5]
  Waiter2
    owner = none[-1]
    gh_state = unlocked[0]
    gh_flags = try[0]
    error = 0
    gh_iflags = demote[2] alloced[4] dealloc[5]
  Waiter3
    owner = 23528
    gh_state = exclusive[1]
    gh_flags = local_excl[5]
    error = 0
    gh_iflags = promote[1]

Looking at the output from /proc/cluster/dlm_locks, this lock looks interesting:

Resource d6e2a5cc (parent 00000000). Name (len=24) "       3              11"
Local Copy, Master is node 3
Granted Queue
0022031d NL Master:     001b004c
Conversion Queue
Waiting Queue
0036020c -- (EX) Master:     00330164  LQ: 0,0x8

Is there an easy way to know which resource name matches which glock? AFAICT, the glock is waiting for the unlock to happen. The DLM lock (if this is the matching one) is at NL, waiting to be granted EX, but the grant is not happening.

Thoughts? Is my analysis correct?

The full info is available here:
http://developer.osdl.org/daniel/GFS/rm.hang.10dec2004/

Daniel

PS Is this the fun part? :)