I turned on CONFIG_DEBUG_SLAB and was hitting slab corruption. I added a debug magic number to the rw_semaphore struct and then added a BUG_ON() in down_write() and up_write(), and got a stack trace showing that up_write() was referencing freed memory:

EIP is at dlm_unlock_stage2+0x126/0x2a0 [dlm]
eax: 00000025   ebx: e5b34d98   ecx: c0456c0c   edx: 00000286
esi: e5b34d34   edi: e5b2eb28   ebp: ea3f2e64   esp: ea3f2e48
ds: 007b   es: 007b   ss: 0068
Process gfs_glockd (pid: 3693, threadinfo=ea3f2000 task=e8c20ed0)
Stack: e5b2eb28 00000005 00000000 00000000 e5b2eb28 e7ae85a4 e7ae8654 ea3f2ea0
       f8b305dd e5b2eb28 e5b34d34 00000000 000e00a3 00040000 00000000 00000000
       e5b34dcd ffffffea e5b34d34 00000000 e5b40678 e5b33d38 ea3f2ecc f8b4fcd6
Call Trace:
 [<c010626f>] show_stack+0x7f/0xa0
 [<c010641e>] show_registers+0x15e/0x1d0
 [<c010663e>] die+0xfe/0x190
 [<c0105e59>] error_code+0x2d/0x38
 [<f8b305dd>] dlm_unlock+0x2ad/0x3c0 [dlm]
 [<f8b4fcd6>] do_dlm_unlock+0x86/0x120 [lock_dlm]
 [<f8b50138>] lm_dlm_unlock+0x18/0x30 [lock_dlm]
 [<f8af2ba3>] gfs_glock_drop_th+0x93/0x1a0 [gfs]
 [<f8af1e0b>] rq_demote+0xbb/0xe0 [gfs]
 [<f8af1f18>] run_queue+0x88/0xe0 [gfs]
 [<f8af208b>] unlock_on_glock+0x2b/0x40 [gfs]
 [<f8af49d2>] gfs_reclaim_glock+0x132/0x1b0 [gfs]
 [<f8ae46ea>] gfs_glockd+0x11a/0x130 [gfs]
 [<c0103325>] kernel_thread_helper+0x5/0x10
Code: 2a 89 d8 ba ff ff 00 00 f0 0f c1 10 0f 85 4c 12 00 00 ba 9a 3a b4 f8 e8 c9 bf 6e c7 e8 54 b8 ff ff 83 c4 10 31 c0 5b 5e 5f 5d c3 <0f> 0b e8 00 2e 3a b4 f8 eb cc ba 88 3a b4 f8 89 d8 e8 a4 bf 6e

Looking through the code, I found that a call to

	queue_ast(lkb, AST_COMP | AST_DEL, 0);

will lead to process_asts(), which will free the dlm_rsb. So there is a race where the rsb can be freed BEFORE we do the

	up_write(&rsb->res_lock);

The fix is simple: do the up_write() before the queue_ast().
Here's a patch that fixed this problem:

--- cluster.orig/dlm-kernel/src/locking.c	2004-12-09 15:23:13.789834384 -0800
+++ cluster/dlm-kernel/src/locking.c	2004-12-09 15:24:51.809742940 -0800
@@ -687,8 +687,13 @@ void dlm_lock_stage3(struct dlm_lkb *lkb
 	lkb->lkb_retstatus = -EAGAIN;
 	if (lkb->lkb_lockqueue_flags & DLM_LKF_NOQUEUEBAST)
 		send_blocking_asts_all(rsb, lkb);
+	/*
+	 * up the res_lock before queueing ast, since the AST_DEL will
+	 * cause the rsb to be released and that can happen anytime.
+	 */
+	up_write(&rsb->res_lock);
 	queue_ast(lkb, AST_COMP | AST_DEL, 0);
-	goto out;
+	return;
 }
 
 /*
@@ -888,7 +893,13 @@ int dlm_unlock_stage2(struct dlm_lkb *lk
 	lkb->lkb_retstatus = flags & DLM_LKF_CANCEL ? -DLM_ECANCEL:-DLM_EUNLOCK;
 
 	if (!remote) {
+		/*
+		 * up the res_lock before queueing ast, since the AST_DEL will
+		 * cause the rsb to be released and that can happen anytime.
+		 */
+		up_write(&rsb->res_lock);
 		queue_ast(lkb, AST_COMP | AST_DEL, 0);
+		goto out2;
 	} else {
 		up_write(&rsb->res_lock);
 		release_lkb(rsb->res_ls, lkb);

This did not fix my other hang; I'll try out Patrick's simple patch and see what happens.

Thanks,
Daniel