David,

Even with your dlm-astd-wake.patch my nodes still hung in umount.  I put
in code to spin doing write_trylock() in the dlm code, print a stack
trace, and break out if it spun too long.  Here's the stack trace from my
added dump_stack():

 [<c010423e>] dump_stack+0x1e/0x30
 [<f8aeb64b>] write_lock_dir+0x5b/0x70 [dlm]
 [<f8aeb6a8>] dlm_dir_remove+0x48/0x140 [dlm]
 [<f8afd712>] _release_rsb+0x162/0x2e0 [dlm]
 [<f8afd8a9>] release_rsb+0x19/0x20 [dlm]
 [<f8ae84c6>] process_asts+0xe6/0x200 [dlm]
 [<f8ae8f0b>] dlm_astd+0x1db/0x210 [dlm]
 [<c013325a>] kthread+0xba/0xc0
 [<c01013c5>] kernel_thread_helper+0x5/0x10

After this the code oopsed:

Unable to handle kernel paging request at virtual address 6b6b6b6b
 printing eip:
f8aeb58e
*pde = 00000000
Oops: 0000 [#1]
PREEMPT SMP
Modules linked in: lock_dlm dlm lock_nolock qla2200 qla2xxx gfs lock_harness cman dm_mod video
CPU:    1
EIP:    0060:[<f8aeb58e>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11)
EIP is at search_bucket+0x1e/0x80 [dlm]
eax: 000009ec   ebx: 00000010   ecx: 6b6b6b6b   edx: cd2c4000
esi: c96717ad   edi: 0000007f   ebp: cd80fee4   esp: cd80fed0
ds: 007b   es: 007b   ss: 0068
Process dlm_astd (pid: 29107, threadinfo=cd80f000 task=f6d53040)
Stack: 6b6b6b6b e87520ac 00000010 c96717ad 0000007f cd80ff14 f8aeb6c2 e87520ac
       c96717ad 00000010 0000007f 00000000 00000001 e87520ac 00000001 c9671728
       e87520ac cd80ff40 f8afd712 e87520ac 00000001 c96717ad 00000010 f8af490a
Call Trace:
 [<c01041ff>] show_stack+0x7f/0xa0
 [<c01043b2>] show_registers+0x162/0x1e0
 [<c01045de>] die+0xfe/0x190
 [<c0115892>] do_page_fault+0x3b2/0x6f2
 [<c0103e57>] error_code+0x2b/0x30
 [<f8aeb6c2>] dlm_dir_remove+0x62/0x140 [dlm]
 [<f8afd712>] _release_rsb+0x162/0x2e0 [dlm]
 [<f8afd8a9>] release_rsb+0x19/0x20 [dlm]
 [<f8ae84c6>] process_asts+0xe6/0x200 [dlm]
 [<f8ae8f0b>] dlm_astd+0x1db/0x210 [dlm]
 [<c013325a>] kthread+0xba/0xc0
 [<c01013c5>] kernel_thread_helper+0x5/0x10

I have slab debug on, so accessing 0x6b6b6b6b means we are accessing
freed memory.

Looking at the code, the problem is a race condition between dlm_astd()
and release_lockspace().  dlm_astd() can pull an lkb off the ast_queue
and still be processing it while release_lockspace() is running;
release_lockspace() calls dlm_dir_clear() and then kfree()s
ls->ls_dirtbl.  When dlm_astd() then calls release_rsb(), it ends up in
dlm_dir_remove(), which accesses ls_dirtbl after it has been freed.
With slab debug, this leads to a spinning write_lock() and a hung
umount.  My machines are 2-CPU systems, which may also help expose the
race.

The fix is below and is fairly simple: do the astd_suspend() in
release_lockspace() before the dlm_dir_clear() and kfree().  That way
astd won't be processing an lkb on the ast_queue while the directory
table is being freed.  Here's the patch:

--- dlm-kernel/src/lockspace.c.orig	2005-03-24 14:37:28.000000000 -0800
+++ dlm-kernel/src/lockspace.c	2005-03-24 14:42:37.000000000 -0800
@@ -477,6 +477,14 @@ static int release_lockspace(struct dlm_
 	remove_lockspace(ls);
 
 	/*
+	 * Suspend astd before doing the dlm_dir_clear() and kfree(),
+	 * otherwise astd can be processing an ast which can call release_rsb()
+	 * and then dlm_dir_remove() which references ls_dirtbl after
+	 * it has been freed.
+	 */
+	astd_suspend();
+
+	/*
 	 * Free direntry structs.
 	 */
 
@@ -487,8 +495,6 @@ static int release_lockspace(struct dlm_
 	 * Free all lkb's on lkbtbl[] lists.
 	 */
 
-	astd_suspend();
-
 	for (i = 0; i < ls->ls_lkbtbl_size; i++) {
 		head = &ls->ls_lkbtbl[i].list;
 		while (!list_empty(head)) {

Thoughts?

Daniel
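
P.S.  For anyone following along who doesn't know these code paths, here
is a minimal userspace sketch of the ordering problem.  It is only an
analogy: worker/table/worker_stop and everything else in it are invented
for illustration, none of it is actual dlm code.

/* race_sketch.c: simplified analogue of the dlm_astd/release_lockspace race.
 * A worker thread keeps dereferencing a shared table (like dlm_astd
 * touching ls_dirtbl); teardown must stop the worker before freeing the
 * table, just as the patch moves astd_suspend() before the kfree(). */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static int *table;               /* stands in for ls->ls_dirtbl */
static volatile int worker_stop; /* stands in for "suspend the worker" */

static void *worker(void *arg)
{
	/* Like dlm_astd: keeps walking the table until told to stop. */
	while (!worker_stop) {
		for (int i = 0; i < 16; i++)
			(void)table[i]; /* use-after-free if teardown freed it first */
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	table = calloc(16, sizeof(*table));
	pthread_create(&t, NULL, worker, NULL);

	/* Buggy order (old release_lockspace()): free the table first, stop
	 * the worker later -- the worker can still be reading freed memory.
	 * Fixed order (the patch): stop the worker, then free. */
	worker_stop = 1;
	pthread_join(t, NULL);  /* wait until the worker is really done */

	free(table);            /* safe now: nothing references it anymore */
	printf("teardown finished without touching freed memory\n");
	return 0;
}

Build with "gcc -pthread race_sketch.c".  In the real code the stop is
astd_suspend() rather than joining the thread, but the ordering
requirement is the same.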