On Wed, Mar 09, 2005 at 05:26:34PM -0800, Daniel McNeil wrote:
> I upgraded to 2.6.11 and the latest cvs a few days ago.
> I started my tests on Mar 7 16:01 and they hung on Mar 9 12:34.
> This is a 3 node cluster, but the test that hung only has 1
> node with gfs mounted and it is trying to unmount:
>
> root 12500 12494 0 12:34 ? 00:00:01 umount /gfs_stripe5
>
> $ cat /proc/12500/wchan
> .text.lock.ast
>
> dlm_astd is spinning as top shows:
>
> 12302 root 20 -5 0 0 0 R 99.9 0.0 280:28.23 dlm_astd
>
> I've attached the output from /proc/cluster/dlm_debug.
>
> Is there any other useful data to pull off the node to see what is
> going on?

Were you using clvm (or specifically, was this node running clvmd)?
If not, then the unmount would mean stopping all the dlm threads.
That's something we seldom do in our testing because clvmd is always
still using the dlm.  Starting clvmd on your nodes, even if you don't
use it, would avoid unmount stopping dlm_astd which may avert the
problem.

I just ran across a possibly related problem where kthread_stop()
couldn't stop dlm_astd.  dlm_astd was in wait_event_interruptible()
instead of spinning, though.  The fix was to simply get rid of the
unnecessary wait_queue and the wait_event.  I'm hoping that might fix
the problem you're seeing, too.  I've attached the patch.

--
Dave Teigland <teigland@xxxxxxxxxx>

Index: ast.c
===================================================================
RCS file: /cvs/cluster/cluster/dlm-kernel/src/ast.c,v
retrieving revision 1.24
diff -u -r1.24 ast.c
--- ast.c	11 Mar 2005 08:15:59 -0000	1.24
+++ ast.c	18 Mar 2005 07:11:28 -0000
@@ -36,7 +36,6 @@
 
 static struct list_head ast_queue;
 static struct semaphore ast_queue_lock;
-static wait_queue_head_t astd_waitchan;
 static struct task_struct * astd_task;
 static unsigned long astd_wakeflags;
 static struct semaphore astd_running;
@@ -565,7 +572,7 @@
 static void lockqueue_timer_fn(unsigned long arg)
 {
 	set_bit(WAKE_TIMER, &astd_wakeflags);
-	wake_up(&astd_waitchan);
+	wake_up_process(astd_task);
 }
 
 /*
@@ -588,7 +595,10 @@
 		  jiffies + ((dlm_config.lock_timeout >> 1) * HZ));
 
 	while (!kthread_should_stop()) {
-		wchan_cond_sleep_intr(astd_waitchan, !test_bit(WAKE_ASTS, &astd_wakeflags));
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (!test_bit(WAKE_ASTS, &astd_wakeflags))
+			schedule();
+		set_current_state(TASK_RUNNING);
 
 		down(&astd_running);
 		if (test_and_clear_bit(WAKE_ASTS, &astd_wakeflags))
@@ -612,7 +622,7 @@
 {
 	if (!no_asts()) {
 		set_bit(WAKE_ASTS, &astd_wakeflags);
-		wake_up(&astd_waitchan);
+		wake_up_process(astd_task);
 	}
 }
 
@@ -623,7 +633,6 @@
 
 	INIT_LIST_HEAD(&ast_queue);
 	init_MUTEX(&ast_queue_lock);
-	init_waitqueue_head(&astd_waitchan);
 	init_MUTEX(&astd_running);
 
 	p = kthread_run(dlm_astd, NULL, "dlm_astd");
@@ -637,7 +646,6 @@
 void astd_stop(void)
 {
 	kthread_stop(astd_task);
-	wake_up(&astd_waitchan);
 }
 
 void astd_suspend(void)
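
The patch replaces the wait queue with a flag plus a direct wakeup of the single daemon thread, so that the stop path and the work path wake the thread the same way. A rough userspace sketch of that shape may help readers without a kernel tree at hand. It is only an analogy under stated assumptions: pthreads stand in for kthreads, a condition variable stands in for wake_up_process(), and every name in it (struct worker, worker_kick, worker_stop, run_worker_demo) is hypothetical and does not appear in ast.c. Note that the sketch tests the stop flag together with the work flag before sleeping; it makes no claim about whether the kernel loop needs the same check.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for dlm_astd; not taken from ast.c. */
struct worker {
	pthread_t thread;
	pthread_mutex_t lock;
	pthread_cond_t wake;	/* plays the role of wake_up_process() */
	bool work_pending;	/* analogous to the WAKE_ASTS bit */
	bool should_stop;	/* analogous to kthread_should_stop() */
	int processed;		/* number of work rounds handled */
};

static void *worker_fn(void *arg)
{
	struct worker *w = arg;

	pthread_mutex_lock(&w->lock);
	for (;;) {
		/* Sleep only while there is neither work nor a stop request.
		 * Both flags are tested under the lock, so a wakeup sent
		 * just before we sleep cannot be missed. */
		while (!w->work_pending && !w->should_stop)
			pthread_cond_wait(&w->wake, &w->lock);

		if (w->work_pending) {
			w->work_pending = false;
			w->processed++;
			continue;	/* drain work before honouring stop */
		}
		break;			/* stop requested, nothing pending */
	}
	pthread_mutex_unlock(&w->lock);
	return NULL;
}

static int worker_start(struct worker *w)
{
	pthread_mutex_init(&w->lock, NULL);
	pthread_cond_init(&w->wake, NULL);
	w->work_pending = false;
	w->should_stop = false;
	w->processed = 0;
	return pthread_create(&w->thread, NULL, worker_fn, w);
}

/* Flag work and wake the one thread directly. */
static void worker_kick(struct worker *w)
{
	pthread_mutex_lock(&w->lock);
	w->work_pending = true;
	pthread_cond_signal(&w->wake);
	pthread_mutex_unlock(&w->lock);
}

/* Request stop through the same wakeup path, then wait for exit. */
static int worker_stop(struct worker *w)
{
	int processed;

	pthread_mutex_lock(&w->lock);
	w->should_stop = true;
	pthread_cond_signal(&w->wake);
	pthread_mutex_unlock(&w->lock);

	pthread_join(w->thread, NULL);
	processed = w->processed;
	pthread_cond_destroy(&w->wake);
	pthread_mutex_destroy(&w->lock);
	return processed;
}

/* Start the worker, queue one round of work, stop it, report the count. */
int run_worker_demo(void)
{
	struct worker w;
	int rc = worker_start(&w);

	assert(rc == 0);
	worker_kick(&w);
	return worker_stop(&w);
}
```

Calling run_worker_demo() from a small main() and building with -pthread should return once the worker has handled the single queued round and exited. The point of the sketch is that worker_stop() cannot hang on a sleeping worker, because the sleeper re-tests both flags every time it is woken.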