Re: [Linux-cluster] umount hung single node

On Wed, Mar 09, 2005 at 05:26:34PM -0800, Daniel McNeil wrote:
> I upgraded to 2.6.11 and the latest cvs a few days ago.
> I started my tests on Mar  7 16:01 and they hung on Mar  9 12:34.
> This is a 3 node cluster, but the test that hung only has 1
> node with gfs mounted and it is trying to unmount:
> 
> root     12500 12494  0 12:34 ?        00:00:01 umount /gfs_stripe5
> 
> $ cat /proc/12500/wchan
> .text.lock.ast
> 
> dlm_astd is spinning as top shows:
> 
> 12302 root      20  -5     0    0    0 R 99.9  0.0 280:28.23 dlm_astd
> 
> I've attached the output from /proc/cluster/dlm_debug.
> 
> Is there any other useful data to pull off the node to see what is
> going on?

Were you using clvm (or, more specifically, was this node running clvmd)?
If not, then the unmount would stop all the dlm threads.  That's something
we seldom exercise in our testing, because clvmd is always still using the
dlm.  Starting clvmd on your nodes, even if you don't use it, would keep
unmount from stopping dlm_astd, which may avoid the problem.

I just ran across a possibly related problem where kthread_stop() couldn't
stop dlm_astd.  dlm_astd was in wait_event_interruptible() instead of
spinning, though.  The fix was to simply get rid of the unnecessary
wait_queue and the wait_event.  I'm hoping that might fix the problem
you're seeing, too.  I've attached the patch.
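
For illustration only (this is not part of the attached patch): a minimal
userspace sketch of the general idea, written with pthreads rather than
kernel primitives.  The worker sleeps only when neither a wake flag nor a
stop request is set, and the stop path wakes the worker itself.  All names
here (struct worker, worker_kick, worker_stop, and so on) are hypothetical
stand-ins; work_pending loosely plays the role of WAKE_ASTS and should_stop
the role of kthread_should_stop().

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the dlm_astd state. */
struct worker {
	pthread_mutex_t lock;
	pthread_cond_t  wake;
	bool            work_pending;	/* loosely: WAKE_ASTS */
	bool            should_stop;	/* loosely: kthread_should_stop() */
	int             handled;	/* number of wakeups serviced */
	pthread_t       thread;
};

static void *worker_main(void *arg)
{
	struct worker *w = arg;

	pthread_mutex_lock(&w->lock);
	for (;;) {
		/* Sleep only while there is neither work nor a stop request.
		 * Both flags are tested under the lock, so a wakeup sent
		 * just before we sleep cannot be lost. */
		while (!w->work_pending && !w->should_stop)
			pthread_cond_wait(&w->wake, &w->lock);

		if (w->work_pending) {
			w->work_pending = false;
			w->handled++;
			continue;
		}
		break;	/* stop requested and no work left */
	}
	pthread_mutex_unlock(&w->lock);
	return NULL;
}

static void worker_start(struct worker *w)
{
	int rc;

	pthread_mutex_init(&w->lock, NULL);
	pthread_cond_init(&w->wake, NULL);
	w->work_pending = false;
	w->should_stop = false;
	w->handled = 0;
	rc = pthread_create(&w->thread, NULL, worker_main, w);
	assert(rc == 0);
	(void)rc;
}

/* Flag pending work and wake the worker. */
static void worker_kick(struct worker *w)
{
	pthread_mutex_lock(&w->lock);
	w->work_pending = true;
	pthread_cond_signal(&w->wake);
	pthread_mutex_unlock(&w->lock);
}

/* Request a stop, wake the worker, and wait for it to exit.
 * Returns how many wakeups the worker serviced. */
static int worker_stop(struct worker *w)
{
	pthread_mutex_lock(&w->lock);
	w->should_stop = true;
	pthread_cond_signal(&w->wake);
	pthread_mutex_unlock(&w->lock);

	pthread_join(w->thread, NULL);
	pthread_cond_destroy(&w->wake);
	pthread_mutex_destroy(&w->lock);
	return w->handled;
}
```

In this sketch the stop flag is set and the worker is signalled under the
same lock the worker uses to test its flags, so worker_stop() returns
whether or not the worker had any work pending.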

-- 
Dave Teigland  <teigland@xxxxxxxxxx>

Index: ast.c
===================================================================
RCS file: /cvs/cluster/cluster/dlm-kernel/src/ast.c,v
retrieving revision 1.24
diff -u -r1.24 ast.c
--- ast.c	11 Mar 2005 08:15:59 -0000	1.24
+++ ast.c	18 Mar 2005 07:11:28 -0000
@@ -36,7 +36,6 @@
 
 static struct list_head		ast_queue;
 static struct semaphore		ast_queue_lock;
-static wait_queue_head_t	astd_waitchan;
 static struct task_struct *	astd_task;
 static unsigned long		astd_wakeflags;
 static struct semaphore		astd_running;
@@ -565,7 +572,7 @@
 static void lockqueue_timer_fn(unsigned long arg)
 {
 	set_bit(WAKE_TIMER, &astd_wakeflags);
-	wake_up(&astd_waitchan);
+	wake_up_process(astd_task);
 }
 
 /* 
@@ -588,7 +595,10 @@
 		  jiffies + ((dlm_config.lock_timeout >> 1) * HZ));
 
 	while (!kthread_should_stop()) {
-		wchan_cond_sleep_intr(astd_waitchan, !test_bit(WAKE_ASTS, &astd_wakeflags));
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (!test_bit(WAKE_ASTS, &astd_wakeflags))
+			schedule();
+		set_current_state(TASK_RUNNING);
 
 		down(&astd_running);
 		if (test_and_clear_bit(WAKE_ASTS, &astd_wakeflags))
@@ -612,7 +622,7 @@
 {
 	if (!no_asts()) {
 		set_bit(WAKE_ASTS, &astd_wakeflags);
-		wake_up(&astd_waitchan);
+		wake_up_process(astd_task);
 	}
 }
 
@@ -623,7 +633,6 @@
 
 	INIT_LIST_HEAD(&ast_queue);
 	init_MUTEX(&ast_queue_lock);
-	init_waitqueue_head(&astd_waitchan);
 	init_MUTEX(&astd_running);
 
 	p = kthread_run(dlm_astd, NULL, "dlm_astd");
@@ -637,7 +646,6 @@
 void astd_stop(void)
 {
 	kthread_stop(astd_task);
-	wake_up(&astd_waitchan);
 }
 
 void astd_suspend(void)
