Re: Raid5 hang in 3.14.19

On Tue, 30 Sep 2014 16:21:32 -0500 BillStuff <billstuff2001@xxxxxxxxxxxxx>
wrote:

> On 09/29/2014 04:59 PM, NeilBrown wrote:
> > On Sun, 28 Sep 2014 23:28:17 -0500 BillStuff <billstuff2001@xxxxxxxxxxxxx>
> > wrote:
> >
> >> On 09/28/2014 11:08 PM, NeilBrown wrote:
> >>> On Sun, 28 Sep 2014 22:56:19 -0500 BillStuff <billstuff2001@xxxxxxxxxxxxx>
> >>> wrote:
> >>>
> >>>> On 09/28/2014 09:25 PM, NeilBrown wrote:
> >>>>> On Fri, 26 Sep 2014 17:33:58 -0500 BillStuff <billstuff2001@xxxxxxxxxxxxx>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Neil,
> >>>>>>
> >>>>>> I found something that looks similar to the problem described in
> >>>>>> "Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.
> >>>>>>
> >>>>>> It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs.
> >>>>>>
> >>>>>> on this array:
> >>>>>> md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
> >>>>>>           104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
> >>>>>>           bitmap: 1/5 pages [4KB], 2048KB chunk
> >>>>>>
> >>>>>> I was running a test doing parallel kernel builds, read/write loops, and
> >>>>>> disk add / remove / check loops,
> >>>>>> on both this array and a raid1 array.
> >>>>>>
> >>>>>> I was trying to stress test your recent raid1 fixes, which went well,
> >>>>>> but then after 5 days,
> >>>>>> the raid5 array hung up with this in dmesg:
> >>>>> I think this is different to the workqueue problem you mentioned, though as I
> >>>>> don't know exactly what caused either I cannot be certain.
> >>>>>
> >>>>>     From the data you provided it looks like everything is waiting on
> >>>>> get_active_stripe(), or on a process that is waiting on that.
> >>>>> That seems pretty common whenever anything goes wrong in raid5 :-(
> >>>>>
> >>>>> The md3_raid5 task is listed as blocked, but no stack trace is given.
> >>>>> If the machine is still in the state, then
> >>>>>
> >>>>>     cat /proc/1698/stack
> >>>>>
> >>>>> might be useful.
> >>>>> (echo t > /proc/sysrq-trigger is always a good idea)
> >>>> Might this help? I believe the array was doing a "check" when things
> >>>> hung up.
> >>> It looks like it was trying to start doing a 'check'.
> >>> The 'resync' thread hadn't been started yet.
> >>> What is 'kthreadd' doing?
> >>> My guess is that it is in try_to_free_pages() waiting for writeout
> >>> for some xfs file page onto the md array ... which won't progress until
> >>> the thread gets started.
> >>>
> >>> That would suggest that we need an async way to start threads...
> >>>
> >>> Thanks,
> >>> NeilBrown
> >>>
> >> I suspect your guess is correct:
> > Thanks for the confirmation.
> >
> > I'm thinking of something like the following patch.  Very basic testing
> > suggests it doesn't instantly crash.
> >
> > If you were to apply this patch and run your test for a week or two,  that
> > would increase my confidence (though of course testing doesn't prove the
> > absence of bugs....)
> >
> > Thanks,
> > NeilBrown
> >
> 
> Neil,
> 
> It locked up already, but in a different way.
> In this case it was triggered by a re-add in a fail / remove / add loop.
> 
> It kinda seems like md3_raid5 tried to "collect" (reap) md3_resync 
> before it got fully started.
> 
> 
> INFO: task md3_raid5:1698 blocked for more than 120 seconds.
>        Tainted: P           O 3.14.19fe-dirty #3
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   e8371dc4 00000046 c106d92d ea483770 d18a9994 ffd2177f ffffffff e8371dac
>   c17d6700 c17d6700 e9d3bcc0 c231bcc0 00000420 e8371d98 00007400 00000088
>   00000000 00000000 c33ead90 000003be 00000000 00000005 00000000 0000528b
> Call Trace:
>   [<c106d92d>] ? __enqueue_entity+0x6d/0x80
>   [<c1072683>] ? enqueue_task_fair+0x2d3/0x660
>   [<c153e893>] schedule+0x23/0x60
>   [<c153dc25>] schedule_timeout+0x145/0x1c0
>   [<c1069cb0>] ? default_wake_function+0x10/0x20
>   [<c1065698>] ? update_rq_clock.part.92+0x18/0x50
>   [<c1067a65>] ? check_preempt_curr+0x65/0x90
>   [<c1067aa8>] ? ttwu_do_wakeup+0x18/0x120
>   [<c153effb>] wait_for_common+0x9b/0x110
>   [<c1069ca0>] ? wake_up_process+0x40/0x40
>   [<c153f087>] wait_for_completion+0x17/0x20
>   [<c105b241>] kthread_stop+0x41/0xb0
>   [<c14540e1>] md_unregister_thread+0x31/0x40
>   [<c145a799>] md_reap_sync_thread+0x19/0x140
>   [<c145aba8>] md_check_recovery+0xe8/0x480
>   [<f3dc3a10>] raid5d+0x20/0x4c0 [raid456]
>   [<c104a022>] ? try_to_del_timer_sync+0x42/0x60
>   [<c104a081>] ? del_timer_sync+0x41/0x50
>   [<c153dbdd>] ? schedule_timeout+0xfd/0x1c0
>   [<c10497c0>] ? detach_if_pending+0xa0/0xa0
>   [<c10797b4>] ? finish_wait+0x44/0x60
>   [<c1454098>] md_thread+0xe8/0x100
>   [<c1079990>] ? __wake_up_sync+0x20/0x20
>   [<c1453fb0>] ? md_start_sync+0xb0/0xb0
>   [<c105ae21>] kthread+0xa1/0xc0
>   [<c15418b7>] ret_from_kernel_thread+0x1b/0x28
>   [<c105ad80>] ? kthread_create_on_node+0x110/0x110
> INFO: task kworker/0:0:12491 blocked for more than 120 seconds.
>        Tainted: P           O 3.14.19fe-dirty #3
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   e6f2ba58 00000046 00000086 00000086 e6f2ba08 c10795ff f395c035 000030ea
>   c17d6700 c17d6700 d18a9440 e8973cc0 e6f2ba14 c144fa3d 00000000 e6f2ba2c
>   c2054600 f3dbc81b ea5f4c80 c9f64060 c2054600 00000001 c9f64060 e6f2ba5c
> Call Trace:
>   [<c10795ff>] ? __wake_up+0x3f/0x50
>   [<c144fa3d>] ? md_wakeup_thread+0x2d/0x30
>   [<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
>   [<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
>   [<c153e893>] schedule+0x23/0x60
>   [<c153dc25>] schedule_timeout+0x145/0x1c0
>   [<c12b1192>] ? blk_finish_plug+0x12/0x40
>   [<f3c919f7>] ? _xfs_buf_ioapply+0x287/0x300 [xfs]
>   [<c153effb>] wait_for_common+0x9b/0x110
>   [<c1069ca0>] ? wake_up_process+0x40/0x40
>   [<c153f087>] wait_for_completion+0x17/0x20
>   [<f3c91e80>] xfs_buf_iowait+0x50/0xb0 [xfs]
>   [<f3c91f17>] ? _xfs_buf_read+0x37/0x40 [xfs]
>   [<f3c91f17>] _xfs_buf_read+0x37/0x40 [xfs]
>   [<f3c91fa5>] xfs_buf_read_map+0x85/0xe0 [xfs]
>   [<f3cefa99>] xfs_trans_read_buf_map+0x1e9/0x3e0 [xfs]
>   [<f3caa977>] xfs_alloc_read_agfl+0x97/0xc0 [xfs]
>   [<f3cad8a3>] xfs_alloc_fix_freelist+0x193/0x430 [xfs]
>   [<f3ce83fe>] ? xlog_grant_sub_space.isra.4+0x1e/0x70 [xfs]
>   [<f3cae17e>] xfs_free_extent+0x8e/0x100 [xfs]
>   [<c111be1e>] ? kmem_cache_alloc+0xae/0xf0
>   [<f3c8e12c>] xfs_bmap_finish+0x13c/0x190 [xfs]
>   [<f3cda6f2>] xfs_itruncate_extents+0x1b2/0x2b0 [xfs]
>   [<f3c8f0c3>] xfs_free_eofblocks+0x243/0x2e0 [xfs]
>   [<f3c9abc6>] xfs_inode_free_eofblocks+0x96/0x140 [xfs]
>   [<f3c9ab30>] ? xfs_inode_clear_eofblocks_tag+0x160/0x160 [xfs]
>   [<f3c9956d>] xfs_inode_ag_walk.isra.7+0x1cd/0x2f0 [xfs]
>   [<f3c9ab30>] ? xfs_inode_clear_eofblocks_tag+0x160/0x160 [xfs]
>   [<c12d1631>] ? radix_tree_gang_lookup_tag+0x71/0xb0
>   [<f3ce55fb>] ? xfs_perag_get_tag+0x2b/0xb0 [xfs]
>   [<f3c9a56d>] ? xfs_inode_ag_iterator_tag+0x3d/0xa0 [xfs]
>   [<f3c9a59a>] xfs_inode_ag_iterator_tag+0x6a/0xa0 [xfs]
>   [<f3c9ab30>] ? xfs_inode_clear_eofblocks_tag+0x160/0x160 [xfs]
>   [<f3c9a82e>] xfs_icache_free_eofblocks+0x2e/0x40 [xfs]
>   [<f3c9a858>] xfs_eofblocks_worker+0x18/0x30 [xfs]
>   [<c105521c>] process_one_work+0x10c/0x340
>   [<c1055d31>] worker_thread+0x101/0x330
>   [<c1055c30>] ? manage_workers.isra.27+0x250/0x250
>   [<c105ae21>] kthread+0xa1/0xc0
>   [<c15418b7>] ret_from_kernel_thread+0x1b/0x28
>   [<c105ad80>] ? kthread_create_on_node+0x110/0x110
> INFO: task xfs_fsr:24262 blocked for more than 120 seconds.
>        Tainted: P           O 3.14.19fe-dirty #3
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   e4d17aec 00000082 00000096 00000096 e4d17a9c c10795ff 00000000 00000000
>   c17d6700 c17d6700 e9d3b7b0 c70a8000 e4d17aa8 c144fa3d 00000000 e4d17ac0
>   c2054600 f3dbc81b ea7ace80 d7b74cc0 c2054600 00000002 d7b74cc0 e4d17af0
> Call Trace:
>   [<c10795ff>] ? __wake_up+0x3f/0x50
>   [<c144fa3d>] ? md_wakeup_thread+0x2d/0x30
>   [<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
>   [<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
>   [<c153e893>] schedule+0x23/0x60
>   [<c153dc25>] schedule_timeout+0x145/0x1c0
>   [<c12b1192>] ? blk_finish_plug+0x12/0x40
>   [<f3c919f7>] ? _xfs_buf_ioapply+0x287/0x300 [xfs]
>   [<c153effb>] wait_for_common+0x9b/0x110
>   [<c1069ca0>] ? wake_up_process+0x40/0x40
>   [<c153f087>] wait_for_completion+0x17/0x20
>   [<f3c91e80>] xfs_buf_iowait+0x50/0xb0 [xfs]
>   [<f3c91f17>] ? _xfs_buf_read+0x37/0x40 [xfs]
>   [<f3c91f17>] _xfs_buf_read+0x37/0x40 [xfs]
>   [<f3c91fa5>] xfs_buf_read_map+0x85/0xe0 [xfs]
>   [<f3cefa29>] xfs_trans_read_buf_map+0x179/0x3e0 [xfs]
>   [<f3cde727>] xfs_imap_to_bp+0x67/0xe0 [xfs]
>   [<f3cdec7d>] xfs_iread+0x7d/0x3e0 [xfs]
>   [<f3c996e8>] ? xfs_inode_alloc+0x58/0x1b0 [xfs]
>   [<f3c9a092>] xfs_iget+0x192/0x580 [xfs]
>   [<f3caa219>] ? kmem_free+0x19/0x50 [xfs]
>   [<f3ca0ce6>] xfs_bulkstat_one_int+0x86/0x2c0 [xfs]
>   [<f3ca0f54>] xfs_bulkstat_one+0x34/0x40 [xfs]
>   [<f3ca0c10>] ? xfs_internal_inum+0xa0/0xa0 [xfs]
>   [<f3ca1360>] xfs_bulkstat+0x400/0x870 [xfs]
>   [<f3c9ae86>] xfs_ioc_bulkstat+0xb6/0x160 [xfs]
>   [<f3ca0f20>] ? xfs_bulkstat_one_int+0x2c0/0x2c0 [xfs]
>   [<f3c9ccb0>] ? xfs_ioc_swapext+0x160/0x160 [xfs]
>   [<f3c9d433>] xfs_file_ioctl+0x783/0xa60 [xfs]
>   [<c106bf9b>] ? __update_cpu_load+0xab/0xd0
>   [<c105dc58>] ? hrtimer_forward+0xa8/0x1b0
>   [<c12d4050>] ? timerqueue_add+0x50/0xb0
>   [<c108d143>] ? ktime_get+0x53/0xe0
>   [<c1094035>] ? clockevents_program_event+0x95/0x130
>   [<f3c9ccb0>] ? xfs_ioc_swapext+0x160/0x160 [xfs]
>   [<c11349b2>] do_vfs_ioctl+0x2e2/0x4c0
>   [<c1095699>] ? tick_program_event+0x29/0x30
>   [<c105e42c>] ? hrtimer_interrupt+0x13c/0x2a0
>   [<c12d8dfb>] ? lockref_put_or_lock+0xb/0x30
>   [<c1134c08>] SyS_ioctl+0x78/0x80
>   [<c1540fa8>] syscall_call+0x7/0x7
> INFO: task md3_resync:24817 blocked for more than 120 seconds.
>        Tainted: P           O 3.14.19fe-dirty #3
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   c2d95e08 00000046 00000092 00000000 00000020 00000092 1befb1af 000030e0
>   c17d6700 c17d6700 e9d3c6e0 d18a9950 c2d95ddc c12adbc5 00000000 00000001
>   00000000 00000001 c2d95dec e95f0000 00000001 c2d95e08 00000292 00000292
> Call Trace:
>   [<c12adbc5>] ? queue_unplugged+0x45/0x90
>   [<c10798bb>] ? prepare_to_wait_event+0x6b/0xd0
>   [<c153e893>] schedule+0x23/0x60
>   [<c145782d>] md_do_sync+0x8dd/0x1010
>   [<c1079990>] ? __wake_up_sync+0x20/0x20
>   [<c1453fb0>] ? md_start_sync+0xb0/0xb0
>   [<c1454098>] md_thread+0xe8/0x100
>   [<c107945f>] ? __wake_up_locked+0x1f/0x30
>   [<c1453fb0>] ? md_start_sync+0xb0/0xb0
>   [<c105ae21>] kthread+0xa1/0xc0
>   [<c15418b7>] ret_from_kernel_thread+0x1b/0x28
>   [<c105ad80>] ? kthread_create_on_node+0x110/0x110
> 
> Again, the blocked task info is edited for space; hopefully these are 
> the important ones,
> but I've got more if it helps.

Thanks for the testing!  You have included enough information.
I didn't really like that 'sync_starting' variable when I wrote the patch,
but it seemed to do the right thing.  It doesn't.

If md_check_recovery() runs again immediately after scheduling the sync
thread to run, it will not have set sync_starting but will find ->sync_thread
is NULL and so will clear MD_RECOVERY_RUNNING.  The next time it runs, that
flag is still clear and ->sync_thread is not NULL so it will try to stop the
thread, which deadlocks.
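
[Editorial note] To make that interleaving concrete, here is a minimal userspace
model of the race -- a sketch only.  recovery_running, recovery_done,
sync_thread and start_sync_queued are simplified stand-ins for
MD_RECOVERY_RUNNING, MD_RECOVERY_DONE, mddev->sync_thread and the queued
md_start_sync work; the real md_check_recovery() does far more than this, and
the structure of the first patch's "unlock" tail is inferred from the diff
below.

/* Sketch of the race: not kernel code, just the flag logic. */
#include <stdbool.h>
#include <stdio.h>

static bool recovery_running;   /* models MD_RECOVERY_RUNNING                */
static bool recovery_done;      /* models MD_RECOVERY_DONE                   */
static bool sync_thread;        /* models mddev->sync_thread != NULL         */
static bool start_sync_queued;  /* models the md_start_sync work, not yet run */
static int  pass;

/* One pass of the first (buggy) patch's md_check_recovery() logic. */
static void check_recovery(bool want_sync)
{
	bool sync_starting = false;     /* local, so false again on every pass */

	pass++;

	/* simplified "a running, unfinished resync is left alone" check */
	if (recovery_running && !recovery_done)
		goto unlock;

	/* an existing sync thread gets reaped via kthread_stop() */
	if (sync_thread) {
		printf("pass %d: kthread_stop() on a sync thread that is only "
		       "just starting -> deadlock\n", pass);
		return;                 /* in the kernel this never returns */
	}

	if (want_sync) {
		recovery_running = true;        /* MD_RECOVERY_RUNNING set   */
		start_sync_queued = true;       /* queue_work(md_start_sync) */
		sync_starting = true;
	}

unlock:
	/* tail of the first patch: reached from *every* exit path above */
	if (!sync_thread && !sync_starting) {
		recovery_running = false;
		printf("pass %d: cleared MD_RECOVERY_RUNNING although "
		       "md_start_sync is still pending\n", pass);
	}
}

/* The deferred md_start_sync work item finally runs. */
static void start_sync_work(void)
{
	if (start_sync_queued) {
		start_sync_queued = false;
		sync_thread = true;             /* thread registered */
	}
}

int main(void)
{
	check_recovery(true);   /* pass 1: queues the work, keeps RUNNING set  */
	check_recovery(false);  /* pass 2: thread not registered yet, RUNNING  */
				/*         gets cleared by mistake             */
	start_sync_work();      /* md_start_sync registers ->sync_thread       */
	check_recovery(false);  /* pass 3: RUNNING clear, thread exists: reap  */
	return 0;
}

Pass 2 exits through the "resync already running" path, yet the shared tail
still clears the flag because the work item has not registered ->sync_thread;
pass 3 then finds a sync thread with RUNNING clear and tries to stop it.
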

This patch on top of what you have should fix it... but I might end up
redoing the logic a bit.

Thanks,
NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 211a615ae415..3df888b80e76 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7850,7 +7850,6 @@ void md_check_recovery(struct mddev *mddev)
 
 	if (mddev_trylock(mddev)) {
 		int spares = 0;
-		bool sync_starting = false;
 
 		if (mddev->ro) {
 			/* On a read-only array we can:
@@ -7914,7 +7913,7 @@ void md_check_recovery(struct mddev *mddev)
 
 		if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
 		    test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
-			goto unlock;
+			goto not_running;
 		/* no recovery is running.
 		 * remove any failed drives, then
 		 * add spares if possible.
@@ -7926,7 +7925,7 @@ void md_check_recovery(struct mddev *mddev)
 			if (mddev->pers->check_reshape == NULL ||
 			    mddev->pers->check_reshape(mddev) != 0)
 				/* Cannot proceed */
-				goto unlock;
+				goto not_running;
 			set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
 			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
 		} else if ((spares = remove_and_add_spares(mddev, NULL))) {
@@ -7939,7 +7938,7 @@ void md_check_recovery(struct mddev *mddev)
 			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
 		} else if (!test_bit(MD_RECOVERY_SYNC, &mddev->recovery))
 			/* nothing to be done ... */
-			goto unlock;
+			goto not_running;
 
 		if (mddev->pers->sync_request) {
 			if (spares) {
@@ -7951,18 +7950,18 @@ void md_check_recovery(struct mddev *mddev)
 			}
 			INIT_WORK(&mddev->del_work, md_start_sync);
 			queue_work(md_misc_wq, &mddev->del_work);
-			sync_starting = true;
+			goto unlock;
 		}
-	unlock:
-		wake_up(&mddev->sb_wait);
-
-		if (!mddev->sync_thread && !sync_starting) {
+	not_running:
+		if (!mddev->sync_thread) {
 			clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
 			if (test_and_clear_bit(MD_RECOVERY_RECOVER,
 					       &mddev->recovery))
 				if (mddev->sysfs_action)
 					sysfs_notify_dirent_safe(mddev->sysfs_action);
 		}
+	unlock:
+		wake_up(&mddev->sb_wait);
 		mddev_unlock(mddev);
 	}
 }
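
[Editorial note] Earlier in the thread Neil suggested that the sync thread
needs an asynchronous way to be started; the INIT_WORK/queue_work lines above
are that mechanism.  The following is a rough, hedged sketch of the pattern
only: the demo_* names are made up for illustration, not the real md.c
symbols, and the real md_start_sync() presumably also unwinds the recovery
flags if thread creation fails.

#include <linux/err.h>
#include <linux/jiffies.h>
#include <linux/kernel.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/workqueue.h>

struct demo_dev {
	struct work_struct start_work;
	struct task_struct *sync_thread;
};

/* Stand-in for md_do_sync(): run until kthread_stop() is called. */
static int demo_sync_fn(void *data)
{
	while (!kthread_should_stop())
		schedule_timeout_interruptible(HZ);
	return 0;
}

/* Runs from a workqueue in plain process context, so it may sleep freely. */
static void demo_start_sync(struct work_struct *ws)
{
	struct demo_dev *dev = container_of(ws, struct demo_dev, start_work);

	dev->sync_thread = kthread_run(demo_sync_fn, dev, "demo_resync");
	if (IS_ERR(dev->sync_thread))
		dev->sync_thread = NULL;  /* real code would also clear flags */
}

/* Where the check/recovery logic decides a sync is needed: queue, don't create. */
static void demo_request_sync(struct demo_dev *dev)
{
	INIT_WORK(&dev->start_work, demo_start_sync);
	queue_work(system_wq, &dev->start_work);
}

The point of deferring creation is that the caller may be reached from memory
reclaim (as in the original hang, with kthreadd stuck in try_to_free_pages()
waiting for writeout to the array), whereas the workqueue worker is not.
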


