On Tue, 30 Sep 2014 16:21:32 -0500 BillStuff <billstuff2001@xxxxxxxxxxxxx> wrote:

> On 09/29/2014 04:59 PM, NeilBrown wrote:
> > On Sun, 28 Sep 2014 23:28:17 -0500 BillStuff <billstuff2001@xxxxxxxxxxxxx>
> > wrote:
> >
> >> On 09/28/2014 11:08 PM, NeilBrown wrote:
> >>> On Sun, 28 Sep 2014 22:56:19 -0500 BillStuff <billstuff2001@xxxxxxxxxxxxx>
> >>> wrote:
> >>>
> >>>> On 09/28/2014 09:25 PM, NeilBrown wrote:
> >>>>> On Fri, 26 Sep 2014 17:33:58 -0500 BillStuff <billstuff2001@xxxxxxxxxxxxx>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Neil,
> >>>>>>
> >>>>>> I found something that looks similar to the problem described in
> >>>>>> "Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.
> >>>>>>
> >>>>>> It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs.
> >>>>>>
> >>>>>> on this array:
> >>>>>> md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
> >>>>>>       104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
> >>>>>>       bitmap: 1/5 pages [4KB], 2048KB chunk
> >>>>>>
> >>>>>> I was running a test doing parallel kernel builds, read/write loops, and
> >>>>>> disk add / remove / check loops,
> >>>>>> on both this array and a raid1 array.
> >>>>>>
> >>>>>> I was trying to stress test your recent raid1 fixes, which went well,
> >>>>>> but then after 5 days,
> >>>>>> the raid5 array hung up with this in dmesg:
> >>>>> I think this is different to the workqueue problem you mentioned, though as I
> >>>>> don't know exactly what caused either I cannot be certain.
> >>>>>
> >>>>> From the data you provided it looks like everything is waiting on
> >>>>> get_active_stripe(), or on a process that is waiting on that.
> >>>>> That seems pretty common whenever anything goes wrong in raid5 :-(
> >>>>>
> >>>>> The md3_raid5 task is listed as blocked, but no stack trace is given.
> >>>>> If the machine is still in the state, then
> >>>>>
> >>>>>    cat /proc/1698/stack
> >>>>>
> >>>>> might be useful.
> >>>>> (echo t > /proc/sysrq-trigger is always a good idea)
> >>>> Might this help? I believe the array was doing a "check" when things
> >>>> hung up.
> >>> It looks like it was trying to start doing a 'check'.
> >>> The 'resync' thread hadn't been started yet.
> >>> What is 'kthreadd' doing?
> >>> My guess is that it is in try_to_free_pages() waiting for writeout
> >>> for some xfs file page onto the md array ... which won't progress until
> >>> the thread gets started.
> >>>
> >>> That would suggest that we need an async way to start threads...
> >>>
> >>> Thanks,
> >>> NeilBrown
> >>>
> >> I suspect your guess is correct:
> > Thanks for the confirmation.
> >
> > I'm thinking of something like this.  Very basic testing suggests it doesn't
> > instantly crash.
> >
> > If you were to apply this patch and run your test for a week or two, that
> > would increase my confidence (though of course testing doesn't prove the
> > absence of bugs....)
> >
> > Thanks,
> > NeilBrown
> >
>
> Neil,
>
> It locked up already, but in a different way.
> In this case it was triggered by a re-add in a fail / remove / add loop.
>
> It kinda seems like md3_raid5 tried to "collect" (reap) md3_resync
> before it got fully started.
>
>
> INFO: task md3_raid5:1698 blocked for more than 120 seconds.
>       Tainted: P O 3.14.19fe-dirty #3
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> e8371dc4 00000046 c106d92d ea483770 d18a9994 ffd2177f ffffffff e8371dac
> c17d6700 c17d6700 e9d3bcc0 c231bcc0 00000420 e8371d98 00007400 00000088
> 00000000 00000000 c33ead90 000003be 00000000 00000005 00000000 0000528b
> Call Trace:
> [<c106d92d>] ? __enqueue_entity+0x6d/0x80
> [<c1072683>] ? enqueue_task_fair+0x2d3/0x660
> [<c153e893>] schedule+0x23/0x60
> [<c153dc25>] schedule_timeout+0x145/0x1c0
> [<c1069cb0>] ? default_wake_function+0x10/0x20
> [<c1065698>] ? update_rq_clock.part.92+0x18/0x50
> [<c1067a65>] ? check_preempt_curr+0x65/0x90
> [<c1067aa8>] ? ttwu_do_wakeup+0x18/0x120
> [<c153effb>] wait_for_common+0x9b/0x110
> [<c1069ca0>] ? wake_up_process+0x40/0x40
> [<c153f087>] wait_for_completion+0x17/0x20
> [<c105b241>] kthread_stop+0x41/0xb0
> [<c14540e1>] md_unregister_thread+0x31/0x40
> [<c145a799>] md_reap_sync_thread+0x19/0x140
> [<c145aba8>] md_check_recovery+0xe8/0x480
> [<f3dc3a10>] raid5d+0x20/0x4c0 [raid456]
> [<c104a022>] ? try_to_del_timer_sync+0x42/0x60
> [<c104a081>] ? del_timer_sync+0x41/0x50
> [<c153dbdd>] ? schedule_timeout+0xfd/0x1c0
> [<c10497c0>] ? detach_if_pending+0xa0/0xa0
> [<c10797b4>] ? finish_wait+0x44/0x60
> [<c1454098>] md_thread+0xe8/0x100
> [<c1079990>] ? __wake_up_sync+0x20/0x20
> [<c1453fb0>] ? md_start_sync+0xb0/0xb0
> [<c105ae21>] kthread+0xa1/0xc0
> [<c15418b7>] ret_from_kernel_thread+0x1b/0x28
> [<c105ad80>] ? kthread_create_on_node+0x110/0x110
> INFO: task kworker/0:0:12491 blocked for more than 120 seconds.
>       Tainted: P O 3.14.19fe-dirty #3
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> e6f2ba58 00000046 00000086 00000086 e6f2ba08 c10795ff f395c035 000030ea
> c17d6700 c17d6700 d18a9440 e8973cc0 e6f2ba14 c144fa3d 00000000 e6f2ba2c
> c2054600 f3dbc81b ea5f4c80 c9f64060 c2054600 00000001 c9f64060 e6f2ba5c
> Call Trace:
> [<c10795ff>] ? __wake_up+0x3f/0x50
> [<c144fa3d>] ? md_wakeup_thread+0x2d/0x30
> [<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
> [<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
> [<c153e893>] schedule+0x23/0x60
> [<c153dc25>] schedule_timeout+0x145/0x1c0
> [<c12b1192>] ? blk_finish_plug+0x12/0x40
> [<f3c919f7>] ? _xfs_buf_ioapply+0x287/0x300 [xfs]
> [<c153effb>] wait_for_common+0x9b/0x110
> [<c1069ca0>] ? wake_up_process+0x40/0x40
> [<c153f087>] wait_for_completion+0x17/0x20
> [<f3c91e80>] xfs_buf_iowait+0x50/0xb0 [xfs]
> [<f3c91f17>] ? _xfs_buf_read+0x37/0x40 [xfs]
> [<f3c91f17>] _xfs_buf_read+0x37/0x40 [xfs]
> [<f3c91fa5>] xfs_buf_read_map+0x85/0xe0 [xfs]
> [<f3cefa99>] xfs_trans_read_buf_map+0x1e9/0x3e0 [xfs]
> [<f3caa977>] xfs_alloc_read_agfl+0x97/0xc0 [xfs]
> [<f3cad8a3>] xfs_alloc_fix_freelist+0x193/0x430 [xfs]
> [<f3ce83fe>] ? xlog_grant_sub_space.isra.4+0x1e/0x70 [xfs]
> [<f3cae17e>] xfs_free_extent+0x8e/0x100 [xfs]
> [<c111be1e>] ? kmem_cache_alloc+0xae/0xf0
> [<f3c8e12c>] xfs_bmap_finish+0x13c/0x190 [xfs]
> [<f3cda6f2>] xfs_itruncate_extents+0x1b2/0x2b0 [xfs]
> [<f3c8f0c3>] xfs_free_eofblocks+0x243/0x2e0 [xfs]
> [<f3c9abc6>] xfs_inode_free_eofblocks+0x96/0x140 [xfs]
> [<f3c9ab30>] ? xfs_inode_clear_eofblocks_tag+0x160/0x160 [xfs]
> [<f3c9956d>] xfs_inode_ag_walk.isra.7+0x1cd/0x2f0 [xfs]
> [<f3c9ab30>] ? xfs_inode_clear_eofblocks_tag+0x160/0x160 [xfs]
> [<c12d1631>] ? radix_tree_gang_lookup_tag+0x71/0xb0
> [<f3ce55fb>] ? xfs_perag_get_tag+0x2b/0xb0 [xfs]
> [<f3c9a56d>] ? xfs_inode_ag_iterator_tag+0x3d/0xa0 [xfs]
> [<f3c9a59a>] xfs_inode_ag_iterator_tag+0x6a/0xa0 [xfs]
> [<f3c9ab30>] ? xfs_inode_clear_eofblocks_tag+0x160/0x160 [xfs]
> [<f3c9a82e>] xfs_icache_free_eofblocks+0x2e/0x40 [xfs]
> [<f3c9a858>] xfs_eofblocks_worker+0x18/0x30 [xfs]
> [<c105521c>] process_one_work+0x10c/0x340
> [<c1055d31>] worker_thread+0x101/0x330
> [<c1055c30>] ? manage_workers.isra.27+0x250/0x250
> [<c105ae21>] kthread+0xa1/0xc0
> [<c15418b7>] ret_from_kernel_thread+0x1b/0x28
> [<c105ad80>] ? kthread_create_on_node+0x110/0x110
> INFO: task xfs_fsr:24262 blocked for more than 120 seconds.
>       Tainted: P O 3.14.19fe-dirty #3
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> e4d17aec 00000082 00000096 00000096 e4d17a9c c10795ff 00000000 00000000
> c17d6700 c17d6700 e9d3b7b0 c70a8000 e4d17aa8 c144fa3d 00000000 e4d17ac0
> c2054600 f3dbc81b ea7ace80 d7b74cc0 c2054600 00000002 d7b74cc0 e4d17af0
> Call Trace:
> [<c10795ff>] ? __wake_up+0x3f/0x50
> [<c144fa3d>] ? md_wakeup_thread+0x2d/0x30
> [<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
> [<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
> [<c153e893>] schedule+0x23/0x60
> [<c153dc25>] schedule_timeout+0x145/0x1c0
> [<c12b1192>] ? blk_finish_plug+0x12/0x40
> [<f3c919f7>] ? _xfs_buf_ioapply+0x287/0x300 [xfs]
> [<c153effb>] wait_for_common+0x9b/0x110
> [<c1069ca0>] ? wake_up_process+0x40/0x40
> [<c153f087>] wait_for_completion+0x17/0x20
> [<f3c91e80>] xfs_buf_iowait+0x50/0xb0 [xfs]
> [<f3c91f17>] ? _xfs_buf_read+0x37/0x40 [xfs]
> [<f3c91f17>] _xfs_buf_read+0x37/0x40 [xfs]
> [<f3c91fa5>] xfs_buf_read_map+0x85/0xe0 [xfs]
> [<f3cefa29>] xfs_trans_read_buf_map+0x179/0x3e0 [xfs]
> [<f3cde727>] xfs_imap_to_bp+0x67/0xe0 [xfs]
> [<f3cdec7d>] xfs_iread+0x7d/0x3e0 [xfs]
> [<f3c996e8>] ? xfs_inode_alloc+0x58/0x1b0 [xfs]
> [<f3c9a092>] xfs_iget+0x192/0x580 [xfs]
> [<f3caa219>] ? kmem_free+0x19/0x50 [xfs]
> [<f3ca0ce6>] xfs_bulkstat_one_int+0x86/0x2c0 [xfs]
> [<f3ca0f54>] xfs_bulkstat_one+0x34/0x40 [xfs]
> [<f3ca0c10>] ? xfs_internal_inum+0xa0/0xa0 [xfs]
> [<f3ca1360>] xfs_bulkstat+0x400/0x870 [xfs]
> [<f3c9ae86>] xfs_ioc_bulkstat+0xb6/0x160 [xfs]
> [<f3ca0f20>] ? xfs_bulkstat_one_int+0x2c0/0x2c0 [xfs]
> [<f3c9ccb0>] ? xfs_ioc_swapext+0x160/0x160 [xfs]
> [<f3c9d433>] xfs_file_ioctl+0x783/0xa60 [xfs]
> [<c106bf9b>] ? __update_cpu_load+0xab/0xd0
> [<c105dc58>] ? hrtimer_forward+0xa8/0x1b0
> [<c12d4050>] ? timerqueue_add+0x50/0xb0
> [<c108d143>] ? ktime_get+0x53/0xe0
> [<c1094035>] ? clockevents_program_event+0x95/0x130
> [<f3c9ccb0>] ? xfs_ioc_swapext+0x160/0x160 [xfs]
> [<c11349b2>] do_vfs_ioctl+0x2e2/0x4c0
> [<c1095699>] ? tick_program_event+0x29/0x30
> [<c105e42c>] ? hrtimer_interrupt+0x13c/0x2a0
> [<c12d8dfb>] ? lockref_put_or_lock+0xb/0x30
> [<c1134c08>] SyS_ioctl+0x78/0x80
> [<c1540fa8>] syscall_call+0x7/0x7
> INFO: task md3_resync:24817 blocked for more than 120 seconds.
>       Tainted: P O 3.14.19fe-dirty #3
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> c2d95e08 00000046 00000092 00000000 00000020 00000092 1befb1af 000030e0
> c17d6700 c17d6700 e9d3c6e0 d18a9950 c2d95ddc c12adbc5 00000000 00000001
> 00000000 00000001 c2d95dec e95f0000 00000001 c2d95e08 00000292 00000292
> Call Trace:
> [<c12adbc5>] ? queue_unplugged+0x45/0x90
> [<c10798bb>] ? prepare_to_wait_event+0x6b/0xd0
> [<c153e893>] schedule+0x23/0x60
> [<c145782d>] md_do_sync+0x8dd/0x1010
> [<c1079990>] ? __wake_up_sync+0x20/0x20
> [<c1453fb0>] ? md_start_sync+0xb0/0xb0
> [<c1454098>] md_thread+0xe8/0x100
> [<c107945f>] ? __wake_up_locked+0x1f/0x30
> [<c1453fb0>] ? md_start_sync+0xb0/0xb0
> [<c105ae21>] kthread+0xa1/0xc0
> [<c15418b7>] ret_from_kernel_thread+0x1b/0x28
> [<c105ad80>] ? kthread_create_on_node+0x110/0x110
>
> Again, the blocked task info is edited for space; hopefully these are
> the important ones,
> but I've got more if it helps.

Thanks for the testing!  You have included enough information.

I didn't really like that 'sync_starting' variable when I wrote the patch,
but it seemed to do the right thing.  It doesn't.

If md_check_recovery() runs again immediately after scheduling the sync
thread to run, it will not have set sync_starting but will find
->sync_thread is NULL and so will clear MD_RECOVERY_RUNNING.
The next time it runs, that flag is still clear and ->sync_thread is not
NULL so it will try to stop the thread, which deadlocks.

This patch on top of what you have should fix it... but I might end up
redoing the logic a bit.

Thanks,
NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 211a615ae415..3df888b80e76 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7850,7 +7850,6 @@ void md_check_recovery(struct mddev *mddev)

 	if (mddev_trylock(mddev)) {
 		int spares = 0;
-		bool sync_starting = false;

 		if (mddev->ro) {
 			/* On a read-only array we can:
@@ -7914,7 +7913,7 @@ void md_check_recovery(struct mddev *mddev)
 		if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
 		    test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
-			goto unlock;
+			goto not_running;
 		/* no recovery is running.
 		 * remove any failed drives, then
 		 * add spares if possible.
@@ -7926,7 +7925,7 @@ void md_check_recovery(struct mddev *mddev)
 			if (mddev->pers->check_reshape == NULL ||
 			    mddev->pers->check_reshape(mddev) != 0)
 				/* Cannot proceed */
-				goto unlock;
+				goto not_running;
 			set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
 			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
 		} else if ((spares = remove_and_add_spares(mddev, NULL))) {
@@ -7939,7 +7938,7 @@ void md_check_recovery(struct mddev *mddev)
 			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
 		} else if (!test_bit(MD_RECOVERY_SYNC, &mddev->recovery))
 			/* nothing to be done ... */
-			goto unlock;
+			goto not_running;

 		if (mddev->pers->sync_request) {
 			if (spares) {
@@ -7951,18 +7950,18 @@ void md_check_recovery(struct mddev *mddev)
 			}
 			INIT_WORK(&mddev->del_work, md_start_sync);
 			queue_work(md_misc_wq, &mddev->del_work);
-			sync_starting = true;
+			goto unlock;
 		}
-	unlock:
-		wake_up(&mddev->sb_wait);
-
-		if (!mddev->sync_thread && !sync_starting) {
+	not_running:
+		if (!mddev->sync_thread) {
 			clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
 			if (test_and_clear_bit(MD_RECOVERY_RECOVER,
 					       &mddev->recovery))
 				if (mddev->sysfs_action)
 					sysfs_notify_dirent_safe(mddev->sysfs_action);
 		}
+	unlock:
+		wake_up(&mddev->sb_wait);
 		mddev_unlock(mddev);
 	}
 }
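
To make the race concrete, here is a small, self-contained userspace model
of the control flow described above.  It is only a sketch: names such as
recovery_needed, recovery_running, sync_thread and start_work_queued are
illustrative stand-ins for MD_RECOVERY_NEEDED, MD_RECOVERY_RUNNING,
mddev->sync_thread and the queued md_start_sync work; the real
md_check_recovery() does much more than this.

/* Simplified model of the race: the 'sync_starting' local is lost between
 * calls, so a second pass that runs before md_start_sync() has executed
 * sees sync_thread == NULL and wrongly clears RECOVERY_RUNNING. */
#include <stdbool.h>
#include <stdio.h>

static bool recovery_needed = true;  /* stands in for MD_RECOVERY_NEEDED  */
static bool recovery_running;        /* stands in for MD_RECOVERY_RUNNING */
static bool sync_thread;             /* stands in for mddev->sync_thread  */
static bool start_work_queued;       /* md_start_sync queued, not yet run */

static void check_recovery_old(void)
{
	bool sync_starting = false;  /* local: forgotten on the next call */

	if (recovery_needed) {
		recovery_needed = false;
		recovery_running = true;
		start_work_queued = true;   /* queue_work(md_misc_wq, ...) */
		sync_starting = true;
	}
	/* the old "unlock:" tail, run on every pass through the section */
	if (!sync_thread && !sync_starting)
		recovery_running = false;   /* wrong while a start is queued */
}

static void start_work_runs(void)           /* md_start_sync finally runs */
{
	if (start_work_queued)
		sync_thread = true;
}

int main(void)
{
	check_recovery_old();   /* 1st call: queues the sync thread start */
	check_recovery_old();   /* 2nd call: clears RUNNING too early     */
	start_work_runs();      /* the thread only comes to exist now     */

	if (sync_thread && !recovery_running)
		printf("3rd call would now try to reap the half-started "
		       "sync thread -> deadlock\n");
	return 0;
}

The patch above avoids this by clearing MD_RECOVERY_RUNNING only on the
not_running paths that have explicitly decided no sync will be started,
while the path that has just queued md_start_sync jumps straight to
unlock, leaving the flag alone until the thread actually exists.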