On Mon, May 22 2017, Nix wrote: > [I think bcache is blameless here, but there may be some interaction, so > I've Cc:ed both lists. I'm fairly sure the problem is mddev_suspend() > being called from level_store() and scheduling after suspension has > happened, but I'm not sure what the fix might be.] > > So I have a box with three RAID arrays on the same physical > rotating-rust devices: > > - a RAID-0 /dev/md/transient atop /dev/sd*2 > - a RAID-5 /dev/md/fast, a bcache backing device (with disabled cache > for now) atop /dev/sd*3 > - a RAID-5 /dev/md/slow atop /dev/sd*4. > > The rootfs and everything the system needs to run, including mdadm, is > on /dev/md/fast. fs-wise, everything is using xfs except for > /dev/md/transient, which houses an ext4 (unjournalled, for speed over > robustness). > > I just got a new disk and grew /dev/md/transient to use it without > incident, then grew /dev/md/slow into a RAID-6: > > mdadm --add /dev/md/slow /dev/sdf4 > mdadm --grow /dev/md/slow --level=6 --backup-file=/.transient/backup-slow > > This worked fine. > > But the *exact same command* applied to /dev/md/fast hangs: > > mdadm --add /dev/md/fast /dev/sdf3 > mdadm --grow /dev/md/fast --level=6 --backup-file=/.transient/backup-fast > [hang here] > > At this point, all disk I/O to /dev/md/fast appears suspended: because / > is on it, this leaves me rather unable to do anything but reboot. > Thankfully the deadlock seemingly happens before anything crucial is > done, and the machine reboots and brigns up /dev/md/fast as an > unreshaped raid5 array with a spare. > > Is this workflow impossible? Do you have to reshape the device the > rootfs is on from early userspace or something like that? If so... this > leaves a crucial mdadm running across the early userspace transition: > what do we do about the fs the backup file is on (since the transition > usually deletes everything)? Do we even *need* a backup file for a > reshape that's growing onto a spare, as the manpage suggests we do? Do > we simply have to leave the machine unavailable for days while the > reshape completes? (I hope not.) > > That hangcheck output I was able to capture by eye suggests a deadlock > in block I/O during a brief instant of suspended I/O to the device. The > first hang we see comes from bcache, but since it's in "none" mode this > is pretty clearly just a writeback (note the md_make_request at the top > there): > > [ 248.877539] Workqueue: bcache cached_dev_nodata > [ 248.888590] Call Trace: > [ 248.899337] __schedule+0x290/0x810 > [ 248.909745] schedule+0x36/0x80 > [ 248.920113] md_make_request+0x9c/0x220 > [ 248.930229] ? wake_atomic_t_function+0x60/0x60 > [ 248.940063] generic_make_request+0xfc/0x260 > [ 248.949803] cached_dev_nodata+0x28/0x80 > [ 248.959393] ? cached_dev_nodata+0x28/0x80 > [ 248.968998] process_one_work+0x1e3/0x440 > [ 248.978414] worker_thread+0x48/0x4d0 > [ 248.987596] kthread+0x108/0x140 > [ 248.996747] ? process_one_work+0x440/0x440 > [ 249.005965] ? __kthread_create_on_node+0x150/0x150 > [ 249.015264] ret_from_fork+0x29/0x40 > [ 249.024544] INFO: task kworker/u40:5:245 blocked for more than 120 seconds. > > This is fairly obviously > > /* If it's a flush, we send the flush to the backing device too */ > closure_bio_submit(bio, cl); > > in drivers/md/bcache/request.c, which is fine. > > After that we start to see xfs writeback hangs, all hanging on > md_make_request() or I/O scheduling: > > [ 249.062946] Workqueue: writeback wb_workfn (flush-252:5) > [ 249.072733] Call Trace: > [ 249.082390] __schedule+0x290/0x810 > [ 249.092053] schedule+0x36/0x80 > [ 249.101568] md_make_request+0x9c/0x220 > [ 249.110956] ? wake_atomic_t_function+0x60/0x60 > [ 249.120202] generic_make_request+0xfc/0x260 > [ 249.129219] submit_bio+0x64/0x120 > [ 249.137968] ? submit_bio+0x64/0x120 > [ 249.146840] ? xfs_setfilesize_trans_alloc.isra.4+0x2f/0x70 > [ 249.155822] xfs_submit_ioend+0x70/0x190 > [ 249.164805] xfs_vm_writepages+0x65/0x70 > [ 249.173620] do_writepages+0x1e/0x30 > [ 249.182239] __writeback_single_inode+0x45/0x320 > [ 249.190838] writeback_sb_inodes+0x191/0x490 > [ 249.199459] __writeback_inodes_wb+0x92/0xc0 > [ 249.207988] wb_writeback+0x243/0x2d0 > [ 249.216395] wb_workfn+0x299/0x360 > [ 249.224851] ? wb_workfn+0x299/0x360 > [ 249.233393] process_one_work+0x1e3/0x440 > [ 249.242049] worker_thread+0x48/0x4d0 > [ 249.250404] kthread+0x108/0x140 > [ 249.258408] ? process_one_work+0x440/0x440 > [ 249.266148] ? __kthread_create_on_node+0x150/0x150 > [ 249.273652] ret_from_fork+0x29/0x40 > > [ 249.310928] Workqueue: xfs-cil/dm-5 xlog_cil_push_work > [ 249.318453] Call Trace: > [ 249.325929] __schedule+0x290/0x810 > [ 249.333456] schedule+0x36/0x80 > [ 249.340940] schedule_timeout+0x207/0x360 > [ 249.348461] ? cached_dev_make_request+0x2bc/0xd00 > [ 249.356018] io_schedule_timeout+0x1e/0x50 > [ 249.363533] ? io_schedule_timeout+0x1e/0x50 > [ 249.371078] wait_for_common_io.constprop.0+0x92/0x110 > [ 249.378711] ? try_to_wake_up+0x400/0x400 > [ 249.386339] wait_for_completion_io+0x18/0x20 > [ 249.394026] submit_bio_wait+0x59/0x70 > [ 249.401685] blkdev_issue_flush+0x5c/0x90 > [ 249.409321] xfs_blkdev_issue_flush+0x19/0x20 > [ 249.416951] xlog_sync+0x2a1/0x3b0 > [ 249.424439] xlog_state_release_iclog+0x6d/0xd0 > [ 249.431857] xlog_write+0x64a/0x6e0 > [ 249.439092] xlog_cil_push+0x230/0x450 > [ 249.446169] xlog_cil_push_work+0x15/0x20 > [ 249.453128] process_one_work+0x1e3/0x440 > [ 249.460030] worker_thread+0x221/0x4d0 > [ 249.466942] kthread+0x108/0x140 > [ 249.473843] ? process_one_work+0x440/0x440 > [ 249.480762] ? __kthread_create_on_node+0x150/0x150 > [ 249.487743] ret_from_fork+0x29/0x40 > > [ 249.524000] Workqueue: xfs-log/dm-5 xfs_log_worker > [ 249.531490] Call Trace: > [ 249.538923] __schedule+0x290/0x810 > [ 249.546366] schedule+0x36/0x80 > [ 249.553756] schedule_timeout+0x207/0x360 > [ 249.561132] ? check_preempt_curr+0x79/0x90 > [ 249.568514] ? ttwu_do_wakeup.isra.15+0x1e/0x160 > [ 249.575957] wait_for_common+0xaa/0x160 > [ 249.583432] ? wait_for_common+0xaa/0x160 > [ 249.590943] ? try_to_wake_up+0x400/0x400 > [ 249.598447] wait_for_completion+0x1d/0x20 > [ 249.605984] flush_work+0xfb/0x1b0 > [ 249.613509] ? flush_workqueue_prep_pwqs+0x1a0/0x1a0 > [ 249.621093] xlog_cil_force_lsn+0x75/0x1d0 > [ 249.628640] ? dequeue_task_fair+0x757/0x1920 > [ 249.636169] _xfs_log_force+0x76/0x280 > [ 249.643524] ? xfs_log_worker+0x32/0x100 > [ 249.650715] xfs_log_force+0x2c/0x80 > [ 249.657719] xfs_log_worker+0x32/0x100 > [ 249.664653] process_one_work+0x1e3/0x440 > [ 249.671498] worker_thread+0x48/0x4d0 > [ 249.678354] kthread+0x108/0x140 > [ 249.685218] ? process_one_work+0x440/0x440 > [ 249.692123] ? __kthread_create_on_node+0x150/0x150 > [ 249.699057] ret_from_fork+0x29/0x40 > > [ 249.727654] syslog-ng D 0 1658 1657 0x00000000 > [ 249.735066] Call Trace: > [ 249.742333] __schedule+0x290/0x810 > [ 249.749624] schedule+0x36/0x80 > [ 249.756915] md_write_start+0x8f/0x160 > [ 249.764185] ? wake_atomic_t_function+0x60/0x60 > [ 249.771469] raid5_make_request+0x7a/0xd40 > [ 249.778762] ? bch_data_insert_start+0x4da/0x5e0 > [ 249.786143] ? wake_atomic_t_function+0x60/0x60 > [ 249.793571] md_make_request+0xe3/0x220 > [ 249.800976] generic_make_request+0xfc/0x260 > [ 249.808383] submit_bio+0x64/0x120 > [ 249.815838] ? submit_bio+0x64/0x120 > [ 249.823253] ? xfs_setfilesize_trans_alloc.isra.4+0x2f/0x70 > [ 249.830757] xfs_submit_ioend+0x70/0x190 > [ 249.838269] xfs_vm_writepages+0x65/0x70 > [ 249.845668] do_writepages+0x1e/0x30 > [ 249.852899] __filemap_fdatawrite_range+0x71/0x90 > [ 249.860023] filemap_write_and_wait_range+0x2a/0x70 > [ 249.867061] xfs_file_fsync+0x54/0x190 > [ 249.874007] vfs_fsync_range+0x49/0xa0 > [ 249.880934] ? __fdget_pos+0x17/0x50 > [ 249.887793] ? vfs_writev+0x3c/0x50 > [ 249.894700] do_fsync+0x3d/0x70 > [ 249.901565] SyS_fsync+0x10/0x20 > > I think we are stuck inside the > > if (mddev->suspended) { > > block in md_make_request() in all these cases. > > ... and finally it becomes clear that mdadm is blocked and we are in > real trouble: > > [ 249.965701] INFO: task mdadm:2712 blocked for more than 120 seconds. > [ 249.973003] Not tainted 4.11.2-00005-g87b2117c3309-dirty #2 > [ 249.980381] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 249.987999] mdadm D 0 2712 2020 0x00000000 > [ 249.995681] Call Trace: > [ 250.003273] __schedule+0x290/0x810 > [ 250.010885] schedule+0x36/0x80 > [ 250.018503] mddev_suspend+0xb3/0xe0 > [ 250.026158] ? wake_atomic_t_function+0x60/0x60 > [ 250.033856] level_store+0x1a7/0x6c0 > [ 250.041478] ? md_ioctl+0xb7/0x1c10 > [ 250.049075] ? putname+0x53/0x60 > [ 250.056554] md_attr_store+0x83/0xc0 > [ 250.063922] sysfs_kf_write+0x37/0x40 > [ 250.071082] kernfs_fop_write+0x110/0x1a0 > [ 250.078127] __vfs_write+0x28/0x120 > [ 250.085166] ? kernfs_iop_get_link+0x172/0x1e0 > [ 250.092199] ? __alloc_fd+0x3f/0x170 > [ 250.099271] vfs_write+0xb6/0x1d0 > [ 250.106307] SyS_write+0x46/0xb0 > [ 250.113227] entry_SYSCALL_64_fastpath+0x13/0x94 > > I'm not sure what this is waiting on. It could be *any* of > > synchronize_rcu(); > wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0); > mddev->pers->quiesce(mddev, 1); It says 'schedule()' so it is in wait_event(). > > since they all can probably schedule and if quiescing involves talking > to the device, we have already suspended it, so we're doomed. > > I'm not sure how we can quiesce before we suspend and prevent things > from trying to issue more crucial requests (paging in mdadm?) before we > resume again, but I'm fairly sure that *something* different needs to be > done. Congratulations. You have found a bug that dates from 2011. Commit: 68866e425be2 ("MD: no sync IO while suspended") (I think). A write request gets to raid5_make_request() before mddev_suspend() sets mddev->suspended. raid5_make_request() needs md_write_start() to mark the array "dirty" which it asks the md thread to do, but before the thread gets to the task, mddev->suspend has been set, so md_check_recovery() doesn't update the superblock, so md_write_start() doesn't proceed, so the request never completes, so mddev_suspend() blocks indefinitely. I think that md_check_recovery() need to test for mddev->suspended somewhere else. It needs to allow superblock updates, and probably needs to reap the recovery thread, but must not start a new recovery thread. Probably something like this: diff --git a/drivers/md/md.c b/drivers/md/md.c index f6ae1d67bcd0..dbca31be22a1 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -8364,8 +8364,6 @@ static void md_start_sync(struct work_struct *ws) */ void md_check_recovery(struct mddev *mddev) { - if (mddev->suspended) - return; if (mddev->bitmap) bitmap_daemon_work(mddev); @@ -8484,6 +8482,7 @@ void md_check_recovery(struct mddev *mddev) clear_bit(MD_RECOVERY_DONE, &mddev->recovery); if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) || + mddev->suspended || test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) goto not_running; /* no recovery is running. though it's late so don't trust anything I write. If you try again it will almost certainly succeed. I suspect this is a hard race to hit - well done!!! NeilBrown
Attachment:
signature.asc
Description: PGP signature