On Mon, May 22 2017, Nix wrote: > [linux-bcache removed, not a bcache bug] > > On 22 May 2017, NeilBrown said this: > >> Congratulations. You have found a bug that dates from 2011. > > Oh so not that old then! (My personal record, in a previous job, was a > bug in allegedly critical functionality that entirely stopped it working > and had been broken for around fourteen years. Like hell it was > critical.) > >> Commit: 68866e425be2 ("MD: no sync IO while suspended") >> >> (I think). >> >> A write request gets to raid5_make_request() before mddev_suspend() sets >> mddev->suspended. >> raid5_make_request() needs md_write_start() to mark the array "dirty" >> which it asks the md thread to do, but before the thread gets to the >> task, mddev->suspend has been set, so md_check_recovery() doesn't update >> the superblock, so md_write_start() doesn't proceed, so the request >> never completes, so mddev_suspend() blocks indefinitely. > > Ooof. Looks like it's fsync()ing something -- not sure what, though. > The previous time it was a metadata update... > > Hm. I note that ->events is meant to be for things like device additions > and array starts. RAID5 -> RAID6 reshaping appears to push this up a > lot: > > Events : 1475948 > > (it was at around 4000 before this -- itself rather mystifying, given > that I've only rebooted this machine 27 times, and the array's probably > been assembled less than 20 times, and was created with the same number > of devices it has now.) Originally, Events was for any metadata change, including clean <-> dirty transitions. This kept waking up spares though, so I change to change by -1 when changing from dirty to clean, and not bother with telling the spares. That was a little problematic too, so no it always increments except f or dirty->clean transitions where there are spares. For a reshape, the reshape is checkpointed quite frequently, and each checkpoint increases the event counter. > > The other array has only 92 events (it was at 30-something before I > tried this reshape). > >> I think that md_check_recovery() need to test for mddev->suspended >> somewhere else. >> It needs to allow superblock updates, and probably needs to reap the >> recovery thread, but must not start a new recovery thread. > > My question would be, is all the intricate stuff md_check_recovery() is > doing valid on an mddev that's not only suspended but detached? Because > we detach right after the suspension in level_store()... mddev_detach() kills the thread. After it completes, md_check_recovery() cannot possibly run. However I now see that my patch was insufficient. mddev_suspend() is called while mddev_lock() is held, and that prevents md_check_recovery() from writing the superblock, so it can still deadlock. This will require more careful analysis. > > (btw, you probably want to remove the comment above enum array_state > that says that "suspended" is "not supported yet", too.) Yeah.... though it isn't supported by all personalities I think. > >> Probably something like this: >> >> diff --git a/drivers/md/md.c b/drivers/md/md.c >> index f6ae1d67bcd0..dbca31be22a1 100644 >> --- a/drivers/md/md.c >> +++ b/drivers/md/md.c >> @@ -8364,8 +8364,6 @@ static void md_start_sync(struct work_struct *ws) >> */ >> void md_check_recovery(struct mddev *mddev) >> { >> - if (mddev->suspended) >> - return; >> >> if (mddev->bitmap) >> bitmap_daemon_work(mddev); >> @@ -8484,6 +8482,7 @@ void md_check_recovery(struct mddev *mddev) >> clear_bit(MD_RECOVERY_DONE, &mddev->recovery); >> >> if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) || >> + mddev->suspended || >> test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) >> goto not_running; >> /* no recovery is running. >> >> though it's late so don't trust anything I write. > > This may end up clearing MD_RECOVERY_INTR and MD_RECOVERY_DONE, but I > guess in this context it's safe to do so... looks OK otherwise to the > totally ignorant fool reading this patch! > > Hm, being picky now, md_check_recovery() might be slightly easier to > read if that governing mutex_trylock() conditional was inverted: > > if (mddev_trylock(mddev)) > return; > > Then you could de-indent most of the function by one indentation > level... You could send a patch if you liked.... but as it looks like I have more work to do there, I'll probably do it. > >> If you try again it will almost certainly succeed. I suspect this is a >> hard race to hit - well done!!! > > I'll give it a try -- I hit it twice in succession, once with a > --backup-file, once without. Since mdadm does not warn about the lack of > a --backup-file, I guess the statement in the manual that it is > essential to provide one when changing RAID levels is untrue: I suspect > that it's necessary *if* you're not increasing the number of disks at > the same time, but since I'm growing into a spare, adding a > --backup-file only slows it down? > > I might run a backup first though. :) Always a good idea, if you can. Thanks, NeilBrown
Attachment:
signature.asc
Description: PGP signature