While adding an additional drive to a raid6 array, the reshape stalled
at about 13% complete and all I/O operations on the array hung,
creating an effective soft lock. The kernel reported a hung task in the
mdXX_reshape thread, and I had to use magic SysRq to recover because
systemd hung as well.

I first suspected an issue with one of the underlying block devices,
and as a precaution I recovered the data in read-only mode to a new
array. The problem turned out to be in the RAID layer, however, as I
was able to recreate the issue from a superblock dump in sparse files.
After poking around some, I discovered that I had somehow propagated
the bad block list to several devices in the array such that a few
blocks were unreadable. The bad read was reported correctly in
userspace during recovery, but it wasn't obvious at the time that it
came from the bad block list metadata; instead it confirmed my bias
toward suspecting a hardware issue.

I was able to reproduce the issue with a minimal test case using small
loopback devices, and I put a script for this in a GitHub repository:

https://github.com/dougvj/md_badblock_reshape_stall_test

This patch handles bad reads during a reshape by introducing a
handle_failed_reshape function, in a similar manner to
handle_failed_sync. The function aborts the current stripe by clearing
STRIPE_EXPANDING and STRIPE_EXPAND_READY, sets the MD_RECOVERY_FROZEN
bit, reverts the head of the reshape to the safe position, and reports
the situation in dmesg.

Signed-off-by: Doug V Johnson <dougvj@xxxxxxxxxx>
---
 drivers/md/raid5.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 5c79429acc64..bc0b0c2540f0 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3738,6 +3738,27 @@ handle_failed_sync(struct r5conf *conf, struct stripe_head *sh,
 	md_done_sync(conf->mddev, RAID5_STRIPE_SECTORS(conf), !abort);
 }
 
+static void
+handle_failed_reshape(struct r5conf *conf, struct stripe_head *sh,
+		      struct stripe_head_state *s)
+{
+	/* Abort the current stripe */
+	clear_bit(STRIPE_EXPANDING, &sh->state);
+	clear_bit(STRIPE_EXPAND_READY, &sh->state);
+	pr_warn_ratelimited("md/raid:%s: read error during reshape at %llu, cannot progress\n",
+			    mdname(conf->mddev),
+			    (unsigned long long)sh->sector);
+	/* Freeze the reshape */
+	set_bit(MD_RECOVERY_FROZEN, &conf->mddev->recovery);
+	/* Revert progress to the safe position */
+	spin_lock_irq(&conf->device_lock);
+	conf->reshape_progress = conf->reshape_safe;
+	spin_unlock_irq(&conf->device_lock);
+	/* Report the failed sync to md */
+	md_done_sync(conf->mddev, 0, 0);
+	wake_up(&conf->wait_for_reshape);
+}
+
 static int want_replace(struct stripe_head *sh, int disk_idx)
 {
 	struct md_rdev *rdev;
@@ -4987,6 +5008,8 @@ static void handle_stripe(struct stripe_head *sh)
 			handle_failed_stripe(conf, sh, &s, disks);
 		if (s.syncing + s.replacing)
 			handle_failed_sync(conf, sh, &s);
+		if (test_bit(STRIPE_EXPANDING, &sh->state))
+			handle_failed_reshape(conf, sh, &s);
 	}
 
 	/* Now we check to see if any write operations have recently
-- 
2.48.1
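
P.S. For reference, below is a minimal sketch of the loopback
reproduction. It is an untested approximation of the script in the
repository above; the device sizes, sector numbers, loop device names,
and array geometry are illustrative assumptions, and it assumes the
per-rdev bad_blocks sysfs attribute accepts "sector length" writes on
a live array:

#!/bin/sh
# Sketch only: approximates the reproduction script linked above.
set -e

# Back a four-device raid6 array with small sparse files, plus a fifth
# image for the grow operation
for i in 0 1 2 3 4; do
	truncate -s 64M "/tmp/bb$i.img"
	losetup "/dev/loop$i" "/tmp/bb$i.img"
done
mdadm --create /dev/md99 --run --level=6 --raid-devices=4 \
	/dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
mdadm --wait /dev/md99 || true

# Seed the same bad block range on three members ("sector length"
# format); raid6 can only reconstruct around max_degraded=2 failures,
# so reads of this stripe cannot be satisfied
for d in loop0 loop1 loop2; do
	echo "8000 8" > "/sys/block/md99/md/dev-$d/bad_blocks"
done

# Grow onto the fifth device; without the fix the reshape is expected
# to stall once it reaches the unreadable region
mdadm --add /dev/md99 /dev/loop4
mdadm --grow /dev/md99 --raid-devices=5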
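
A note on the design choice: because the function only sets
MD_RECOVERY_FROZEN and rewinds reshape_progress to reshape_safe rather
than failing the array, the array stays usable and the reshape can in
principle be retried once the unreadable blocks are dealt with. My
assumption is that the standard sync_action interface is then enough
to unfreeze and restart from the reverted safe position, e.g.:

echo idle > /sys/block/md99/md/sync_action

(md99 being the hypothetical array from the sketch above).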