On Wed, Feb 21, 2018 at 09:26:55AM +0800, bingjingc wrote: > There is a potential deadlock if mount/umount happens when > raid5_finish_reshape() tries to grow the size of emulated disk. > > How the deadlock happens? > 1) The raid5 resync thread finished reshape (expanding array). > 2) The mount or umount thread holds VFS sb->s_umount lock and tries to > write through critical data into raid5 emulated block device. So it > waits for raid5 kernel thread handling stripes in order to finish it > I/Os. > 3) In the routine of raid5 kernel thread, md_check_recovery() will be > called first in order to reap the raid5 resync thread. That is, > raid5_finish_reshape() will be called. In this function, it will try > to update conf and call VFS revalidate_disk() to grow the raid5 > emulated block device. It will try to acquire VFS sb->s_umount lock. > The raid5 kernel thread cannot continue, so no one can handle mount/ > umount I/Os (stripes). Once the write-through I/Os cannot be finished, > mount/umount will not release sb->s_umount lock. The deadlock happens. > > The raid5 kernel thread is an emulated block device. It is responible to > handle I/Os (stripes) from upper layers. The emulated block device > should not request any I/Os on itself. That is, it should not call VFS > layer functions. (If it did, it will try to acquire VFS locks to > guarantee the I/Os sequence.) So we have the resync thread to send > resync I/O requests and to wait for the results. > > For solving this potential deadlock, we can put the size growth of the > emulated block device as the final step of reshape thread. > > 2017/12/29: > Thanks to Guoqing Jiang <gqjiang@xxxxxxxx>, > we confirmed that there is the same deadlock issue in raid10. It's > reproducible and can be fixed by this patch. For raid10.c, we can remove > the similar code to prevent deadlock as well since they has been called > before. > > Reported-by: Alex Wu <alexwu@xxxxxxxxxxxx> > Reviewed-by: Alex Wu <alexwu@xxxxxxxxxxxx> > Reviewed-by: Chung-Chiang Cheng <cccheng@xxxxxxxxxxxx> > Signed-off-by: BingJing Chang <bingjingc@xxxxxxxxxxxx> git am still complains the patch is corrupted. probably fix your email client? > --- > drivers/md/md.c | 13 +++++++++++++ > drivers/md/raid10.c | 8 +------- > drivers/md/raid5.c | 8 +------- > 3 files changed, 15 insertions(+), 14 deletions(-) > > diff --git a/drivers/md/md.c b/drivers/md/md.c > index 4e4dee0..52092de 100644 > --- a/drivers/md/md.c > +++ b/drivers/md/md.c > @@ -8528,6 +8528,19 @@ void md_do_sync(struct md_thread *thread) > set_mask_bits(&mddev->sb_flags, 0, > BIT(MD_SB_CHANGE_PENDING) | BIT(MD_SB_CHANGE_DEVS)); > > + if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) && > + !test_bit(MD_RECOVERY_INTR, &mddev->recovery) && > + mddev->delta_disks > 0 && > + mddev->pers->finish_reshape && > + mddev->pers->size && > + mddev->queue) { > + mddev_lock_nointr(mddev); > + md_set_array_sectors(mddev, mddev->pers->size(mddev, 0, 0)); > + mddev_unlock(mddev); > + set_capacity(mddev->gendisk, mddev->array_sectors); > + revalidate_disk(mddev->gendisk); > + } > + > spin_lock(&mddev->lock); > if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) { > /* We completed so min/max setting can be forgotten if used. */ > diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c > index c131835..8a46473 100644 > --- a/drivers/md/raid10.c > +++ b/drivers/md/raid10.c > @@ -4818,17 +4818,11 @@ static void raid10_finish_reshape(struct mddev > *mddev) > return; > > if (mddev->delta_disks > 0) { > - sector_t size = raid10_size(mddev, 0, 0); > - md_set_array_sectors(mddev, size); > if (mddev->recovery_cp > mddev->resync_max_sectors) { > mddev->recovery_cp = mddev->resync_max_sectors; > set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > } > - mddev->resync_max_sectors = size; > - if (mddev->queue) { > - set_capacity(mddev->gendisk, mddev->array_sectors); > - revalidate_disk(mddev->gendisk); > - } > + mddev->resync_max_sectors = mddev->array_sectors; > } else { > int d; > rcu_read_lock(); > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c > index 98ce427..0f68faa 100644 > --- a/drivers/md/raid5.c > +++ b/drivers/md/raid5.c > @@ -8001,13 +8001,7 @@ static void raid5_finish_reshape(struct mddev *mddev) > > if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) { > > - if (mddev->delta_disks > 0) { > - md_set_array_sectors(mddev, raid5_size(mddev, 0, 0)); > - if (mddev->queue) { > - set_capacity(mddev->gendisk, mddev->array_sectors); > - revalidate_disk(mddev->gendisk); > - } > - } else { > + if (mddev->delta_disks <= 0) { > int d; > spin_lock_irq(&conf->device_lock); > mddev->degraded = raid5_calc_degraded(conf); > -- > 2.7.4 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html