On Sat, Feb 17, 2024 at 11:47 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 2024/02/17 3:03, Song Liu wrote:
> > On Thu, Feb 8, 2024 at 4:49 PM Song Liu <song@xxxxxxxxxx> wrote:
> >>
> >> On Thu, Feb 8, 2024 at 12:44 AM Li Nan <linan666@xxxxxxxxxxxxxxx> wrote:
> >>>
> >>> On 2024/2/8 14:50, Song Liu wrote:
> >>>> On Wed, Feb 7, 2024 at 1:32 AM <linan666@xxxxxxxxxxxxxxx> wrote:
> >>>>>
> >>>>> From: Li Nan <linan122@xxxxxxxxxx>
> >>>>>
> >>>>> 'open_mutex' of gendisk is used to protect open/close of block devices.
> >>>>> But in bd_link_disk_holder(), it is used to protect the creation of the
> >>>>> symlink between the holding disk and the slave bdev, which introduces
> >>>>> some issues.
> >>>>>
> >>>>> When bd_link_disk_holder() is called, the driver is usually in the
> >>>>> process of initialization/modification and may suspend submitting io.
> >>>>> At this time, any io holding 'open_mutex', such as scanning partitions,
> >>>>> can cause deadlocks. For example, in raid:
> >>>>>
> >>>>> T1                              T2
> >>>>> bdev_open_by_dev
> >>>>>  lock open_mutex [1]
> >>>>>  ...
> >>>>>   efi_partition
> >>>>>   ...
> >>>>>    md_submit_bio
> >>>>>                                 md_ioctl mddev_suspend
> >>>>>                                  -> suspend all io
> >>>>>                                 md_add_new_disk
> >>>>>                                  bind_rdev_to_array
> >>>>>                                   bd_link_disk_holder
> >>>>>                                    try lock open_mutex [2]
> >>>>>    md_handle_request
> >>>>>     -> wait mddev_resume
> >>>>>
> >>>>> T1 scans partitions, T2 adds a new device to the raid. T1 waits for T2
> >>>>> to resume mddev, but T2 waits for open_mutex held by T1. Deadlock occurs.
> >>>>>
> >>>>> Fix it by introducing a local mutex 'holder_mutex' to replace 'open_mutex'.
> >>>>
> >>>> Is this to fix [1]? Do we need some Fixes and/or Closes tags?
> >>>>
> >>> No. This just fixes the issue in a different way from [2]; both [2] and
> >>> this patch can fix it. I am not sure about the root cause of [1] yet.
> >>>
> >>> [2] https://patchwork.kernel.org/project/linux-raid/list/?series=812045
> >>>
> >>>> Could you please add steps to reproduce this issue?
> >>>
> >>> We need to modify the kernel and add a sleep in md_submit_bio() and
> >>> md_ioctl() as below, and then:
> >>> 1. mdadm -CR /dev/md0 -l1 -n2 /dev/sd[bc]    # create a raid
> >>> 2. echo 1 > /sys/module/md_mod/parameters/error_inject    # enable sleep
> >>> 3. mdadm --add /dev/md0 /dev/sda    # add a disk to the raid
> >>> 4. submit the BLKRRPART ioctl to the raid within 10s.
> >>
> >> The analysis makes sense. I also hit the issue a couple of times without
> >> adding extra delays. But I am not sure whether this is the best fix (I
> >> didn't find real issues with it either).
> >
> > To be extra safe and future proof, we can do something like the following,
> > so that we do not suspend the array for ADD_NEW_DISK on not-running arrays.
> >
> > This appears to solve the problem reported in
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=218459
> >
> > Thanks,
> > Song
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 9e41a9aaba8b..395911d5f4d6 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -7570,10 +7570,11 @@ static inline bool md_ioctl_valid(unsigned int cmd)
> >      }
> >  }
> >
> > -static bool md_ioctl_need_suspend(unsigned int cmd)
> > +static bool md_ioctl_need_suspend(struct mddev *mddev, unsigned int cmd)
> >  {
> >      switch (cmd) {
> >      case ADD_NEW_DISK:
> > +            return mddev->pers != NULL;
>
> Did you already check that this problem is not related to 'active_io'
> being leaked for flush IO?
>
> I don't understand the problem reported yet.
> If 'mddev->pers' is not set yet, md_submit_bio() will return directly, and
> 'active_io' should not be grabbed in the first place.

AFAICT, this is not related to the active_io issue.

> md_run() is the only place to convert 'mddev->pers' from NULL to a real
> personality, and it's protected by 'reconfig_mutex'; however,
> md_ioctl_need_suspend() is called without 'reconfig_mutex', hence there
> is a race condition:
>
> md_ioctl_need_suspend                   array_state_store
> // mddev->pers is NULL, return false
>                                         mddev_lock
>                                         do_md_run
>                                          mddev->pers = xxx
>                                         mddev_unlock
>
> // mddev_suspend is not called
> mddev_lock
> md_add_new_disk
>  if (mddev->pers)
>   md_import_device
>   bind_rdev_to_array
>    add_bound_rdev
>     mddev->pers->hot_add_disk
>     -> hot add disk without suspending

Yeah, this race condition exists. We probably need some trick with suspend
and lock here.

Thanks,
Song
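
P.S. To make "some trick with suspend and lock" a bit more concrete, below is
a rough, untested sketch of what I have in mind. md_ioctl_need_suspend() here
is the mddev-taking variant from the diff above, and the helper name
md_ioctl_lock_maybe_suspend() is made up. The idea: take the suspend decision
without the lock, then re-check it under 'reconfig_mutex' and retry if
do_md_run() (or a stop) changed 'mddev->pers' in between.

static int md_ioctl_lock_maybe_suspend(struct mddev *mddev, unsigned int cmd)
{
	bool suspend;
	int err;

again:
	/* Racy first guess, taken without 'reconfig_mutex'. */
	suspend = md_ioctl_need_suspend(mddev, cmd);
	err = suspend ? mddev_suspend_and_lock(mddev) : mddev_lock(mddev);
	if (err)
		return err;

	/*
	 * 'mddev->pers' should only change with 'reconfig_mutex' held, so
	 * this re-check is stable; retry if the first guess went stale.
	 */
	if (md_ioctl_need_suspend(mddev, cmd) != suspend) {
		if (suspend)
			mddev_unlock_and_resume(mddev);
		else
			mddev_unlock(mddev);
		goto again;
	}

	return 0;
}

md_ioctl() would call this instead of picking between mddev_suspend_and_lock()
and mddev_lock() directly, and on the way out it can re-evaluate
md_ioctl_need_suspend() under the lock to choose between
mddev_unlock_and_resume() and mddev_unlock().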