On Sat, Feb 17, 2024 at 11:47 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 2024/02/17 3:03, Song Liu wrote:
> > On Thu, Feb 8, 2024 at 4:49 PM Song Liu <song@xxxxxxxxxx> wrote:
> >>
> >> On Thu, Feb 8, 2024 at 12:44 AM Li Nan <linan666@xxxxxxxxxxxxxxx> wrote:
> >>>
> >>> On 2024/2/8 14:50, Song Liu wrote:
> >>>> On Wed, Feb 7, 2024 at 1:32 AM <linan666@xxxxxxxxxxxxxxx> wrote:
> >>>>>
> >>>>> From: Li Nan <linan122@xxxxxxxxxx>
> >>>>>
> >>>>> 'open_mutex' of gendisk is used to protect open/close of block devices.
> >>>>> But in bd_link_disk_holder(), it is used to protect the creation of the
> >>>>> symlink between the holding disk and the slave bdev, which introduces
> >>>>> some issues.
> >>>>>
> >>>>> When bd_link_disk_holder() is called, the driver is usually in the
> >>>>> process of initialization/modification and may suspend submitting io.
> >>>>> At this time, any io holding 'open_mutex', such as scanning partitions,
> >>>>> can cause deadlocks. For example, in raid:
> >>>>>
> >>>>> T1                              T2
> >>>>> bdev_open_by_dev
> >>>>>  lock open_mutex [1]
> >>>>>  ...
> >>>>>   efi_partition
> >>>>>   ...
> >>>>>    md_submit_bio
> >>>>>                                 md_ioctl mddev_suspend
> >>>>>                                  -> suspend all io
> >>>>>                                 md_add_new_disk
> >>>>>                                  bind_rdev_to_array
> >>>>>                                   bd_link_disk_holder
> >>>>>                                    try lock open_mutex [2]
> >>>>>    md_handle_request
> >>>>>     -> wait mddev_resume
> >>>>>
> >>>>> T1 scans partitions, T2 adds a new device to the raid. T1 waits for T2
> >>>>> to resume mddev, but T2 waits for open_mutex held by T1. Deadlock occurs.
> >>>>>
> >>>>> Fix it by introducing a local mutex 'holder_mutex' to replace 'open_mutex'.
> >>>>
> >>>> Is this to fix [1]? Do we need some Fixes and/or Closes tags?
> >>>>
> >>> No. This just fixes the issue in a different way from [2]; both [2] and
> >>> this patch can fix it. I am not sure about the root cause of [1] yet.
> >>>
> >>> [2] https://patchwork.kernel.org/project/linux-raid/list/?series=812045
> >>>
> >>>> Could you please add steps to reproduce this issue?
> >>>
> >>> We need to modify the kernel and add a sleep in md_submit_bio() and
> >>> md_ioctl() as below, and then:
> >>> 1. mdadm -CR /dev/md0 -l1 -n2 /dev/sd[bc]    # create a raid
> >>> 2. echo 1 > /sys/module/md_mod/parameters/error_inject    # enable sleep
> >>> 3. mdadm --add /dev/md0 /dev/sda    # add a disk to the raid
> >>> 4. submit the BLKRRPART ioctl to the raid within 10s.
> >>
> >> The analysis makes sense. I also hit the issue a couple of times without
> >> adding extra delays. But I am not sure whether this is the best fix (I
> >> didn't find real issues with it either).
> >
> > To be extra safe and future proof, we can do something like the following,
> > so that we do not suspend the array for ADD_NEW_DISK on not-running arrays.
> >
> > This appears to solve the problem reported in
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=218459
> >
> > Thanks,
> > Song
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 9e41a9aaba8b..395911d5f4d6 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -7570,10 +7570,11 @@ static inline bool md_ioctl_valid(unsigned int cmd)
> >      }
> >  }
> >
> > -static bool md_ioctl_need_suspend(unsigned int cmd)
> > +static bool md_ioctl_need_suspend(struct mddev *mddev, unsigned int cmd)
> >  {
> >      switch (cmd) {
> >      case ADD_NEW_DISK:
> > +            return mddev->pers != NULL;
>
> Did you already check that this problem is not related to 'active_io'
> being leaked for flush IO?
>
> I don't understand the problem reported yet.
> If 'mddev->pers' is not set yet, md_submit_bio() will return directly, and
> 'active_io' should not be grabbed in the first place.

AFAICT, this is not related to the active_io issue.

> md_run() is the only place to convert 'mddev->pers' from NULL to a real
> personality, and it's protected by 'reconfig_mutex'; however,
> md_ioctl_need_suspend() is called without 'reconfig_mutex', hence there
> is a race condition:
>
> md_ioctl_need_suspend                   array_state_store
> // mddev->pers is NULL, return false
>                                         mddev_lock
>                                         do_md_run
>                                          mddev->pers = xxx
>                                         mddev_unlock
>
> // mddev_suspend is not called
> mddev_lock
> md_add_new_disk
>  if (mddev->pers)
>   md_import_device
>   bind_rdev_to_array
>    add_bound_rdev
>     mddev->pers->hot_add_disk
>     -> hot add disk without suspending

Yeah, this race condition exists. We probably need some trick with suspend
and lock here.

Thanks,
Song
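
P.S. To make "some trick with suspend and lock" a bit more concrete, below is
a rough, untested sketch of what I have in mind. md_ioctl_need_suspend() here
is the mddev-taking variant from the diff above, and the helper name
md_ioctl_lock_maybe_suspend() is made up. The idea: take the suspend decision
without the lock, then re-check it under 'reconfig_mutex' and retry if
do_md_run() (or a stop) changed 'mddev->pers' in between.

static int md_ioctl_lock_maybe_suspend(struct mddev *mddev, unsigned int cmd)
{
	bool suspend;
	int err;

again:
	/* Racy first guess, taken without 'reconfig_mutex'. */
	suspend = md_ioctl_need_suspend(mddev, cmd);
	err = suspend ? mddev_suspend_and_lock(mddev) : mddev_lock(mddev);
	if (err)
		return err;

	/*
	 * 'mddev->pers' should only change with 'reconfig_mutex' held, so
	 * this re-check is stable; retry if the first guess went stale.
	 */
	if (md_ioctl_need_suspend(mddev, cmd) != suspend) {
		if (suspend)
			mddev_unlock_and_resume(mddev);
		else
			mddev_unlock(mddev);
		goto again;
	}

	return 0;
}

md_ioctl() would call this instead of picking between mddev_suspend_and_lock()
and mddev_lock() directly, and on the way out it can re-evaluate
md_ioctl_need_suspend() under the lock to choose between
mddev_unlock_and_resume() and mddev_unlock().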