Re: [PATCH v5 01/14] md: don't ignore suspended array in md_check_recovery()

Xiao Ni <xni@xxxxxxxxxx> · Sun, 18 Feb 2024 13:07:34 +0800

On Sun, Feb 18, 2024 at 11:24 AM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> 在 2024/02/18 11:15, Xiao Ni 写道:
> > On Sun, Feb 18, 2024 at 10:34 AM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
> >>
> >> Hi,
> >>
> >> 在 2024/02/18 10:27, Xiao Ni 写道:
> >>> On Sun, Feb 18, 2024 at 9:46 AM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> 在 2024/02/18 9:33, Xiao Ni 写道:
> >>>>> The deadlock problem mentioned in this patch should not be right?
> >>>>
> >>>> No, I think it's right. Looks like you are expecting other problems,
> >>>> like mentioned in patch 6, to be fixed by this patch.
> >>>
> >>> Hi Kuai
> >>>
> >>> Could you explain why step1 and step2 from this comment can happen
> >>> simultaneously? From the log, the process should be
> >>> The process is :
> >>> dev_remove->dm_destroy->__dm_destroy->dm_table_postsuspend_targets(raid_postsuspend)
> >>> -> dm_table_destroy(raid_dtr).
> >>> After suspending the array, it calls raid_dtr. So these two functions
> >>> can't happen simultaneously.
> >>
> >> You're removing the target directly, however, dm can suspend the disk
> >> directly, you can simplily:
> >>
> >> 1) dmsetup suspend xxx
> >> 2) dmsetup remove xxx
> >
> > For dm-raid, the design of suspend stops sync thread first and then it
> > calls mddev_suspend to suspend array. So I'm curious why the sync
> > thread can still exit when array is suspended. I know the reason now.
> > Because before f52f5c71f (md: fix stopping sync thread), the process
> > is raid_postsuspend->md_stop_writes->__md_stop_writes
> > (__md_stop_writes sets MD_RECOVERY_FROZEN). In patch f52f5c71f, it
> > doesn't set MD_RECOVERY_FROZEN in __md_stop_writes anymore.
> >
> > The process changes to
> > 1. raid_postsuspend->md_stop_writes->__md_stop_writes->stop_sync_thread
> > (wait until MD_RECOVERY_RUNNING clears)
> > 2. md thread -> md_check_recovery -> unregister_sync_thread ->
> > md_reap_sync_thread (clears MD_RECOVERY_RUNNING, stop_sync_thread
> > returns, md_reap_sync_thread sets MD_RECOVERY_NEEDED)
> > 3. raid_postsuspend->mddev_suspend
> > 4. md sync thread starts again because __md_stop_writes doesn't set
> > MD_RECOVERY_FROZEN.
> > It's the reason why we can see sync thread still happens when raid is suspended.
> >
> > So the patch fix this problem should:
>
> As I said, this is really a different problem from this patch, and it is
> fixed seperately by patch 9. Please take a look at that patch.

I think we're talking about the same problem. In patch07 it has a new
api md_frozen_sync_thread. It sets MD_RECOVERY_FROZEN before
stop_sync_thread. This is right. If we use this api in
raid_postsuspend, sync thread can't restart. So the deadlock can't
happen anymore?

And patch01 is breaking one logic which seems right:

commit 68866e425be2ef2664aa5c691bb3ab789736acf5
Author: Jonathan Brassow <jbrassow@xxxxxxxxxxxxxx>
Date:   Wed Jun 8 15:10:08 2011 +1000

    MD: no sync IO while suspended

    Disallow resync I/O while the RAID array is suspended.

We're trying to fix deadlock problems. But it's not good to fix a
problem by breaking an existing rule.

Regards
Xiao

>
> Thanks,
> Kuai
>
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 9e41a9aaba8b..666761466f02 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -6315,6 +6315,7 @@ static void md_clean(struct mddev *mddev)
> >
> >   static void __md_stop_writes(struct mddev *mddev)
> >   {
> > +       set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> >          stop_sync_thread(mddev, true, false);
> >          del_timer_sync(&mddev->safemode_timer);
> >
> > Like other places which call stop_sync_thread, it needs to set the
> > MD_RECOVERY_FROZEN bit.
> >
> > Regards
> > Xiao
> >
> >>
> >> Please also take a look at other patches, why step 1) can't stop sync
> >> thread.
> >>
> >> Thanks,
> >> Kuai
> >>
> >>>
> >>>
> >>>>
> >>>> Noted that this patch just fix one case that MD_RECOVERY_RUNNING can't
> >>>> be cleared, I you are testing this patch alone, please make sure that
> >>>> you still triggered the exactly same case:
> >>>>
> >>>> - MD_RCOVERY_RUNNING can't be cleared while array is suspended.
> >>>
> >>> I'm not testing this patch. I want to understand the patch well. So I
> >>> need to understand the issue first. I can't understand how this
> >>> deadlock (step1,step2) happens.
> >>>
> >>> Regards
> >>> Xiao
> >>>>
> >>>> Thanks,
> >>>> Kuai
> >>>>
> >>>
> >>> .
> >>>
> >>
> >
> > .
> >
>