Re: [PATCH v5 01/14] md: don't ignore suspended array in md_check_recovery()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sun, Feb 18, 2024 at 10:34 AM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> 在 2024/02/18 10:27, Xiao Ni 写道:
> > On Sun, Feb 18, 2024 at 9:46 AM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
> >>
> >> Hi,
> >>
> >> 在 2024/02/18 9:33, Xiao Ni 写道:
> >>> The deadlock problem mentioned in this patch should not be right?
> >>
> >> No, I think it's right. Looks like you are expecting other problems,
> >> like mentioned in patch 6, to be fixed by this patch.
> >
> > Hi Kuai
> >
> > Could you explain why step1 and step2 from this comment can happen
> > simultaneously? From the log, the process should be
> > The process is :
> > dev_remove->dm_destroy->__dm_destroy->dm_table_postsuspend_targets(raid_postsuspend)
> > -> dm_table_destroy(raid_dtr).
> > After suspending the array, it calls raid_dtr. So these two functions
> > can't happen simultaneously.
>
> You're removing the target directly, however, dm can suspend the disk
> directly, you can simplily:
>
> 1) dmsetup suspend xxx
> 2) dmsetup remove xxx

For dm-raid, the design of suspend stops sync thread first and then it
calls mddev_suspend to suspend array. So I'm curious why the sync
thread can still exit when array is suspended. I know the reason now.
Because before f52f5c71f (md: fix stopping sync thread), the process
is raid_postsuspend->md_stop_writes->__md_stop_writes
(__md_stop_writes sets MD_RECOVERY_FROZEN). In patch f52f5c71f, it
doesn't set MD_RECOVERY_FROZEN in __md_stop_writes anymore.

The process changes to
1. raid_postsuspend->md_stop_writes->__md_stop_writes->stop_sync_thread
(wait until MD_RECOVERY_RUNNING clears)
2. md thread -> md_check_recovery -> unregister_sync_thread ->
md_reap_sync_thread (clears MD_RECOVERY_RUNNING, stop_sync_thread
returns, md_reap_sync_thread sets MD_RECOVERY_NEEDED)
3. raid_postsuspend->mddev_suspend
4. md sync thread starts again because __md_stop_writes doesn't set
MD_RECOVERY_FROZEN.
It's the reason why we can see sync thread still happens when raid is suspended.

So the patch fix this problem should:

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 9e41a9aaba8b..666761466f02 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6315,6 +6315,7 @@ static void md_clean(struct mddev *mddev)

 static void __md_stop_writes(struct mddev *mddev)
 {
+       set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
        stop_sync_thread(mddev, true, false);
        del_timer_sync(&mddev->safemode_timer);

Like other places which call stop_sync_thread, it needs to set the
MD_RECOVERY_FROZEN bit.

Regards
Xiao

>
> Please also take a look at other patches, why step 1) can't stop sync
> thread.
>
> Thanks,
> Kuai
>
> >
> >
> >>
> >> Noted that this patch just fix one case that MD_RECOVERY_RUNNING can't
> >> be cleared, I you are testing this patch alone, please make sure that
> >> you still triggered the exactly same case:
> >>
> >> - MD_RCOVERY_RUNNING can't be cleared while array is suspended.
> >
> > I'm not testing this patch. I want to understand the patch well. So I
> > need to understand the issue first. I can't understand how this
> > deadlock (step1,step2) happens.
> >
> > Regards
> > Xiao
> >>
> >> Thanks,
> >> Kuai
> >>
> >
> > .
> >
>






[Index of Archives]     [DM Crypt]     [Fedora Desktop]     [ATA RAID]     [Fedora Marketing]     [Fedora Packaging]     [Fedora SELinux]     [Yosemite Discussion]     [KDE Users]     [Fedora Docs]

  Powered by Linux