On Sun, Feb 18, 2024 at 11:24 AM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote: > > Hi, > > 在 2024/02/18 11:15, Xiao Ni 写道: > > On Sun, Feb 18, 2024 at 10:34 AM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote: > >> > >> Hi, > >> > >> 在 2024/02/18 10:27, Xiao Ni 写道: > >>> On Sun, Feb 18, 2024 at 9:46 AM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote: > >>>> > >>>> Hi, > >>>> > >>>> 在 2024/02/18 9:33, Xiao Ni 写道: > >>>>> The deadlock problem mentioned in this patch should not be right? > >>>> > >>>> No, I think it's right. Looks like you are expecting other problems, > >>>> like mentioned in patch 6, to be fixed by this patch. > >>> > >>> Hi Kuai > >>> > >>> Could you explain why step1 and step2 from this comment can happen > >>> simultaneously? From the log, the process should be > >>> The process is : > >>> dev_remove->dm_destroy->__dm_destroy->dm_table_postsuspend_targets(raid_postsuspend) > >>> -> dm_table_destroy(raid_dtr). > >>> After suspending the array, it calls raid_dtr. So these two functions > >>> can't happen simultaneously. > >> > >> You're removing the target directly, however, dm can suspend the disk > >> directly, you can simplily: > >> > >> 1) dmsetup suspend xxx > >> 2) dmsetup remove xxx > > > > For dm-raid, the design of suspend stops sync thread first and then it > > calls mddev_suspend to suspend array. So I'm curious why the sync > > thread can still exit when array is suspended. I know the reason now. > > Because before f52f5c71f (md: fix stopping sync thread), the process > > is raid_postsuspend->md_stop_writes->__md_stop_writes > > (__md_stop_writes sets MD_RECOVERY_FROZEN). In patch f52f5c71f, it > > doesn't set MD_RECOVERY_FROZEN in __md_stop_writes anymore. > > > > The process changes to > > 1. raid_postsuspend->md_stop_writes->__md_stop_writes->stop_sync_thread > > (wait until MD_RECOVERY_RUNNING clears) > > 2. md thread -> md_check_recovery -> unregister_sync_thread -> > > md_reap_sync_thread (clears MD_RECOVERY_RUNNING, stop_sync_thread > > returns, md_reap_sync_thread sets MD_RECOVERY_NEEDED) > > 3. raid_postsuspend->mddev_suspend > > 4. md sync thread starts again because __md_stop_writes doesn't set > > MD_RECOVERY_FROZEN. > > It's the reason why we can see sync thread still happens when raid is suspended. > > > > So the patch fix this problem should: > > As I said, this is really a different problem from this patch, and it is > fixed seperately by patch 9. Please take a look at that patch. I think we're talking about the same problem. In patch07 it has a new api md_frozen_sync_thread. It sets MD_RECOVERY_FROZEN before stop_sync_thread. This is right. If we use this api in raid_postsuspend, sync thread can't restart. So the deadlock can't happen anymore? And patch01 is breaking one logic which seems right: commit 68866e425be2ef2664aa5c691bb3ab789736acf5 Author: Jonathan Brassow <jbrassow@xxxxxxxxxxxxxx> Date: Wed Jun 8 15:10:08 2011 +1000 MD: no sync IO while suspended Disallow resync I/O while the RAID array is suspended. We're trying to fix deadlock problems. But it's not good to fix a problem by breaking an existing rule. Regards Xiao > > Thanks, > Kuai > > > > > diff --git a/drivers/md/md.c b/drivers/md/md.c > > index 9e41a9aaba8b..666761466f02 100644 > > --- a/drivers/md/md.c > > +++ b/drivers/md/md.c > > @@ -6315,6 +6315,7 @@ static void md_clean(struct mddev *mddev) > > > > static void __md_stop_writes(struct mddev *mddev) > > { > > + set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); > > stop_sync_thread(mddev, true, false); > > del_timer_sync(&mddev->safemode_timer); > > > > Like other places which call stop_sync_thread, it needs to set the > > MD_RECOVERY_FROZEN bit. > > > > Regards > > Xiao > > > >> > >> Please also take a look at other patches, why step 1) can't stop sync > >> thread. > >> > >> Thanks, > >> Kuai > >> > >>> > >>> > >>>> > >>>> Noted that this patch just fix one case that MD_RECOVERY_RUNNING can't > >>>> be cleared, I you are testing this patch alone, please make sure that > >>>> you still triggered the exactly same case: > >>>> > >>>> - MD_RCOVERY_RUNNING can't be cleared while array is suspended. > >>> > >>> I'm not testing this patch. I want to understand the patch well. So I > >>> need to understand the issue first. I can't understand how this > >>> deadlock (step1,step2) happens. > >>> > >>> Regards > >>> Xiao > >>>> > >>>> Thanks, > >>>> Kuai > >>>> > >>> > >>> . > >>> > >> > > > > . > > >