On Sun, Feb 18, 2024 at 9:15 AM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote: > > Hi, > > 在 2024/02/16 14:58, Xiao Ni 写道: > > On Thu, Feb 1, 2024 at 5:30 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote: > >> > >> From: Yu Kuai <yukuai3@xxxxxxxxxx> > >> > >> mddev_suspend() never stop sync_thread, hence it doesn't make sense to > >> ignore suspended array in md_check_recovery(), which might cause > >> sync_thread can't be unregistered. > >> > >> After commit f52f5c71f3d4 ("md: fix stopping sync thread"), following > >> hang can be triggered by test shell/integrity-caching.sh: > > > > Hi Kuai > > > > After applying this patch, it's still stuck at mddev_suspend. Maybe > > the deadlock can be fixed by other patches from the patch set. But > > this patch can't fix this issue. If so, the comment is not right. > > This patch alone can't fix the problem that mddev_suspend() can stuck > thoroughly, patches 1-4 will all be needed. > > Thanks, > Kuai > > > > >> > >> 1) suspend the array: > >> raid_postsuspend > >> mddev_suspend > >> > >> 2) stop the array: > >> raid_dtr > >> md_stop > >> __md_stop_writes > >> stop_sync_thread > >> set_bit(MD_RECOVERY_INTR, &mddev->recovery); > >> md_wakeup_thread_directly(mddev->sync_thread); > >> wait_event(..., !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) > >> > >> 3) sync thread done: > >> md_do_sync > >> set_bit(MD_RECOVERY_DONE, &mddev->recovery); > >> md_wakeup_thread(mddev->thread); > >> > >> 4) daemon thread can't unregister sync thread: > >> md_check_recovery > >> if (mddev->suspended) > >> return; -> return directly > >> md_read_sync_thread > >> clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery); > >> -> MD_RECOVERY_RUNNING can't be cleared, hence step 2 hang; > > > > I add some debug logs when stopping dmraid with lvremove command. The > > step you mentioned are sequential but not async. The process is : > > dev_remove->dm_destroy->__dm_destroy->dm_table_postsuspend_targets(raid_postsuspend) > > -> dm_table_destroy(raid_dtr). It looks like mddev_suspend is waiting > > for active_io to be zero. The deadlock problem mentioned in this patch should not be right? Regards Xiao > > > > Best Regards > > Xiao > > > >> This problem is not just related to dm-raid, fix it by ignoring > >> suspended array in md_check_recovery(). And follow up patches will > >> improve dm-raid better to frozen sync thread during suspend. > >> > >> Reported-by: Mikulas Patocka <mpatocka@xxxxxxxxxx> > >> Closes: https://lore.kernel.org/all/8fb335e-6d2c-dbb5-d7-ded8db5145a@xxxxxxxxxx/ > >> Fixes: 68866e425be2 ("MD: no sync IO while suspended") > >> Fixes: f52f5c71f3d4 ("md: fix stopping sync thread") > >> Signed-off-by: Yu Kuai <yukuai3@xxxxxxxxxx> > >> --- > >> drivers/md/md.c | 3 --- > >> 1 file changed, 3 deletions(-) > >> > >> diff --git a/drivers/md/md.c b/drivers/md/md.c > >> index 2266358d8074..07b80278eaa5 100644 > >> --- a/drivers/md/md.c > >> +++ b/drivers/md/md.c > >> @@ -9469,9 +9469,6 @@ static void md_start_sync(struct work_struct *ws) > >> */ > >> void md_check_recovery(struct mddev *mddev) > >> { > >> - if (READ_ONCE(mddev->suspended)) > >> - return; > >> - > >> if (mddev->bitmap) > >> md_bitmap_daemon_work(mddev); > >> > >> -- > >> 2.39.2 > >> > > > > > > . > > >