On Tue, Nov 5, 2024 at 5:22 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote: > > Hi, > > 在 2024/11/05 17:01, Xiao Ni 写道: > > On Tue, Nov 5, 2024 at 4:29 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote: > >> > >> Hi, > >> > >> 在 2024/11/05 16:16, Mariusz Tkaczyk 写道: > >>> On Tue, 5 Nov 2024 15:57:33 +0800 > >>> Xiao Ni <xni@xxxxxxxxxx> wrote: > >>> > >>>> One customer reports a bug: raid5 is hung when changing thread cnt > >>>> while resync is running. The stripes are all in conf->handle_list > >>>> and new threads can't handle them. > >>> > >>> Is issue fixed with this patch? Is is missing here :) > >>>> > >>>> Commit b39f35ebe86d ("md: don't quiesce in mddev_suspend()") removes > >>>> pers->quiesce from mddev_suspend/resume, then we can't guarantee sync > >>>> requests finish in suspend operation. One personality knows itself the > >>>> best. So pers->quiesce is a proper way to let personality quiesce. > >>> > >>>> > >>>> Fixes: b39f35ebe86d ("md: don't quiesce in mddev_suspend()") > >>>> Signed-off-by: Xiao Ni <xni@xxxxxxxxxx> > >>>> --- > >>>> drivers/md/md.c | 6 ++++++ > >>>> 1 file changed, 6 insertions(+) > >>>> > >>>> diff --git a/drivers/md/md.c b/drivers/md/md.c > >>>> index 67108c397c5a..7409ecb2df68 100644 > >>>> --- a/drivers/md/md.c > >>>> +++ b/drivers/md/md.c > >>>> @@ -482,6 +482,9 @@ int mddev_suspend(struct mddev *mddev, bool interruptible) > >>>> return err; > >>>> } > >>>> > >>>> + if (mddev->pers) > >>>> + mddev->pers->quiesce(mddev, 1); > >>>> + > >>> > >>> Shouldn't it be implemented as below? According to b39f35ebe86d, some levels are > >>> not implementing this? > >>> > >>> + if (mddev->pers && mddev->pers->quiesce) > >>> + mddev->pers->quiesce(mddev, 1); > >> > >> It's fine, the fops is never NULL, just some levels points to an empty > >> function. > >> > >> The tricky part here is that accessing "mddev->pers" is not safe here, > >> 'reconfig_mutex' is not held in mddev_suspend(), and can concurrent > >> with, for example, change levels. :( > > > > Hi Kuai > > > > Now mddev->suspended is protected by mddev->suspend_mutex. It should > > can avoid the problem you mentioned above? level_store calls > > mddev_suspend and mddev->suspended is set to mddev->suspended+1. So > > other paths will return because mddev->suspended is not 0. > > Yeah, level store is just an wrong example, key point here is that > mddev->pers is not protected by 'suspend_mutex'. Other places are > md_run() and md_stop(), these path doesn't hold 'suspend_mutex' right? > And they look like can concurrent with suspend, for example, > suspend_lo_store() Yes, thanks for pointing out this. I need to think more about this. > > Did you consider just fail raid5_store_group_thread_cnt() if sync_thread > is active? It's one method, but not a good one. > > Or call ->quiesce() directly here with 'reconfig_mutex' held before > suspend? I'll think this way. Thanks! Regards Xiao > > Thanks, > Kuai > > > > > Regards > > Xiao > >> > >> Thanks, > >> Kuai > >> > >>> > >>> Is it reproducible with upstream kernel? > >>> > >>> Thanks, > >>> Mariusz > >>> > >>> . > >>> > >> > > > > > > . > > >