Hi,
在 2024/11/05 17:01, Xiao Ni 写道:
On Tue, Nov 5, 2024 at 4:29 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
Hi,
在 2024/11/05 16:16, Mariusz Tkaczyk 写道:
On Tue, 5 Nov 2024 15:57:33 +0800
Xiao Ni <xni@xxxxxxxxxx> wrote:
One customer reports a bug: raid5 is hung when changing thread cnt
while resync is running. The stripes are all in conf->handle_list
and new threads can't handle them.
Is issue fixed with this patch? Is is missing here :)
Commit b39f35ebe86d ("md: don't quiesce in mddev_suspend()") removes
pers->quiesce from mddev_suspend/resume, then we can't guarantee sync
requests finish in suspend operation. One personality knows itself the
best. So pers->quiesce is a proper way to let personality quiesce.
Fixes: b39f35ebe86d ("md: don't quiesce in mddev_suspend()")
Signed-off-by: Xiao Ni <xni@xxxxxxxxxx>
---
drivers/md/md.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 67108c397c5a..7409ecb2df68 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -482,6 +482,9 @@ int mddev_suspend(struct mddev *mddev, bool interruptible)
return err;
}
+ if (mddev->pers)
+ mddev->pers->quiesce(mddev, 1);
+
Shouldn't it be implemented as below? According to b39f35ebe86d, some levels are
not implementing this?
+ if (mddev->pers && mddev->pers->quiesce)
+ mddev->pers->quiesce(mddev, 1);
It's fine, the fops is never NULL, just some levels points to an empty
function.
The tricky part here is that accessing "mddev->pers" is not safe here,
'reconfig_mutex' is not held in mddev_suspend(), and can concurrent
with, for example, change levels. :(
Hi Kuai
Now mddev->suspended is protected by mddev->suspend_mutex. It should
can avoid the problem you mentioned above? level_store calls
mddev_suspend and mddev->suspended is set to mddev->suspended+1. So
other paths will return because mddev->suspended is not 0.
Yeah, level store is just an wrong example, key point here is that
mddev->pers is not protected by 'suspend_mutex'. Other places are
md_run() and md_stop(), these path doesn't hold 'suspend_mutex' right?
And they look like can concurrent with suspend, for example,
suspend_lo_store()
Did you consider just fail raid5_store_group_thread_cnt() if sync_thread
is active?
Or call ->quiesce() directly here with 'reconfig_mutex' held before
suspend?
Thanks,
Kuai
Regards
Xiao
Thanks,
Kuai
Is it reproducible with upstream kernel?
Thanks,
Mariusz
.
.