Re: [PATCH v2 1/2] md-cluster: fix hanging issue while a new disk adding

Song Liu <song@xxxxxxxxxx> · Fri, 12 Jul 2024 23:09:14 +0800

On Tue, Jul 9, 2024 at 7:06 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
>
> On 2024/07/09 18:41, Heming Zhao wrote:
> > The commit 1bbe254e4336 ("md-cluster: check for timeout while a
> > new disk adding") is correct in terms of code syntax but does not
> > suit the real clustered code logic.
> >
> > When a timeout occurs while adding a new disk, if recv_daemon()
> > bypasses the unlock for ack_lockres:CR, another node will be left
> > waiting to grab the EX lock. This causes the cluster to hang
> > indefinitely.
> >
> > How to fix:
> >
> > 1. In dlm_lock_sync(), change the wait behaviour from forever to a
> >     timeout. This avoids the hang when another node fails to handle
> >     a cluster msg. Another result of this change is that if another
> >     node receives an unknown msg (e.g. a new msg_type), the old code
> >     will hang, whereas the new code will time out and fail.
> >     This could help cluster_md handle new msg_type from different
> >     nodes with different kernel/module versions (e.g. the user only
> >     updates one leg's kernel and monitors the stability of the new
> >     kernel).
> > 2. The old code for __sendmsg() always returns 0 (success) by
> >     design (it must successfully unlock ->message_lockres). This
> >     commit makes the function return an error number when an error
> >     occurs.
> >
> > Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk adding")
> > Signed-off-by: Heming Zhao <heming.zhao@xxxxxxxx>
> > Reviewed-by: Su Yue <glass.su@xxxxxxxx>
>
> Thanks for the patch.
>
> Acked-by: Yu Kuai <yukuai3@xxxxxxxxxx>

Applied to md-6.11. Thanks!

Song