On Tue, Jul 9, 2024 at 7:06 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
>
> On 2024/07/09 18:41, Heming Zhao wrote:
> > The commit 1bbe254e4336 ("md-cluster: check for timeout while a
> > new disk adding") is correct in terms of code syntax but does not
> > suit the real clustered code logic.
> >
> > When a timeout occurs while adding a new disk, if recv_daemon()
> > bypasses the unlock for ack_lockres:CR, another node will be waiting
> > to grab the EX lock. This will cause the cluster to hang indefinitely.
> >
> > How to fix:
> >
> > 1. In dlm_lock_sync(), change the wait behaviour from forever to a
> >    timeout. This could avoid the hanging issue when another node
> >    fails to handle a cluster msg. Another result of this change is
> >    that if another node receives an unknown msg (e.g. a new msg_type),
> >    the old code will hang, whereas the new code will time out and fail.
> >    This could help cluster_md handle a new msg_type from different
> >    nodes with different kernel/module versions (e.g. the user only
> >    updates one leg's kernel and monitors the stability of the new
> >    kernel).
> > 2. The old code for __sendmsg() always returns 0 (success) under the
> >    design (must successfully unlock ->message_lockres). This commit
> >    makes this function return an error number when an error occurs.
> >
> > Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk adding")
> > Signed-off-by: Heming Zhao <heming.zhao@xxxxxxxx>
> > Reviewed-by: Su Yue <glass.su@xxxxxxxx>
>
> Thanks for the patch.
>
> Acked-by: Yu Kuai <yukuai3@xxxxxxxxxx>

Applied to md-6.11. Thanks!

Song
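
For readers following the thread, the two changes listed in the quoted patch description (a bounded wait in place of an unbounded one, and a send path that reports failure instead of always returning 0) can be illustrated with a small user-space C sketch. This is not the md-cluster code: toy_lockres, toy_lock_sync(), toy_unlock() and toy_sendmsg() are hypothetical names invented for this illustration, and the polling loop only stands in for whatever wait primitive the actual patch uses.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical user-space stand-in for a DLM lock resource. */
struct toy_lockres {
    volatile bool granted; /* set once the lock request has completed */
    bool held;             /* true while this node holds the lock */
};

/* Monotonic clock in milliseconds. */
long long now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Wait for the lock to be granted, but give up after timeout_ms
 * and return -ETIMEDOUT instead of blocking forever.
 */
int toy_lock_sync(struct toy_lockres *res, int timeout_ms)
{
    struct timespec nap = { 0, 1000000 }; /* poll every 1 ms */
    long long deadline;

    assert(timeout_ms >= 0);
    deadline = now_ms() + timeout_ms;

    while (!res->granted) {
        if (now_ms() >= deadline)
            return -ETIMEDOUT;
        nanosleep(&nap, NULL);
    }
    res->held = true;
    return 0;
}

void toy_unlock(struct toy_lockres *res)
{
    res->held = false;
}

/*
 * Take the message lock, then wait for the ack lock. Any failure is
 * returned to the caller, and the message lock is released on every
 * path once it has been taken.
 */
int toy_sendmsg(struct toy_lockres *msg, struct toy_lockres *ack,
                int timeout_ms)
{
    int ret = toy_lock_sync(msg, timeout_ms);

    if (ret)
        return ret;

    ret = toy_lock_sync(ack, timeout_ms);
    if (!ret)
        toy_unlock(ack);

    toy_unlock(msg);
    return ret;
}
```

In this toy model, a peer that never acknowledges (its granted flag stays false) makes toy_sendmsg() return -ETIMEDOUT after the timeout while still releasing the message lock, rather than leaving the caller waiting indefinitely.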