Hi,
在 2024/06/28 20:32, Heming Zhao 写道:
On 6/27/24 20:52, Yu Kuai wrote:
Hi,
在 2024/06/12 10:19, Heming Zhao 写道:
The commit 1bbe254e4336 ("md-cluster: check for timeout while a
new disk adding") is correct in terms of code syntax but not
suite real clustered code logic.
When a timeout occurs while adding a new disk, if recv_daemon()
bypasses the unlock for ack_lockres:CR, another node will be waiting
to grab EX lock. This will cause the cluster to hang indefinitely.
How to fix:
1. In dlm_lock_sync(), change the wait behaviour from forever to a
timeout, This could avoid the hanging issue when another node
fails to handle cluster msg. Another result of this change is
that if another node receives an unknown msg (e.g. a new msg_type),
the old code will hang, whereas the new code will timeout and fail.
This could help cluster_md handle new msg_type from different
nodes with different kernel/module versions (e.g. The user only
updates one leg's kernel and monitors the stability of the new
kernel).
2. The old code for __sendmsg() always returns 0 (success) under the
design (must successfully unlock ->message_lockres). This commit
makes this function return an error number when an error occurs.
Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk
adding")
Signed-off-by: Heming Zhao <heming.zhao@xxxxxxxx>
Reviewed-by: Su Yue <glass.su@xxxxxxxx>
---
drivers/md/md-cluster.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 8e36a0feec09..27eaaf9fef94 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -130,8 +130,13 @@ static int dlm_lock_sync(struct
dlm_lock_resource *res, int mode)
0, sync_ast, res, res->bast);
if (ret)
return ret;
- wait_event(res->sync_locking, res->sync_locking_done);
+ ret = wait_event_timeout(res->sync_locking, res->sync_locking_done,
+ 60 * HZ);
Let's not use magic number directly, it's better to define a marco. BTW,
60s looks too long for me.
got it, will create a define:
#define WAIT_DLM_LOCK_TIMEOUT 30 * HZ
In my view, the shortest time should be 30s. because there is a
clustered env.
Node A is waiting for node B to release the lock.
We should consider:
- network traffic (node A and B are not in the same build)
- another node's udev event handling time: NEW_DEV_TIMEOUT 5000
res->sync_locking_done = false;
And I tried to find, if setting this value on failure is ok. However,
I'm lost and I really don't know. Can you explain this?
This code logic is the same as dlm_lock_sync_interruptible(). We can
see that regardless of success or failure, '->sync_locking_done' is
set to false in dlm_lock_sync_interruptible().
+ if (!ret) {
+ pr_err("locking DLM '%s' timeout!\n", res->name);
+ return -EBUSY;
+ }
if (res->lksb.sb_status == 0)
res->mode = mode;
return res->lksb.sb_status;
@@ -744,12 +749,14 @@ static void unlock_comm(struct md_cluster_info
*cinfo)
static int __sendmsg(struct md_cluster_info *cinfo, struct
cluster_msg *cmsg)
{
int error;
+ int ret = 0;
int slot = cinfo->slot_number - 1;
cmsg->slot = cpu_to_le32(slot);
/*get EX on Message*/
error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX);
if (error) {
+ ret = error;
You can return error directly in this branch.
OK
pr_err("md-cluster: failed to get EX on MESSAGE (%d)\n",
error);
goto failed_message;
}
@@ -759,6 +766,7 @@ static int __sendmsg(struct md_cluster_info
*cinfo, struct cluster_msg *cmsg)
/*down-convert EX to CW on Message*/
error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CW);
if (error) {
+ ret = error;
pr_err("md-cluster: failed to convert EX to CW on
MESSAGE(%d)\n",
error);
goto failed_ack;
@@ -767,6 +775,7 @@ static int __sendmsg(struct md_cluster_info
*cinfo, struct cluster_msg *cmsg)
/*up-convert CR to EX on Ack*/
error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_EX);
if (error) {
+ ret = error;
pr_err("md-cluster: failed to convert CR to EX on ACK(%d)\n",
error);
goto failed_ack;
@@ -775,6 +784,7 @@ static int __sendmsg(struct md_cluster_info
*cinfo, struct cluster_msg *cmsg)
/*down-convert EX to CR on Ack*/
error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR);
if (error) {
+ ret = error;
pr_err("md-cluster: failed to convert EX to CR on ACK(%d)\n",
error);
goto failed_ack;
@@ -789,7 +799,7 @@ static int __sendmsg(struct md_cluster_info
*cinfo, struct cluster_msg *cmsg)
goto failed_ack;
}
failed_message:
- return error;
+ return ret;
And I'll suggest just to change dlm_unlock_sync(), not to change all the
other places.
I have a different viewpoint, the clustermd code has been running for
about 10
years, and no bugs have been reported from SUSE customers for about 1
year. I am
inclined to keep the current code style. If we change dlm_unlock_sync(),
many
places will need to be rewritten, which may introduce new bugs. From the
callers
of this func (__sendmsg()), which handle the return value, we know it's
definitely wrong because the return value of __sendmsg() is always true.
That's not what I mean... Let me show you the code:
diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index eb9bbf12c8d8..11a3c9960a22 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -744,6 +744,7 @@ static void unlock_comm(struct md_cluster_info *cinfo)
static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg
*cmsg)
{
int error;
+ int unlock_error;
int slot = cinfo->slot_number - 1;
cmsg->slot = cpu_to_le32(slot);
@@ -781,13 +782,9 @@ static int __sendmsg(struct md_cluster_info *cinfo,
struct cluster_msg *cmsg)
}
failed_ack:
- error = dlm_unlock_sync(cinfo->message_lockres);
- if (unlikely(error != 0)) {
+ while ((unlock_error = dlm_unlock_sync(cinfo->message_lockres)))
pr_err("md-cluster: failed convert to NL on MESSAGE(%d)\n",
- error);
- /* in case the message can't be released due to some
reason */
- goto failed_ack;
- }
+ unlock_error);
failed_message:
return error;
Thanks,
Kuai
-Heming
.