Patch "md-cluster: fix hanging issue while a new disk adding" has been added to the 6.10-stable tree

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



This is a note to let you know that I've just added the patch titled

    md-cluster: fix hanging issue while a new disk adding

to the 6.10-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     md-cluster-fix-hanging-issue-while-a-new-disk-adding.patch
and it can be found in the queue-6.10 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.



commit 6e581e828c65ad577a7499bd902220d6e07fea93
Author: Heming Zhao <heming.zhao@xxxxxxxx>
Date:   Tue Jul 9 18:41:19 2024 +0800

    md-cluster: fix hanging issue while a new disk adding
    
    [ Upstream commit fff42f213824fa434a4b6cf906b4331fe6e9302b ]
    
    The commit 1bbe254e4336 ("md-cluster: check for timeout while a
    new disk adding") is correct in terms of code syntax but not
    suite real clustered code logic.
    
    When a timeout occurs while adding a new disk, if recv_daemon()
    bypasses the unlock for ack_lockres:CR, another node will be waiting
    to grab EX lock. This will cause the cluster to hang indefinitely.
    
    How to fix:
    
    1. In dlm_lock_sync(), change the wait behaviour from forever to a
       timeout, This could avoid the hanging issue when another node
       fails to handle cluster msg. Another result of this change is
       that if another node receives an unknown msg (e.g. a new msg_type),
       the old code will hang, whereas the new code will timeout and fail.
       This could help cluster_md handle new msg_type from different
       nodes with different kernel/module versions (e.g. The user only
       updates one leg's kernel and monitors the stability of the new
       kernel).
    2. The old code for __sendmsg() always returns 0 (success) under the
       design (must successfully unlock ->message_lockres). This commit
       makes this function return an error number when an error occurs.
    
    Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk adding")
    Signed-off-by: Heming Zhao <heming.zhao@xxxxxxxx>
    Reviewed-by: Su Yue <glass.su@xxxxxxxx>
    Acked-by: Yu Kuai <yukuai3@xxxxxxxxxx>
    Signed-off-by: Song Liu <song@xxxxxxxxxx>
    Link: https://lore.kernel.org/r/20240709104120.22243-1-heming.zhao@xxxxxxxx
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 8e36a0feec098..b5a802ae17bb2 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -15,6 +15,7 @@
 
 #define LVB_SIZE	64
 #define NEW_DEV_TIMEOUT 5000
+#define WAIT_DLM_LOCK_TIMEOUT (30 * HZ)
 
 struct dlm_lock_resource {
 	dlm_lockspace_t *ls;
@@ -130,8 +131,13 @@ static int dlm_lock_sync(struct dlm_lock_resource *res, int mode)
 			0, sync_ast, res, res->bast);
 	if (ret)
 		return ret;
-	wait_event(res->sync_locking, res->sync_locking_done);
+	ret = wait_event_timeout(res->sync_locking, res->sync_locking_done,
+				WAIT_DLM_LOCK_TIMEOUT);
 	res->sync_locking_done = false;
+	if (!ret) {
+		pr_err("locking DLM '%s' timeout!\n", res->name);
+		return -EBUSY;
+	}
 	if (res->lksb.sb_status == 0)
 		res->mode = mode;
 	return res->lksb.sb_status;
@@ -743,7 +749,7 @@ static void unlock_comm(struct md_cluster_info *cinfo)
  */
 static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
 {
-	int error;
+	int error, unlock_error;
 	int slot = cinfo->slot_number - 1;
 
 	cmsg->slot = cpu_to_le32(slot);
@@ -751,7 +757,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
 	error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX);
 	if (error) {
 		pr_err("md-cluster: failed to get EX on MESSAGE (%d)\n", error);
-		goto failed_message;
+		return error;
 	}
 
 	memcpy(cinfo->message_lockres->lksb.sb_lvbptr, (void *)cmsg,
@@ -781,14 +787,10 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
 	}
 
 failed_ack:
-	error = dlm_unlock_sync(cinfo->message_lockres);
-	if (unlikely(error != 0)) {
+	while ((unlock_error = dlm_unlock_sync(cinfo->message_lockres)))
 		pr_err("md-cluster: failed convert to NL on MESSAGE(%d)\n",
-			error);
-		/* in case the message can't be released due to some reason */
-		goto failed_ack;
-	}
-failed_message:
+			unlock_error);
+
 	return error;
 }
 




[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux