Patch "net/smc: avoid data corruption caused by decline" has been added to the 5.15-stable tree

This is a note to let you know that I've just added the patch titled

    net/smc: avoid data corruption caused by decline

to the 5.15-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     net-smc-avoid-data-corruption-caused-by-decline.patch
and it can be found in the queue-5.15 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.



commit cf13e1fdd3ad410d8d71aa7bb4ad91b9c0e140aa
Author: D. Wythe <alibuda@xxxxxxxxxxxxxxxxx>
Date:   Wed Nov 22 10:37:05 2023 +0800

    net/smc: avoid data corruption caused by decline
    
    [ Upstream commit e6d71b437abc2f249e3b6a1ae1a7228e09c6e563 ]
    
    We found a data corruption issue during testing of SMC-R on Redis
    applications.
    
    The benchmark occasionally reports a strange error, as shown
    below.
    
    "Error: Protocol error, got "\xe2" as reply type byte"
    
    We eventually found that the data retrieved at the point of the error
    was as follows:
    
    0xE2 0xD4 0xC3 0xD9 0x04 0x00 0x2C 0x20 0xA6 0x56 0x00 0x16 0x3E 0x0C
    0xCB 0x04 0x02 0x01 0x00 0x00 0x20 0x00 0x00 0x00 0x00 0x00 0x00 0x00
    0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xE2
    
    This is clearly an SMC DECLINE message, which means that the
    application received an SMC protocol message as if it were payload data.
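The identification above can be checked mechanically. The following is a minimal user-space sketch, not kernel code: it assumes the CLC DECLINE layout from net/smc/smc_clc.h (4-byte EBCDIC "SMCR" eyecatcher, 1-byte type, 2-byte big-endian length, 1 byte of version/flags, 8-byte peer ID, 4-byte big-endian peer diagnosis). The struct decline_info and the function parse_smc_decline() are hypothetical names introduced for illustration only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values assumed to match the kernel's net/smc/smc_clc.h. */
#define SMC_CLC_DECLINE         0x04
#define SMC_CLC_DECL_TIMEOUT_CL 0x02010000u /* timeout waiting for CONFIRM LINK */

/* "SMCR" in EBCDIC, the eyecatcher that opens an SMC-R CLC message. */
static const uint8_t smcr_eyecatcher[4] = { 0xE2, 0xD4, 0xC3, 0xD9 };

/* Hypothetical container for the decoded fields (not a kernel struct). */
struct decline_info {
	uint8_t  type;           /* CLC message type */
	uint16_t length;         /* total message length from the header */
	uint32_t peer_diagnosis; /* decline reason code */
};

/* The first 20 bytes of the dump quoted in the commit message. */
static const uint8_t sample[] = {
	0xE2, 0xD4, 0xC3, 0xD9, 0x04, 0x00, 0x2C, 0x20,
	0xA6, 0x56, 0x00, 0x16, 0x3E, 0x0C, 0xCB, 0x04,
	0x02, 0x01, 0x00, 0x00,
};

/* Returns 1 and fills *out if buf starts with an SMC-R CLC DECLINE, else 0. */
static int parse_smc_decline(const uint8_t *buf, size_t len,
			     struct decline_info *out)
{
	if (len < 20)
		return 0;
	if (memcmp(buf, smcr_eyecatcher, sizeof(smcr_eyecatcher)) != 0)
		return 0;
	if (buf[4] != SMC_CLC_DECLINE)
		return 0;

	out->type = buf[4];
	out->length = (uint16_t)((buf[5] << 8) | buf[6]);
	/* buf[7] holds version/flags, buf[8..15] the sender's peer ID. */
	out->peer_diagnosis = ((uint32_t)buf[16] << 24) |
			      ((uint32_t)buf[17] << 16) |
			      ((uint32_t)buf[18] << 8) |
			      (uint32_t)buf[19];
	return 1;
}
```

Calling parse_smc_decline(sample, sizeof(sample), &info) on the quoted bytes should yield type 0x04, a header length of 0x002C, and a diagnosis code equal to SMC_CLC_DECL_TIMEOUT_CL, which is consistent with the confirm-link timeout described in this commit.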
    We found that this was caused by the following sequence of events:
    
    client                  server
            ¦  clc proposal
            ------------->
            ¦  clc accept
            <-------------
            ¦  clc confirm
            ------------->
    wait llc confirm
                            send llc confirm
            ¦failed llc confirm
            ¦   x------
    (after 2s)timeout
                            wait llc confirm rsp
    
    wait decline
    
    (after 1s) timeout
                            (after 2s) timeout
            ¦   decline
            -------------->
            ¦   decline
            <--------------
    
    As a result, a decline message was sent by each side, and the peer's
    message was read from TCP as application data by the connection that
    had already fallen back.
    
    This patch doubles the client timeout to 2x the server value.
    With this simple change, the Decline messages should never cross or
    collide (during the Confirm Link timeout).
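The effect of the longer client timeout can be illustrated with a small timing model. This is only a sketch under stated assumptions: it uses the 2s and 1s values from the diagram above, ignores network latency, and introduces a hypothetical parameter server_start_delay_ms for how much later than the client the server begins its own wait. The function declines_cross() is not part of the kernel.

```c
#include <stdbool.h>

#define SERVER_LLC_WAIT_MS 2000 /* server's confirm-link wait ("after 2s") */
#define DECLINE_WAIT_MS    1000 /* client's wait for a peer decline ("after 1s") */

/*
 * Hypothetical model: the client starts waiting at t = 0 and gives up
 * (sending its own decline) once its LLC wait plus its decline wait have
 * both expired. The server starts server_start_delay_ms later and sends
 * its decline when its own LLC wait expires. The declines cross if the
 * server's decline is sent no earlier than the moment the client gives up.
 */
static bool declines_cross(int client_llc_wait_ms, int server_start_delay_ms)
{
	int client_gives_up_ms = client_llc_wait_ms + DECLINE_WAIT_MS;
	int server_declines_ms = server_start_delay_ms + SERVER_LLC_WAIT_MS;

	return server_declines_ms >= client_gives_up_ms;
}
```

With an assumed server delay of 1200 ms, declines_cross(2000, 1200) is true (the old behaviour), while declines_cross(4000, 1200) is false (the patched behaviour). In this model a sufficiently large server delay would still produce a crossing, so the doubled timeout narrows the window rather than closing it.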
    
    This issue requires an immediate fix, since a more complete solution
    involves protocol updates and will take longer.
    
    Fixes: 0fb0b02bd6fd ("net/smc: adapt SMC client code to use the LLC flow")
    Signed-off-by: D. Wythe <alibuda@xxxxxxxxxxxxxxxxx>
    Reviewed-by: Wen Gu <guwen@xxxxxxxxxxxxxxxxx>
    Reviewed-by: Wenjia Zhang <wenjia@xxxxxxxxxxxxx>
    Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx>
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 49cf523a783a2..8c11eb70c0f69 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -398,8 +398,12 @@ static int smcr_clnt_conf_first_link(struct smc_sock *smc)
 	struct smc_llc_qentry *qentry;
 	int rc;
 
-	/* receive CONFIRM LINK request from server over RoCE fabric */
-	qentry = smc_llc_wait(link->lgr, NULL, SMC_LLC_WAIT_TIME,
+	/* Receive CONFIRM LINK request from server over RoCE fabric.
+	 * Increasing the client's timeout by twice as much as the server's
+	 * timeout by default can temporarily avoid decline messages of
+	 * both sides crossing or colliding
+	 */
+	qentry = smc_llc_wait(link->lgr, NULL, 2 * SMC_LLC_WAIT_TIME,
 			      SMC_LLC_CONFIRM_LINK);
 	if (!qentry) {
 		struct smc_clc_msg_decline dclc;


