On 11/15/23 10:06 PM, Wenjia Zhang wrote:
On 13.11.23 04:44, Dust Li wrote:
On Wed, Nov 08, 2023 at 05:48:29PM +0800, D. Wythe wrote:
From: "D. Wythe" <alibuda@xxxxxxxxxxxxxxxxx>
We found a data corruption issue during testing of SMC-R on Redis
applications.
The benchmark has a low probability of reporting a strange error as
shown below.
"Error: Protocol error, got "\xe2" as reply type byte"
Finally, we found that the retrieved error data was as follows:
0xE2 0xD4 0xC3 0xD9 0x04 0x00 0x2C 0x20 0xA6 0x56 0x00 0x16 0x3E 0x0C
0xCB 0x04 0x02 0x01 0x00 0x00 0x20 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xE2
It is quite obvious that this is an SMC DECLINE message, which means
that the application received an SMC protocol message.
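
For readers who want to check the dump by hand, here is a minimal
userspace sketch (not kernel code). It assumes the layout suggested by
the bytes above: a 4-byte eyecatcher (0xE2 0xD4 0xC3 0xD9 is "SMCR" in
EBCDIC), a 1-byte message type, then a big-endian 16-bit length. The
names CLC_TYPE_DECLINE, looks_like_smc_decline() and clc_length() are
made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Leading bytes of the dump quoted above. */
static const uint8_t dump[] = {
	0xE2, 0xD4, 0xC3, 0xD9, 0x04, 0x00, 0x2C, 0x20
};

/* "SMCR" encoded in EBCDIC. */
static const uint8_t smcr_eyecatcher[4] = { 0xE2, 0xD4, 0xC3, 0xD9 };

/* Hypothetical name; the value is byte 4 of the dump. */
#define CLC_TYPE_DECLINE 0x04

/* Returns 1 if buf starts with the eyecatcher followed by the decline type. */
static int looks_like_smc_decline(const uint8_t *buf, size_t len)
{
	if (len < 5)
		return 0;
	return memcmp(buf, smcr_eyecatcher, sizeof(smcr_eyecatcher)) == 0 &&
	       buf[4] == CLC_TYPE_DECLINE;
}

/* Assumed big-endian 16-bit length at bytes 5..6; returns 0 if buf is too short. */
static unsigned int clc_length(const uint8_t *buf, size_t len)
{
	if (len < 7)
		return 0;
	return ((unsigned int)buf[5] << 8) | buf[6];
}
```

Running both helpers over the first bytes of a payload that an
application rejected is a quick way to tell whether SMC handshake
data leaked into the TCP stream.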
We found that this was caused by the following situations:
client                          server
        proposal
        ------------->
        accept
        <-------------
        confirm
        ------------->
wait confirm
        failed llc confirm
            x------
(after 2s) timeout
                                wait rsp
wait decline
(after 1s) timeout
                                (after 2s) timeout
        decline
        -------------->
        decline
        <--------------
As a result, a decline message was sent, and this message was read from
the TCP stream by the connection that had already fallen back to TCP.
This patch doubles the client timeout to 2x the server value.
With this simple change, the Decline messages should never cross or
collide (during Confirm link timeout).
This issue requires an immediate fix, since the protocol updates that
would solve it properly are a longer-term effort.
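
The timing argument can be sketched with a simplified model. This is
only an illustration, not the kernel logic: it assumes the client gives
up once its CONFIRM LINK wait plus its short decline wait have both
expired, and that the server's decline goes out once the server's own
wait expires, with the server starting its wait some delay after the
client. The 2s and 1s values come from the diagram above; the struct,
the function name and the start delay are hypothetical.

```c
#include <stdbool.h>

/* All times in milliseconds, measured from the start of the client's wait. */
struct handshake_timeouts {
	unsigned int client_llc_wait_ms;     /* client wait for CONFIRM LINK */
	unsigned int client_decline_wait_ms; /* short client wait for a decline */
	unsigned int server_llc_wait_ms;     /* server wait for the response */
};

/*
 * Returns true if, in this model, the client sends its own decline
 * before (or at the moment) the server's decline is sent, so the two
 * declines can cross. server_start_delay_ms is a hypothetical offset
 * between the start of the client's wait and the server's wait.
 */
static bool declines_may_cross(const struct handshake_timeouts *t,
			       unsigned int server_start_delay_ms)
{
	unsigned int client_gives_up_ms =
		t->client_llc_wait_ms + t->client_decline_wait_ms;
	unsigned int server_declines_ms =
		server_start_delay_ms + t->server_llc_wait_ms;

	return client_gives_up_ms <= server_declines_ms;
}

/* Values from the diagram: both sides wait 2s, client then waits 1s. */
static const struct handshake_timeouts before_patch = { 2000, 1000, 2000 };

/* Same values with the client's CONFIRM LINK wait doubled. */
static const struct handshake_timeouts after_patch = { 4000, 1000, 2000 };
```

Under these assumptions, a server that starts its wait 1.5s late can
cross declines with the client before the change but not after it. The
margin grows from 1s to 3s, which matches the description of this patch
as an immediate rather than a long-term solution.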
Fixes: 0fb0b02bd6fd ("net/smc: adapt SMC client code to use the LLC flow")
Signed-off-by: D. Wythe <alibuda@xxxxxxxxxxxxxxxxx>
---
net/smc/af_smc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index abd2667..5b91f55 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -599,7 +599,7 @@ static int smcr_clnt_conf_first_link(struct smc_sock *smc)
int rc;
/* receive CONFIRM LINK request from server over RoCE fabric */
- qentry = smc_llc_wait(link->lgr, NULL, SMC_LLC_WAIT_TIME,
+ qentry = smc_llc_wait(link->lgr, NULL, 2 * SMC_LLC_WAIT_TIME,
SMC_LLC_CONFIRM_LINK);
It may be difficult for people to understand why LLC_WAIT_TIME is
different, especially without any comments explaining its purpose.
People are required to use git to find the reason, which I believe is
not conducive to easy maintenance.
Best regards,
Dust
Good point! @D.Wythe, could you please try to add a simple comment to
explain it?
Sounds good to me, I will add a comment to explain it.
D. Wythe
Thanks,
Wenjia
if (!qentry) {
struct smc_clc_msg_decline dclc;
--
1.8.3.1