Namjae, after building a 6.6.0-rc2 kernel to test here at the IOLab,
I was surprised to see the smbdirect connection break during the
Connectathon "special" tests. The basic tests all work fine, but shortly
after the special tests begin, I start seeing this on the server (this
is with softRoCE, though I see similar failures over softiWarp):
[ 1266.623187] rxe0: qp#17 do_complete: non-flush error status = 2
[ 1266.623233] ksmbd: smb_direct: Recv error. status='local QP operation error (2)' opcode=0
[ 1266.623605] ksmbd: smb_direct: disconnected
[ 1266.623610] ksmbd: sock_read failed: -107
[ 1266.628656] rxe0: qp#18 do_complete: non-flush error status = 2
[ 1266.628684] ksmbd: smb_direct: Recv error. status='local QP operation error (2)' opcode=0
[ 1266.628820] ksmbd: smb_direct: disconnected
[ 1266.628824] ksmbd: sock_read failed: -107
[ 1266.633354] rxe0: qp#19 do_complete: non-flush error status = 2
[ 1266.633380] ksmbd: smb_direct: Recv error. status='local QP operation error (2)' opcode=0
[ 1266.633583] ksmbd: smb_direct: disconnected
The local QP error, status 2, is IB_WC_LOC_QP_OP_ERR, which is a buffer
error of some sort: it could be an unavailable receive buffer, or maybe a
length overrun. Both of these seem highly improbable, because the "basic"
tests run fine. The client sees only a disconnection with IB_WC_REM_OP_ERR,
which is expected in this case.
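For anyone decoding the numeric status in the log by hand, here is a minimal userspace sketch of the mapping. It assumes the leading entries of enum ib_wc_status are ordered as in include/rdma/ib_verbs.h; the table is a local copy, and wc_status_name() is a hypothetical helper, not a kernel function.

```c
#include <stddef.h>

/* Local copy of the first entries of the kernel's enum ib_wc_status
 * (include/rdma/ib_verbs.h). Assumption: the order matches the header. */
static const char *const wc_status_names[] = {
	"IB_WC_SUCCESS",		/* 0 */
	"IB_WC_LOC_LEN_ERR",		/* 1 */
	"IB_WC_LOC_QP_OP_ERR",		/* 2: what the server logs above */
	"IB_WC_LOC_EEC_OP_ERR",		/* 3 */
	"IB_WC_LOC_PROT_ERR",		/* 4 */
	"IB_WC_WR_FLUSH_ERR",		/* 5 */
	"IB_WC_MW_BIND_ERR",		/* 6 */
	"IB_WC_BAD_RESP_ERR",		/* 7 */
	"IB_WC_LOC_ACCESS_ERR",		/* 8 */
	"IB_WC_REM_INV_REQ_ERR",	/* 9 */
	"IB_WC_REM_ACCESS_ERR",		/* 10 */
	"IB_WC_REM_OP_ERR",		/* 11: what the client sees */
};

/* Hypothetical helper: map a numeric completion status to its enum name.
 * Values outside the copied range come back as "unknown". */
static const char *wc_status_name(int status)
{
	size_t n = sizeof(wc_status_names) / sizeof(wc_status_names[0]);

	if (status < 0 || (size_t)status >= n)
		return "unknown";
	return wc_status_names[status];
}
```

Calling wc_status_name(2) yields the name of the status in the rxe0 and ksmbd lines above. In the kernel itself, ib_wc_status_msg() provides the human-readable string that ksmbd prints.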
OTOH it could be a client send issue, maybe a too-large datagram or an
smbdirect credit overrun. But it's the server detecting the error, so
I'm starting there for now.
This worked as recently as 6.5, and it was definitely all fine in 6.4. I
have not yet been able to drill down far enough to figure out what SMB3
payload ksmbd was receiving.
Steve tells me you test over RDMA semi-often. Have you seen this?
Any ideas are welcome.
Tom.