On Tue, Jul 28, 2020 at 02:38:48PM -0400, Mike Marciniszyn wrote:
> The lookaside count is improperly initialized to the size of the
> Receive Queue with the additional +1. In the traces below, the
> RQ size is 384, so the count was set to 385.
>
> The lookaside count is then rarely refreshed. Note the high and
> incorrect count in the trace below:
>
> rvt_get_rwqe: [hfi1_0] wqe ffffc900078e9008 wr_id 55c7206d75a0 qpn c
>     qpt 2 pid 3018 num_sge 1 head 1 tail 0, count 385
> rvt_get_rwqe: (hfi1_rc_rcv+0x4eb/0x1480 [hfi1] <- rvt_get_rwqe) ret=0x1
>
> The head,tail indicate there is only one RWQE posted although the count
> says 385 and we correctly return the element 0.
>
> The next call to rvt_get_rwqe with the decremented count:
>
> rvt_get_rwqe: [hfi1_0] wqe ffffc900078e9058 wr_id 0 qpn c
>     qpt 2 pid 3018 num_sge 0 head 1 tail 1, count 384
> rvt_get_rwqe: (hfi1_rc_rcv+0x4eb/0x1480 [hfi1] <- rvt_get_rwqe) ret=0x1
>
> Note that the RQ is empty (head == tail) yet we return the RWQE at tail 1,
> which is not valid because of the bogus high count.
>
> Best case, the RWQE has never been posted and the rc logic sees an RWQE
> that is too small (all zeros) and puts the QP into an error state.
>
> In the worst case, a server slow at posting receive buffers might fool
> rvt_get_rwqe() into fetching an old RWQE and corrupt memory.
>
> Fix by deleting the faulty initialization code and creating an
> inline to fetch the posted count and convert all callers to use
> new inline.
>
> Fixes: f592ae3c999f ("IB/rdmavt: Fracture single lock used for posting and processing RWQEs")

Confirmed this patch works for me. Thanks.

Tested-by: Honggang Li <honli@xxxxxxxxxx>
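
For anyone reading along, here is a minimal sketch of how a posted count can be derived from head and tail rather than taken from a stale cached value. This is not the code from the patch: rq_posted_count() and its parameters are hypothetical names, and the sketch assumes a circular queue with `size` slots where head is the producer index and tail is the consumer index.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the patch: number of receive WQEs
 * that have been posted but not yet consumed in a circular queue with
 * `size` slots. Assumes head (producer index) and tail (consumer index)
 * are both in [0, size).
 */
static inline uint32_t rq_posted_count(uint32_t head, uint32_t tail,
				       uint32_t size)
{
	/* No wrap: the distance is direct; head == tail means empty. */
	if (head >= tail)
		return head - tail;

	/* head has wrapped past the end of the ring. */
	return size - tail + head;
}
```

Plugging in the values from the quoted traces, head 1 / tail 0 yields 1 and head 1 / tail 1 yields 0, independent of the queue size, so the second call would see an empty queue instead of a count of 384.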