On Tue, Aug 11, 2020 at 01:49:31PM -0400, Mike Marciniszyn wrote:
> From: Kaike Wan <kaike.wan@xxxxxxxxx>
>
> The following message occurs when running an AI application
> with TID RDMA enabled:
>
> hfi1 0000:7f:00.0: hfi1_0: [QP74] hfi1_tid_timeout 4084
> hfi1 0000:7f:00.0: hfi1_0: [QP70] hfi1_tid_timeout 4084
>
> The issue happens when TID RDMA WRITE request is followed by an
> IB_WR_RDMA_WRITE_WITH_IMM request, the latter could be completed
> first on the responder side. As a result, no ACK packet for the
> latter could be sent because the TID RDMA WRITE request is still
> being processed on the responder side.
>
> When the TID RDMA WRITE request is eventually completed, the requester
> will wait for the IB_WR_RDMA_WRITE_WITH_IMM request to be acknowledged.
>
> If the next request is another TID RDMA WRITE request, no
> TID RDMA WRITE DATA packet could be sent because the preceding
> IB_WR_RDMA_WRITE_WITH_IMM request is not completed yet.
>
> Consequently the IB_WR_RDMA_WRITE_WITH_IMM will be retried but
> it will be ignored on the responder side because the responder
> thinks it has already been completed. Eventually the retry will
> be exhausted and the qp will be put into error state on the requester
> side. On the responder side, the TID resource timer will eventually
> expire because no TID RDMA WRITE DATA packets will be received for
> the second TID RDMA WRITE request. There is also risk of a
> write-after-write memory corruption due to the issue.
>
> Fix by adding a requester side interlock to prevent any potential
> data corruption and TID RDMA protocol error.
>
> Fixes: a0b34f75ec20 ("IB/hfi1: Add interlock between a TID RDMA request and other requests")
> Cc: <stable@xxxxxxxxxxxxxxx> # 5.4.x+
> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@xxxxxxxxx>
> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@xxxxxxxxx>
> Signed-off-by: Kaike Wan <kaike.wan@xxxxxxxxx>
> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@xxxxxxxxx>
> ---
>  drivers/infiniband/hw/hfi1/tid_rdma.c | 1 +
>  1 file changed, 1 insertion(+)

Applied to for-rc, thanks

Jason
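For readers following the thread, the requester side interlock described in the quoted commit message can be illustrated with a small standalone sketch. This is not the hfi1 driver code and not the one-line change itself: every name below (sketch_opcode, sketch_wqe, sketch_must_interlock, the segment counters) is hypothetical and exists only to show the idea that an RDMA WRITE with immediate is held back while the preceding TID RDMA WRITE is still outstanding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical opcodes for this sketch; the driver itself uses IB_WR_* values. */
enum sketch_opcode {
	SKETCH_TID_RDMA_WRITE,
	SKETCH_RDMA_WRITE_WITH_IMM,
	SKETCH_OTHER,
};

/* Hypothetical work request; the segment counters are an assumption. */
struct sketch_wqe {
	enum sketch_opcode opcode;
	unsigned int total_segs;	/* segments in a TID RDMA WRITE */
	unsigned int acked_segs;	/* segments acknowledged so far */
};

/* A TID RDMA WRITE is treated as complete once every segment is acknowledged. */
static bool sketch_tid_write_done(const struct sketch_wqe *wqe)
{
	return wqe->acked_segs == wqe->total_segs;
}

/*
 * Return true when the requester should hold `next` back because the
 * preceding request `prev` is a TID RDMA WRITE that has not completed.
 * Only the case from the commit message is modeled here.
 */
static bool sketch_must_interlock(const struct sketch_wqe *prev,
				  const struct sketch_wqe *next)
{
	assert(next != NULL);

	if (prev == NULL || prev->opcode != SKETCH_TID_RDMA_WRITE)
		return false;
	if (sketch_tid_write_done(prev))
		return false;

	switch (next->opcode) {
	case SKETCH_RDMA_WRITE_WITH_IMM:
		/* Sending now could let the responder complete it out of order. */
		return true;
	default:
		/* Other opcodes are not modeled in this sketch. */
		return false;
	}
}
```

In this sketch a send loop would call sketch_must_interlock() before posting each request and stop at the first one that returns true, resuming once the earlier TID RDMA WRITE has been fully acknowledged.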