On Mon, Apr 25, 2022 at 08:40:30PM -0500, Bob Pearson wrote:
> On 4/25/22 17:58, Jason Gunthorpe wrote:
> Imagine a very long RDMA read operation that times out several times before finally
> getting all the data returned to the requester. Now imagine it is followed by some
> small RDMA ops to a different node that use fast reg MRs and are executed by the
> other node after receiving a small control message. E.g.
>
> node1                                   node2               node3
>
> 1: Send: RDMA_READ(mr1 to node2)
>                                         RDMA_READ_REPLY(mr1@node1, 1of2)
> ib_map_mr_sg(mr2a local)
> Send: IB_WR_REG_MR(mr2a local)
> Send: Control msg (mr2a to node3)
>                                                             Send: RDMA_WRITE(mr2a@node1)
> Send: IB_WR_LOCAL_INV(mr2a local)
> ib_update_fast_reg_key(mr2a->mr2b)
> ib_map_mr_sg(mr2b local)
> Send: Control msg (mr2b to node3)
>                                                             Send: RDMA_WRITE(mr2b@node1)
> Timeout: replay from 1 (w/o local ops)
> Send: RDMA_READ(mr1 to node2)
>                                         RDMA_READ_REPLY(mr1@node1, 2of2)
> Send: Control msg (mr2a to node3)
>                                                             Send: RDMA_WRITE(mr2a@node1)
>                                                             FAILS because mr2a has been
>                                                             replaced by mr2b.
> On the other hand if we replay the REG_MR local command that won't work either
> because we didn't know to rerun the ib_map_mr_sg() call.

How did you get two destination nodes into an RC send queue? We have
SRQ not SSQ.

In any event, the above is a buggy ULP. The IB_WR_LOCAL_INV cannot be
posted until the CQE for the Send with mr2a is received (or possibly a
strong fence is used).

It follows the general rule that the ULP cannot alter the data memory
under a WQE until it sees the CQE for that WQE, which is how it knows
the NIC has finished with the memory.

Jason
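
To make the ordering rule concrete, here is a toy model of a single send queue. It is only a sketch and not the kernel verbs API: ToySendQueue, post_send(), complete_next() and post_local_inv() are hypothetical names invented for illustration, and tagging the control-message Send with "mr2a" is an assumption standing in for "the Send with mr2a" above. The point it shows is that the invalidate is refused while any earlier WQE that references the MR has not yet produced its CQE, since that WQE may still be replayed.

```python
from collections import deque


class ToySendQueue:
    """Toy model of one RC send queue (hypothetical, not the verbs API)."""

    def __init__(self):
        # Posted WQEs the NIC may still read from or replay, oldest first.
        self.pending = deque()
        self.next_id = 0

    def post_send(self, opcode, mr=None):
        """Post a WQE that references memory region `mr`; return its id."""
        wr_id = self.next_id
        self.next_id += 1
        self.pending.append((wr_id, opcode, mr))
        return wr_id

    def complete_next(self):
        """Simulate the NIC finishing the oldest WQE and emitting its CQE."""
        return self.pending.popleft()[0]

    def in_flight(self, mr):
        """True if any WQE without a CQE still references `mr`."""
        return any(m == mr for _, _, m in self.pending)

    def post_local_inv(self, mr):
        """Refuse to invalidate while an earlier WQE may still use `mr`."""
        if self.in_flight(mr):
            raise RuntimeError(f"{mr} still referenced by an uncompleted WQE")
        return self.post_send("LOCAL_INV", mr)


sq = ToySendQueue()
sq.post_send("RDMA_READ", "mr1")          # the long read that may be replayed
send_id = sq.post_send("SEND", "mr2a")    # control message tied to mr2a

try:
    sq.post_local_inv("mr2a")             # too early: no CQE for the Send yet
    early_ok = True
except RuntimeError:
    early_ok = False

sq.complete_next()                        # CQE for the RDMA_READ
sq.complete_next()                        # CQE for the Send
inv_id = sq.post_local_inv("mr2a")        # now safe to invalidate
```

In the model the early invalidate is rejected because the Send (and the read queued ahead of it) could still be replayed after a timeout; once both CQEs have been seen, nothing queued refers to mr2a any more and the invalidate goes through.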