On Tue, Jun 20, 2017 at 3:33 AM, Sagi Grimberg <sagi@xxxxxxxxxxx> wrote:
>
>>>> Here the parsed output, it says that it was access to mkey which is
>>>> free.
>
> Missed that :)
>
>>>> ======== cqe_with_error ========
>>>> wqe_id                : 0x0
>>>> srqn_usr_index        : 0x0
>>>> byte_cnt              : 0x0
>>>> hw_error_syndrome     : 0x93
>>>> hw_syndrome_type      : 0x0
>>>> vendor_error_syndrome : 0x52
>>>
>>> Can you share the check that correlates to the vendor+hw syndrome?
>>
>> mkey.free == 1
>
> Hmm, the way I understand it is that the HW is trying to access
> (locally via send) a MR which was already invalidated.
>
> Thinking of this further, this can happen in a case where the target
> already completed the transaction, sent SEND_WITH_INVALIDATE but the
> original send ack was lost somewhere causing the device to retransmit
> from the MR (which was already invalidated). This is highly unlikely
> though.
>
> Shouldn't this be protected somehow by the device?
> Can someone explain why the above cannot happen? Jason? Liran? Anyone?
>
> Say host register MR (a) and send (1) from that MR to a target,
> send (1) ack got lost, and the target issues SEND_WITH_INVALIDATE
> on MR (a) and the host HCA process it, then host HCA timeout on send (1)
> so it retries, but ehh, its already invalidated.
>
> Or, we can also have a race where we destroy all our MRs when I/O
> is still running (but from the code we should be safe here).
>
> Robert, when you rebooted the target, I assume iscsi ping
> timeout expired and the connection teardown started correct?

I do remember that the ping timed out and the connection was torn down
according to the messages.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
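
For anyone following along, the lost-ack ordering Sagi describes above can be
sketched as a toy model. This is only an illustration of the event sequence,
not a claim about how the device actually behaves: ToyHca, its methods, and
the "mkey_free_error" string are all made-up names, and nothing here
corresponds to a real verbs or mlx5 interface.

```python
from dataclasses import dataclass, field


@dataclass
class ToyHca:
    """Hypothetical stand-in for the host HCA; not a real verbs or mlx5 API."""
    mkey_free: dict = field(default_factory=dict)  # mkey -> True once invalidated

    def register_mr(self, mkey):
        # Host registers the MR; the mkey is now in use (free == 0).
        self.mkey_free[mkey] = False

    def remote_invalidate(self, mkey):
        # Models the host HCA processing the target's SEND_WITH_INVALIDATE.
        self.mkey_free[mkey] = True

    def transmit_from(self, mkey):
        # Any (re)transmission has to read the payload through the MR.
        # Unknown or invalidated mkeys are treated as free (assumption).
        return "mkey_free_error" if self.mkey_free.get(mkey, True) else "ok"


def simulate_lost_ack_race():
    hca = ToyHca()
    mr_a = 0xA  # MR (a) from the scenario above; the value is arbitrary
    hca.register_mr(mr_a)

    # Host posts send (1) from MR (a); the first transmission goes out fine.
    events = [("send (1), first transmission", hca.transmit_from(mr_a))]

    # The ack for send (1) is lost, so the host still considers it in flight.
    # Meanwhile the target finishes and invalidates MR (a) remotely.
    hca.remote_invalidate(mr_a)

    # Host HCA times out on send (1) and retries from the same MR.
    events.append(("send (1), retry after timeout", hca.transmit_from(mr_a)))
    return events


if __name__ == "__main__":
    for step, result in simulate_lost_ack_race():
        print(f"{step}: {result}")
```

In this model the retry is the only step that touches the MR after the
invalidate, which is the window being asked about. Whether a real device
closes that window is exactly the open question in the thread.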