Nice catch, Haakon!

-Jack

On Tue, 20 Jun 2017 14:07:50 +0200
Håkon Bugge <Haakon.Bugge@xxxxxxxxxx> wrote:

> CM REQs cannot be successfully retried, because a new pv_cm_id is
> created for each request, without checking if one already exists.
>
> By checking if an id exists before creating one, the bug is fixed.
>
> This bug can be provoked by running an RDMA CM user-land application,
> but inserting a five seconds delay before the rdma_accept() call on
> the passive side. This delay is larger than the default CMA timeout,
> and triggers a retry from the active side. The retried REQ will use
> another pv_cm_id (the cm_id on the wire). This confuses the CM
> protocol and two REJs are sent from the passive side.
>
> Here is an excerpt from ibdump running without the patch:
>
> 3.285092 LID: 4 -> LID: 4 SDP 290 CM: ConnectRequest(SDP Hello)
> 7.382711 LID: 4 -> LID: 4 SDP 290 CM: ConnectRequest(SDP Hello)
> 7.382861 LID: 4 -> LID: 4 InfiniBand 290 CM: ConnectReject
> 7.387644 LID: 4 -> LID: 4 InfiniBand 290 CM: ConnectReject
>
> and here is the same with bug fix applied:
>
> 3.251010 LID: 4 -> LID: 4 SDP 290 CM: ConnectRequest(SDP Hello)
> 7.349387 LID: 4 -> LID: 4 SDP 290 CM: ConnectRequest(SDP Hello)
> 8.258443 LID: 4 -> LID: 4 SDP 290 CM: ConnectReply(SDP Hello)
> 8.259890 LID: 4 -> LID: 4 InfiniBand 290 CM: ReadyToUse
>
> Suggested-by: Venkat Venkatsubra <venkat.x.venkatsubra@xxxxxxxxxx>
> Signed-off-by: Håkon Bugge <haakon.bugge@xxxxxxxxxx>
> Reported-by: Wei Lin Guay <wei.lin.guay@xxxxxxxxxx>
> Tested-by: Wei Lin Guay <wei.lin.guay@xxxxxxxxxx>
> Reviewed-by: Yuval Shaia <yuval.shaia@xxxxxxxxxx>

Acked-by: Jack Morgenstein <jackm@xxxxxxxxxxxxxxxxxx>

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
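
For readers following along, the "check if an id exists before creating one" idea from the commit message can be sketched roughly as below. This is only an illustration, not the patch itself: the table, the struct, and the names id_map_find(), id_map_create() and pv_cm_id_for_req() are hypothetical, and the assumption that a request is keyed by a (slave_id, sl_cm_id) pair is mine, not something stated in the mail.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table mapping a requester's (slave_id, sl_cm_id) pair to the
 * pv_cm_id that goes on the wire. Names and sizes are illustrative only. */
#define MAX_IDS 16

struct id_map_entry {
    int      in_use;
    int      slave_id;
    uint32_t sl_cm_id;
    uint32_t pv_cm_id;
};

static struct id_map_entry id_table[MAX_IDS];
static uint32_t next_pv_cm_id = 1;   /* 0 is reserved to mean "no id" */

/* Return the existing entry for this requester, or NULL if there is none. */
static struct id_map_entry *id_map_find(int slave_id, uint32_t sl_cm_id)
{
    for (size_t i = 0; i < MAX_IDS; i++) {
        struct id_map_entry *e = &id_table[i];
        if (e->in_use && e->slave_id == slave_id && e->sl_cm_id == sl_cm_id)
            return e;
    }
    return NULL;
}

/* Claim a free slot and assign a fresh pv_cm_id; NULL if the table is full. */
static struct id_map_entry *id_map_create(int slave_id, uint32_t sl_cm_id)
{
    for (size_t i = 0; i < MAX_IDS; i++) {
        struct id_map_entry *e = &id_table[i];
        if (!e->in_use) {
            e->in_use = 1;
            e->slave_id = slave_id;
            e->sl_cm_id = sl_cm_id;
            e->pv_cm_id = next_pv_cm_id++;
            return e;
        }
    }
    return NULL;
}

/* Look up first, create only on a miss, so a retried REQ reuses the same
 * pv_cm_id instead of getting a new one. Returns 0 if no id is available. */
uint32_t pv_cm_id_for_req(int slave_id, uint32_t sl_cm_id)
{
    struct id_map_entry *e = id_map_find(slave_id, sl_cm_id);

    if (!e)
        e = id_map_create(slave_id, sl_cm_id);
    return e ? e->pv_cm_id : 0;
}
```

With this shape, calling pv_cm_id_for_req() twice for the same requester yields the same wire id, which is the property the retried REQ needs; a create-always variant would hand the passive side two different ids for what is logically one connection attempt.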