> From: Hal Rosenstock [mailto:hal@xxxxxxxxxxxxxxxxxx]
> Sent: Thursday, November 16, 2017 4:13 PM
> To: Kalderon, Michal <Michal.Kalderon@xxxxxxxxxx>; Jason Gunthorpe
> <jgg@xxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx
> Cc: Elior, Ariel <Ariel.Elior@xxxxxxxxxx>; Amrani, Ram
> <Ram.Amrani@xxxxxxxxxx>; Radzi, Amit <Amit.Radzi@xxxxxxxxxx>; Hefty,
> Sean <sean.hefty@xxxxxxxxx>
> Subject: Re: rstream application
>
> On 11/16/2017 5:39 AM, Kalderon, Michal wrote:
> > Hi,
> >
> > We've been debugging an issue with the rstream application, would be glad
> > to get your help.
> > This application is part of the OFA logo program and therefore we've been
> > debugging it.
> > Intermittently we get an error: Connection refused (stale connection) on
> > the second connect in the test.
> > (rstream -S all -T a)
> > It looks like in some cases the server side gets a new connection
> > request before destroying the cm-id, leaving the remote id and remote
> > qp in the remote_id_table and remote_qp_table
>
> The connection goes into timewait state on disconnect. This timeout is 2
> * PathRecord:PacketLifeTime plus the remote's Ack Delay.
>
> RoCE spec says "The default value for SubnetTimeout shall be 18 and can be
> modified by Ethernet management practices. The default SubnetTimeout
> value can be used as an upper bound estimate of InfiniBand PacketLifeTime".
>
> A subsequent incoming REQ for same remote ID and remote QPN while in
> this state will result in REJ for stale connection. This is at CM and CMA
> levels. I am not sure whether or not the errors returned from rconnect are
> sufficient to isolate this case from other connection refusal cases. Sean
> would know.
>

I think the same issue exists in some other rsocket examples as well. In this case the application only knows that it was rejected, not that the connection is stale. But perhaps the application can retry several times to re-connect if it gets an error in any case?
(similar to perftest for instance)
Sean, Hal, do you agree with this direction?

Specifically for the rstream application, it seems that if the close event is called before the request handler gets a chance to process the new request, cm_enter_timewait is called and removes the remote_id and remote_qp from the tables. In the bad case, cm_enter_timewait is called later on. So it all comes down to a timing issue: whether the close event on the socket was called on the server before the new request was handled.

>
> -- Hal

thanks

> > Attached are two traces (using ftrace)
> > good_trace when second connect succeeds
> > bad_trace when second connect attempt fails
> >
> > I think this can be considered as an application issue, and rstream
> > could be modified to try and re-connect in case it fails.
> >
> > Your input on this will be highly appreciated,
> >
> > Thanks,
> > Michal
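
To make the retry idea concrete, here is a minimal sketch of what such a loop could look like. It is not taken from rstream. The names connect_with_retry, connect_fn, flaky_peer and flaky_connect are hypothetical, and the sketch assumes that the stale-connection REJ reaches the application as ECONNREFUSED (the "Connection refused" error above), which, as Hal notes, may not be distinguishable from other refusals. The connect attempt is passed in as a callback so the loop can be exercised without RDMA hardware.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback: performs one complete connect attempt.
 * Returns 0 on success, -1 with errno set on failure. */
typedef int (*connect_fn)(void *ctx);

/* Retry a connect attempt while it fails with ECONNREFUSED.
 * Returns the number of attempts used on success, or -1 with errno set.
 * Errors other than ECONNREFUSED are returned immediately. */
int connect_with_retry(connect_fn try_connect, void *ctx,
                       int max_attempts, long delay_ms)
{
    int attempt;

    for (attempt = 1; attempt <= max_attempts; attempt++) {
        if (try_connect(ctx) == 0)
            return attempt;
        if (errno != ECONNREFUSED)
            return -1;
        /* Give the peer time to finish tearing down the old connection. */
        if (attempt < max_attempts && delay_ms > 0) {
            struct timespec ts;
            ts.tv_sec = delay_ms / 1000;
            ts.tv_nsec = (delay_ms % 1000) * 1000000L;
            nanosleep(&ts, NULL);
        }
    }
    errno = ECONNREFUSED;
    return -1;
}

/* Hypothetical stand-in for a server that refuses the first few attempts. */
struct flaky_peer {
    int failures_left;
    int calls;
};

int flaky_connect(void *ctx)
{
    struct flaky_peer *peer = ctx;

    peer->calls++;
    if (peer->failures_left > 0) {
        peer->failures_left--;
        errno = ECONNREFUSED;
        return -1;
    }
    return 0;
}
```

In rstream the callback would presumably wrap rsocket(), rconnect() and, on failure, rclose() from <rdma/rsocket.h>, creating a fresh socket for each attempt. The attempt count and delay are placeholders; they would need to be chosen with the timewait duration Hal describes (2 * PacketLifeTime plus the remote's Ack Delay) in mind.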