RE: [PATCH rdma-next] IB/cma: Define options to set CM timeouts and retries

Sean Hefty <shefty@xxxxxxxxxx> · Tue, 9 Apr 2024 14:44:11 +0000

> > I was thinking of aligning closer with the behavior of the TCP stack, plus a
> > couple other adjustments.
> >
> > a. Reduce the hard-coded CM retries from 15 down to 6.
> 15 retries is the maximum (the field size is 4 bits). According to the
> documentation that I read, there is no minimum retries required. So I
> don't know why this is statically defined to 15. Maybe this is related
> to hardware interoperability/compatibility.

15 is the max defined by the IB spec.  With a linear retry timeout, setting this
to the max can make sense.  But if we switched to using a backoff timer, I believe
we can get by with a smaller value.
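
To make the retry-count point concrete, here is a rough Python sketch. It is
not from the patch or the kernel; the function names and the jitter parameter
are made up for illustration. It lists the per-attempt timeouts for a doubling
backoff and counts how many fixed-timeout attempts would be needed to cover
the same window. It assumes one timeout per transmission, i.e. the initial
send plus each retry.

```python
import math
import random

def backoff_schedule(initial_s, retries, jitter=0.0):
    """Timeouts (seconds) for the initial send plus each retry, doubling each time.

    jitter=0.1 stretches every timeout by a random factor in [1.0, 1.1].
    """
    return [initial_s * 2 ** i * (1.0 + random.uniform(0.0, jitter))
            for i in range(retries + 1)]

def linear_attempts_needed(window_s, timeout_s):
    """Number of fixed-timeout attempts needed to cover window_s seconds."""
    return math.ceil(window_s / timeout_s)

schedule = backoff_schedule(1.0, 6)   # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
window = sum(schedule)                # 127.0 seconds without jitter
print(schedule, window, linear_attempts_needed(window, 1.0))
```

Under these assumptions, 7 transmissions with backoff span the same window
that would take 127 transmissions at a fixed 1s timeout, which is why a
smaller retry count may be enough.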

> > b. Reduce the hard-coded CM response timeout from 20 (4s) to 18 (1s).
> NVIDIA MOFED sets the CM timeout to 22 (17s) instead of 20. This
> results in an overall connection timeout of 5 min for an unreachable node.
> 
> Some patches seem to argue that 20 is too short:
> https://lore.kernel.org/lkml/20190217170909.1178575-1-haakon.bugge@xxxxxxxxxx/

A backoff timer can reduce the number of retries.  I don't know how you decide
what the initial backoff should be.  I was going with what seems to be the
behavior with TCP.  Maybe the backoff adjusts based on IB vs RoCE.

In any case, a 5-minute timeout seems unfriendly.
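
For reference, the numbers in parentheses above follow from the IB timeout
encoding, 4.096 us * 2^value. A small Python sketch of the arithmetic (the
function name is made up, and it ignores any additional adjustment the CM may
apply on top of the encoded value):

```python
def cm_timeout_seconds(exponent):
    """Nominal IB timeout for an encoded value: 4.096 us * 2**exponent."""
    return 4.096e-6 * (2 ** exponent)

for exp in (18, 20, 22):
    print(exp, round(cm_timeout_seconds(exp), 2))   # 1.07, 4.29, 17.18

# Assumption: total wait = (initial send + 15 retries) * per-attempt timeout.
total_22 = cm_timeout_seconds(22) * (15 + 1)
print(round(total_22, 1), "s =", round(total_22 / 60, 1), "min")
```

With a value of 22 and 15 retries this gives roughly 275s, which is in line
with the ~5 minutes reported.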

> > c. Switch CM MADs to use exponential backoff timeouts (1s, 2s, 4s, 8s, etc. +
> > random variation)
> > d. Selectively send MRA responses -- only in response to a REQ
> > e. Finally, add tunables to the above options for recovery purposes.
> >
> > Most of the issues are common to RoCE and IB.  Changes a, b, & c are based
> > on my system's TCP defaults, but my goal was to get timeouts down to about
> > 1 minute.  Change d should help address problem 2.
> Please note that we don't hit this timeout issue on an InfiniBand
> network with an unreachable node.
> InfiniBand has an SM, so rdma_resolve_route() fails before rdma_connect()
> for an unreachable node: the SM returns an empty "path record".

I guess this depends on when the node goes down and how quickly the SM
can identify it.  But this does suggest that having separate defaults for IB vs
RoCE may be necessary.

> Maybe there are some other ways to mitigate that RoCE issue.
> For example, when I was troubleshooting this issue, I saw that the RoCE HCA
> received ICMP "Destination unreachable" packets for the CM requests. So,
> maybe we could listen for those messages and abort the connection process.

Hmm.. I wonder what it would take to do this.

> > If the expectation is that most users will want to change the timeout/retry,
> > which I think would be the case, then adjusting the defaults may avoid the
> > overhead of setting them on every cm_id.  The ability to set the values as
> > proposed can substitute for change e, but may require users to update their
> > librdmacm.
> 
> Our particular use case is the Lustre RoCE/InfiniBand module with a
> RoCE network.
> Lustre relies on "pings" to monitor nodes/services/routes; this is used,
> for example, for High Availability or the selection of Lustre network
> routes.
> For Lustre FS, taking longer than 1 min to detect an unreachable node
> is not acceptable.
> 
> More information can be found here:
> https://jira.whamcloud.com/browse/LU-17480
> 
> I don't think that most users need to tune those parameters. But if
> some use cases require a smaller connection timeout, this should be
> available.
> 
> I agree that finding a common ground to adjust the defaults would be
> better but this can be challenging and break non-common fabrics or use
> cases.

IMO, if we can improve the out-of-the-box experience, that would be ideal.
I agree that there will always be situations where the kernel defaults are
not optimal and need to be changed, either system wide or possibly
per rdma_cm_id.

If we believe that switching to a backoff retry timer is a better direction
or should be an option, does that change the approach for this patch?
A retry count still makes sense, but the timeout is more complex.  On that
note, I would specify a timeout in something straightforward, like milliseconds.
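
As a rough illustration of why milliseconds are easier to reason about than
the encoded value, here is a Python sketch of the conversion a caller would
otherwise have to do by hand. The helper name is hypothetical, and the 5-bit
field limit (max value 31) is my assumption about the CM response timeout
field; none of this is code from the patch.

```python
import math

IB_TIMEOUT_BASE_US = 4.096   # IB timeouts are encoded as 4.096 us * 2**value
IB_TIMEOUT_MAX_EXP = 31      # assumption: 5-bit field

def ms_to_ib_exponent(timeout_ms):
    """Smallest encoded value whose nominal timeout is at least timeout_ms."""
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    ratio = timeout_ms * 1000.0 / IB_TIMEOUT_BASE_US
    exponent = max(0, math.ceil(math.log2(ratio)))
    return min(exponent, IB_TIMEOUT_MAX_EXP)

print(ms_to_ib_exponent(1000))   # 18
print(ms_to_ib_exponent(4000))   # 20
```

Note that the encoding can only express powers of two, so a request for
1000ms becomes about 1074ms, and an exact value in milliseconds would need to
be carried separately.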

- Sean