Re: [PATCH rdma-next] IB/cma: Define options to set CM timeouts and retries

Etienne AUJAMES <eaujames@xxxxxxx> · Tue, 9 Apr 2024 15:07:43 +0200

> I was thinking of aligning closer with the behavior of the TCP stack, plus a couple other adjustments.
> 
> a. Reduce the hard-coded CM retries from 15 down to 6.
15 retries is the maximum (the field is 4 bits wide). According to the
documentation I have read, no minimum number of retries is required, so
I don't know why this is statically set to 15. Maybe it is related to
hardware interoperability/compatibility.

> b. Reduce the hard-coded CM response timeout from 20 (4s) to 18 (1s).
NVIDIA MOFED sets the CM timeout to 22 (17s) instead of 20. This gives
an overall connection timeout of about 5 min for an unreachable node.
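
For reference, here is a rough sketch of the arithmetic behind those
numbers. It assumes the usual IB encoding of timeout fields
(4.096 us * 2^value) and that each of the 1 + retries sends waits one
full response timeout; it ignores packet lifetime and any other terms
the CM adds, so treat the totals as approximations. The function names
are only illustrative.

```python
# Assumption: CM timeout values encode 4.096 us * 2^value.
CM_TIMEOUT_UNIT_S = 4.096e-6


def cm_timeout_seconds(exponent):
    """Convert an encoded CM timeout value to seconds."""
    return CM_TIMEOUT_UNIT_S * (1 << exponent)


def worst_case_connect_seconds(exponent, retries):
    """Approximate time before giving up on an unreachable node.

    Assumes the initial send plus `retries` resends each wait one
    full timeout (a simplification, not the exact kernel logic).
    """
    return (1 + retries) * cm_timeout_seconds(exponent)


if __name__ == "__main__":
    for exponent in (18, 20, 22):
        per_try = cm_timeout_seconds(exponent)
        total = worst_case_connect_seconds(exponent, retries=15)
        print(f"timeout={exponent}: {per_try:.2f}s per try, "
              f"~{total / 60:.1f} min with 15 retries")
```

With 15 retries, a timeout of 22 lands a little under 5 min, and a
timeout of 20 a little over 1 min, under the assumptions above.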

Some patches seem to argue that 20 is too short:
https://lore.kernel.org/lkml/20190217170909.1178575-1-haakon.bugge@xxxxxxxxxx/

> c. Switch CM MADs to use exponential backoff timeouts (1s, 2s, 4s, 8s, etc. + random variation)
> d. Selectively send MRA responses -- only in response to a REQ
> e. Finally, add tunables to the above options for recovery purposes.
>
> Most of the issues are common to RoCE and IB.  Changes a, b, & c are based on my system's TCP defaults, but my goal was to get timeouts down to about 1 minute.  Change d should help address problem 2.
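
To make proposal c concrete, here is a minimal sketch of an
exponential backoff schedule. The 1s base and the doubling come from
the proposal; the jitter range and the function name are my own
assumptions, and whether the initial send counts toward the retry
limit is left open.

```python
import random


def backoff_schedule(timeouts, base_s=1.0, jitter=0.0):
    """Return the wait before each resend: base, 2*base, 4*base, ...

    `jitter` is a hypothetical knob: each wait is scaled by a random
    factor in [1 - jitter, 1 + jitter]. With jitter=0 the schedule
    is deterministic.
    """
    waits = []
    for attempt in range(timeouts):
        wait = base_s * (2 ** attempt)
        if jitter:
            wait *= random.uniform(1.0 - jitter, 1.0 + jitter)
        waits.append(wait)
    return waits


if __name__ == "__main__":
    waits = backoff_schedule(6)
    print(waits, "total:", sum(waits), "s")
```

Six timed-out sends without jitter add up to 63s, which is in line
with the stated goal of about 1 minute.
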
Please note that we don't hit this timeout issue on an Infiniband
network with an unreachable node.
Infiniband has an SM, so rdma_resolve_route() fails before
rdma_connect() for an unreachable node: the SM returns an empty
"path record".

Maybe there are some other ways to mitigate that RoCE issue.
For example, when I troubleshot this issue, I saw that the RoCE HCA
received ICMP "Destination unreachable" packets for the CM requests. So,
maybe we could listen to those messages and abort the connection process.

> If the expectation is that most users will want to change the timeout/retry, which I think would be the case, then adjusting the defaults may avoid the overhead of setting them on every cm_id.  The ability to set the values as proposed can substitute for change e, but may require users update their librdamcm.

Our particular use case is the Lustre RoCE/Infiniband module with a
RoCE network.
Lustre relies on "pings" to monitor nodes/services/routes; this is
used, for example, for High Availability or for the selection of Lustre
network routes.
For Lustre FS, taking longer than 1 min to detect an unreachable node
is not acceptable.

More information can be found here:
https://jira.whamcloud.com/browse/LU-17480

I don't think that most users need to tune those parameters. But if
some use cases require a smaller connection timeout, this should be
available.

I agree that finding common ground to adjust the defaults would be
better, but this can be challenging and could break uncommon fabrics or
use cases.

This looks like the same situation as for "RDMA_OPTION_ID_ACK_TIMEOUT":
not all users need to alter PacketLifeTime for RoCE, but if the network
requires a larger value, they can set it (on Infiniband this value is
given by the SM).

Etienne