RE: [PATCH rdma-next] IB/cma: Define options to set CM timeouts and retries

Sean Hefty <shefty@xxxxxxxxxx> · Thu, 11 Apr 2024 17:15:39 +0000

> > A backoff timer can reduce retries.  I don't know how you decide what
> > the initial backoff should be.  I was going with what seems to be the
> > behavior with tcp.  Maybe the backoff adjusts based on IB vs RoCE.
> 
> Ok, I understand it now. So, with a retry count of 5 and an initial timeout
> of 18 (~1s), this would make:
> 
> connect_timeout = 1 + 2 + 4 + 8 + 16 + 32 = 63s
> connect_timeout = initial * (2^(retries + 1) - 1)

Correct - plus random additional time added in to stagger bursts.
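To make the arithmetic concrete, here is a minimal sketch of that schedule. The function names and the jitter_s parameter are hypothetical, not taken from the patch, and the uniform jitter is only an assumption about how bursts might be staggered.

```python
import random


def backoff_schedule(initial_s, retries, jitter_s=0.0, rng=random):
    """Per-attempt waits: the first send plus `retries` retransmissions.

    Each wait doubles the previous one; an optional uniform jitter in
    [0, jitter_s] is added to every wait (assumed staggering scheme).
    """
    return [initial_s * (2 ** i) + rng.uniform(0.0, jitter_s)
            for i in range(retries + 1)]


def total_connect_timeout(initial_s, retries):
    """Closed form without jitter: initial * (2^(retries + 1) - 1)."""
    return initial_s * (2 ** (retries + 1) - 1)


print(backoff_schedule(1, 5))       # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
print(total_connect_timeout(1, 5))  # 63
```

With jitter enabled, the total becomes 63s plus up to 6 * jitter_s, so the closed form is a lower bound rather than an exact value.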

> > > I don't think that most users need to tune those parameters. But if
> > > some use cases require a smaller connection timeout, this should be
> > > available.
> > >
> > > I agree that finding a common ground to adjust the defaults would be
> > > better but this can be challenging and break non-common fabrics or
> > > use cases.
> >
> > IMO, if we can improve that out of the box experience, that would be ideal.
> > I agree that there will always be situations where the kernel defaults
> > are not optimal and either require changing them system wide, or
> > possibly per rdma_cm_id.
> >
> > If we believe that switching to a backoff retry timer is a better
> > direction or should be an option, does that change the approach for this
> > patch?  A retry count still makes sense, but the timeout is more complex.
> > On that note, I would specify a timeout in something straightforward,
> > like milliseconds.
> 
> An exponential backoff timer seems to be a good solution to reduce temporary
> contention (when several nodes reconnect simultaneously).
> But it makes the overall connection timeout more complex. That's why you
> don't want to expose the initial CM timeout to the user.
> 
> So, if I follow you here, you suggest exposing only a "connection timeout in
> ms" to the user and determining a retry count from that.

Not quite.  I agree with you and wouldn't go this route.

I was saying *if* we expose a timeout value, that we use ms or seconds, not Infiniband Bizarre Time.
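For reference, the IB encoding expresses a timeout as 4.096 us * 2^value, which is why 18 works out to roughly 1s. A rough sketch of the conversion in both directions follows; the helper names are hypothetical, and the cap at 31 assumes a 5-bit field.

```python
IB_TIMEOUT_BASE_US = 4.096  # IB timeouts are 4.096 us * 2^value
IB_TIMEOUT_MAX = 31         # assumption: the encoded value is a 5-bit field


def ib_timeout_to_ms(encoded):
    """Convert an encoded IB timeout value to milliseconds."""
    return IB_TIMEOUT_BASE_US * (2 ** encoded) / 1000.0


def ms_to_ib_timeout(ms):
    """Smallest encoded value whose timeout is at least `ms` (capped)."""
    encoded = 0
    while encoded < IB_TIMEOUT_MAX and ib_timeout_to_ms(encoded) < ms:
        encoded += 1
    return encoded


print(round(ib_timeout_to_ms(18), 1))  # 1073.7
print(ms_to_ib_timeout(1000))          # 18
```

Note that the encoding can only represent powers of two, so a request for 1000 ms is rounded up to about 1074 ms. Taking plain ms or seconds at the interface would keep that rounding out of the user's view.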

The main point is to avoid exposing options that assume a linear retry timeout.

> For example, if a user defines a timeout of 50s (with an initial timeout of 1s),
> we should configure 4 retries. But this would make an effective timeout of 31s.
> 
> I don't like that idea because it hides what is actually done:
> A user will set a value in ms and he could have several seconds or minutes of
> difference with what he expects.
> 
> So, I would prefer the kernel TCP way. They defined "tcp_retries2" to configure
> the maximum number of retransmissions for an active connection.
> The initial timeout value is not configurable (TCP_RTO_MIN). And the
> retransmission timeout is between TCP_RTO_MIN (200ms) and
> TCP_RTO_MAX (120s).

I prefer the TCP way as well, including a way to configure the system min/max timeouts in case the defaults don't work in some environment.  Having a per rdma_cm_id option to change the number of retries seems reasonable.  Personally, I'd like it so that apps never need to touch it.  Trying to expose a timeout value is more difficult if we switch to using a backoff retry timer.
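As an illustration only, a TCP-like scheme with system-wide min/max bounds and a retry count could look like the sketch here. All names are hypothetical and nothing in it reflects an existing rdma_cm interface; it simply doubles the wait from the minimum and clamps it at the maximum.

```python
def clamped_backoff(min_s, max_s, retries):
    """Waits for the first send plus `retries` retransmissions.

    The wait starts at min_s, doubles each time, and never exceeds max_s.
    """
    return [min(min_s * (2 ** i), max_s) for i in range(retries + 1)]


def retries_within(target_s, min_s, max_s):
    """Largest retry count whose total wait stays within target_s.

    Returns (retries, total_wait). Assumes min_s > 0. The first send's
    wait is always counted, so total_wait can exceed a target < min_s.
    """
    total = min_s
    retries = 0
    while True:
        next_wait = min(min_s * (2 ** (retries + 1)), max_s)
        if total + next_wait > target_s:
            return retries, total
        total += next_wait
        retries += 1


print(clamped_backoff(1, 8, 5))    # [1, 2, 4, 8, 8, 8]
print(retries_within(50, 1, 120))  # (4, 31)
```

The second call reproduces the example from earlier in the thread: a 50s request with a 1s initial timeout fits only 4 retries, for an effective 31s. That gap is the reason a retry count is easier to expose honestly than a timeout value.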

- Sean