> > A backoff timer can reduce retries. I don't know how you decide what
> > the initial backoff should be. I was going with what seems to be the
> > behavior with TCP. Maybe the backoff adjusts based on IB vs RoCE.
>
> Ok, I understand it now. So, with 5 retries and an initial timeout of
> 18 (~1s), this would make:
>
> connect_timeout = 1 + 2 + 4 + 8 + 16 + 32 = 63s
> connect_timeout = initial * (2^(retries + 1) - 1)

Correct - plus random additional time added in to stagger bursts.

> > > I don't think that most users need to tune those parameters. But if
> > > some use cases require a smaller connection timeout, this should be
> > > available.
> > >
> > > I agree that finding a common ground to adjust the defaults would be
> > > better but this can be challenging and break non-common fabrics or
> > > use cases.
> >
> > IMO, if we can improve that out of the box experience, that would be ideal.
> > I agree that there will always be situations where the kernel defaults
> > are not optimal and either require changing them system wide, or
> > possibly per rdma_cm_id.
> >
> > If we believe that switching to a backoff retry timer is a better
> > direction or should be an option, does that change the approach for this
> > patch? A retry count still makes sense, but the timeout is more complex.
> > On that note, I would specify a timeout in something straightforward,
> > like milliseconds.
>
> An exponential backoff timer seems to be a good solution to reduce temporary
> contention (when several nodes reconnect simultaneously).
> But it makes the overall connection timeout more complex. That's why you
> don't want to expose the initial CM timeout to the user.
>
> So, if I follow you here, you suggest exposing only a "connection timeout in
> ms" to the user and determining a retry count from that.

Not quite. I agree with you and wouldn't go this route. I was saying *if* we
expose a timeout value, that we use ms or seconds, not Infiniband Bizarre Time.
The main point is to avoid exposing options that assume a linear retry timeout.

> For example, if a user defines a timeout of 50s (with an initial timeout of 1s),
> we should configure 4 retries. But this would make an effective timeout of 31s.
>
> I don't like that idea because it hides what is actually done:
> A user will set a value in ms and could see several seconds or minutes of
> difference from what they expect.
>
> So, I would prefer the kernel TCP way. They defined "tcp_retries2" to configure
> the maximum number of retransmissions for an active connection.
> The initial timeout value is not configurable (TCP_RTO_MIN). And the
> retransmission timeout is between TCP_RTO_MIN (200ms) and
> TCP_RTO_MAX (120s).

I prefer the TCP way as well, including a way to configure the system min/max
timeouts in case the defaults don't work in some environment. Having a per
rdma_cm_id option to change the number of retries seems reasonable. Personally,
I'd like it so that apps never need to touch it. Trying to expose a timeout
value is more difficult if we switch to using a backoff retry timer.

- Sean
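
For reference, the arithmetic discussed in this thread can be checked with a
small Python sketch. It is only an illustration, not kernel code: all function
names are hypothetical, the uniform jitter is an assumed way to "stagger
bursts", and the min/max clamp simply mirrors the TCP_RTO_MIN/TCP_RTO_MAX idea
quoted above. The only outside fact used is the IB timeout encoding of
4.096 us * 2^value, which is why a value of 18 is roughly 1s.

```python
import random

IB_TIMEOUT_BASE_US = 4.096  # IB encodes timeouts as 4.096 us * 2^value


def ib_timeout_to_seconds(value):
    """Convert an IB-encoded timeout value (e.g. 18) to seconds."""
    return IB_TIMEOUT_BASE_US * (2 ** value) / 1e6


def total_connect_timeout(initial_s, retries):
    """Total wait with doubling backoff: initial * (2^(retries + 1) - 1)."""
    return initial_s * (2 ** (retries + 1) - 1)


def retries_within(target_s, initial_s):
    """Largest retry count whose total timeout does not exceed target_s.

    Returns 0 if even one retry would overshoot the target.
    """
    retries = 0
    while total_connect_timeout(initial_s, retries + 1) <= target_s:
        retries += 1
    return retries


def backoff_schedule(initial_s, retries, min_s, max_s, jitter_s=0.0, rng=random):
    """Per-attempt waits: doubled each attempt, clamped to [min_s, max_s].

    jitter_s adds a uniform random delay in [0, jitter_s] to each wait
    (an assumption; the thread only says random time is added).
    """
    waits = []
    for attempt in range(retries + 1):
        wait = min(max(initial_s * 2 ** attempt, min_s), max_s)
        waits.append(wait + rng.uniform(0.0, jitter_s))
    return waits


if __name__ == "__main__":
    print(ib_timeout_to_seconds(18))       # roughly 1.07 s
    print(total_connect_timeout(1, 5))     # the 5-retry case from the thread
    print(retries_within(50, 1))           # retries that fit a 50 s target
    print(backoff_schedule(1, 5, 0.2, 120, jitter_s=0.5))
```

With a 1s initial timeout, a 50s target fits only 4 retries (31s total),
because a fifth retry would push the total to 63s. That gap is the reason a
single user-visible timeout in ms is misleading once the retry timer backs off
exponentially.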