Hello, Neal! Thanks for your reply and explanations. I agree with all your points, about safe defaults for both timeouts and the number of retries. But what the patch does is not changing the defaults, it only provides a way to work with these values through bpf, which is important in an environment that is way different from cellular networks. For example in the modern DC the rto_min value should correspond with real RTT, that definitely not 200ms. Also I add Alexander Azimov for further discussions. -- Dmitry > On 23 Jul 2021, at 17:41, Neal Cardwell <ncardwell@xxxxxxxxxx> wrote: > > .(On Fri, Jul 23, 2021 at 5:41 AM Dmitry Yakunin <zeil@xxxxxxxxxxxxxx> wrote: >> >> Commit ca584ba07086 ("tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt") >> adds ability to set rto_min value on socket less then default TCP_RTO_MIN. >> But retransmits_timed_out() function still uses TCP_RTO_MIN and >> tcp_retries{1,2} sysctls don't work properly for tuned socket values. >> >> Fixes: ca584ba07086 ("tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt") >> Signed-off-by: Dmitry Yakunin <zeil@xxxxxxxxxxxxxx> >> Acked-by: Dmitry Monakhov <dmtrmonakhov@xxxxxxxxxxxxxx> >> --- >> net/ipv4/tcp_timer.c | 7 ++++--- >> 1 file changed, 4 insertions(+), 3 deletions(-) >> >> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c >> index 20cf4a9..66c4b97 100644 >> --- a/net/ipv4/tcp_timer.c >> +++ b/net/ipv4/tcp_timer.c >> @@ -199,12 +199,13 @@ static unsigned int tcp_model_timeout(struct sock *sk, >> * @boundary: max number of retransmissions >> * @timeout: A custom timeout value. >> * If set to 0 the default timeout is calculated and used. >> - * Using TCP_RTO_MIN and the number of unsuccessful retransmits. >> + * Using icsk_rto_min value from socket or RTAX_RTO_MIN from route >> + * and the number of unsuccessful retransmits. >> * >> * The default "timeout" value this function can calculate and use >> * is equivalent to the timeout of a TCP Connection >> * after "boundary" unsuccessful, exponentially backed-off >> - * retransmissions with an initial RTO of TCP_RTO_MIN. >> + * retransmissions with an initial RTO of icsk_rto_min or RTAX_RTO_MIN. >> */ >> static bool retransmits_timed_out(struct sock *sk, >> unsigned int boundary, >> @@ -217,7 +218,7 @@ static bool retransmits_timed_out(struct sock *sk, >> >> start_ts = tcp_sk(sk)->retrans_stamp; >> if (likely(timeout == 0)) { >> - unsigned int rto_base = TCP_RTO_MIN; >> + unsigned int rto_base = tcp_rto_min(sk); >> >> if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) >> rto_base = tcp_timeout_init(sk); >> -- > > I would argue strenuously against this. We tried the approach in this > patch at Google years ago, but we had to revert that approach and go > back to using TCP_RTO_MIN as the baseline for the computation, because > using the custom tcp_rto_min(sk) caused serious reliability problems. > > The behavior in this patch causes various serious reliability problems > because the retransmits_timed_out() computation is used for various > timeout decisions that determine how long a connection tries to > retransmit something before deciding the path is bad and/or giving up > and closing the connection. Here are a few of the problems this > causes: > > (1) The biggest one is probably orphan retries. By default > tcp_orphan_retries() uses a retry count of 8. But if your min_rto is > 5ms (used at Google for many years), then the 8 retries means an > orphaned connection (whose fd is no longer held by a process, but is > still established) only lasts for 1.275 seconds before giving up and > closing. This means that connectivity problems longer than 1.275 > seconds (extremely common with cellular links) are not tolerated for > such connections; the connections often do not receive the data they > were supposed to receive. > > (2) TCP_RETR1 /sysctl_tcp_retries1, used for __dst_negative_advice(), > also has big problems. Even with a min_rto as big as 20ms, on a route > with 150ms RTT, the approach in this patch will cause > retransmits_timed_out() to return true upon the 1st RTO timer firing, > even though TCP_RETR1 is 3. > > (3) TCP_RETR2 /sysctl_tcp_retries2, with a default of 15, used for > regular connection retry lifetimes, also has a massive decrease in > robustness, due to falling from 109 minutes with a 200ms RTO, to > about 2.7 minutes with a min_rto of 5ms. > > neal