RE: Trying to reduce NFSv4 timeouts to a few seconds on an established connection

Andrew Klaassen <andrew.klaassen@xxxxxxxxxxxxxx> · Thu, 26 Jan 2023 22:08:02 +0000

> From: Andrew Klaassen <andrew.klaassen@xxxxxxxxxxxxxx>
> Sent: Thursday, January 26, 2023 10:32 AM
> 
> > From: Andrew Klaassen <andrew.klaassen@xxxxxxxxxxxxxx>
> > Sent: Monday, January 23, 2023 11:31 AM
> >
> > Hello,
> >
> > There's a specific NFSv4 mount on a specific machine which we'd like
> > to time out and return an error after a few seconds if the server goes away.
> >
> > I've confirmed the following on two different kernels,
> > 4.18.0-348.12.2.el8_5.x86_64 and 6.1.7-200.fc37.x86_64.
> >
> > I've been able to get both autofs and the mount command to cooperate,
> > so that the mount attempt fails after an arbitrary number of seconds.
> > This mount command, for example, will fail after 6 seconds, as
> > expected based on the timeo=20,retrans=2,retry=0 options:
> >
> > $ time sudo mount -t nfs4 -o \
> > rw,relatime,sync,vers=4.2,rsize=131072,wsize=131072,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,noac,proto=tcp,timeo=20,retrans=2,retry=0,sec=sys \
> > thor04:/mnt/thorfs04 /mnt/thor04
> > mount.nfs4: Connection timed out
> >
> > real    0m6.084s
> > user    0m0.007s
> > sys     0m0.015s
> >
> > However, if the share is already mounted and the server goes away, the
> > timeout is always 2 minutes plus the time I expect based on timeo and
> > retrans.  In this case, 2 minutes and 6 seconds:
> >
> > $ time ls /mnt/thor04
> > ls: cannot access '/mnt/thor04': Connection timed out
> >
> > real    2m6.025s
> > user    0m0.003s
> > sys     0m0.000s
> >
> > Watching the outgoing packets in the second case, the pattern is
> > always the same:
> >  - 0.2 seconds between the first two, then doubling each time until
> > the two minute mark is exceeded (so the last NFS packet, which is
> > always the 11th packet, is sent around 1:45 after the first).
> >  - Then some generic packets that start exactly-ish on the two minute
> > mark, 1 second between the first two, then doubling each time.  (By
> > this time the NFS command has given up.)
> >
> > 11:10:21.898305 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.], seq
> > 14452:14652, ack 18561, win 501, options [nop,nop,TS val 834889483 ecr
> > 1589769203], length 200: NFS request xid 3614904256 196 getattr fh
> > 0,2/53
> > 11:10:22.105189 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.], seq
> > 14452:14652, ack 18561, win 501, options [nop,nop,TS val 834889690 ecr
> > 1589769203], length 200: NFS request xid 3614904256 196 getattr fh
> > 0,2/53
> > 11:10:22.313290 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.], seq
> > 14452:14652, ack 18561, win 501, options [nop,nop,TS val 834889898 ecr
> > 1589769203], length 200: NFS request xid 3614904256 196 getattr fh
> > 0,2/53
> > 11:10:22.721269 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.], seq
> > 14452:14652, ack 18561, win 501, options [nop,nop,TS val 834890306 ecr
> > 1589769203], length 200: NFS request xid 3614904256 196 getattr fh
> > 0,2/53
> > 11:10:23.569192 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.], seq
> > 14452:14652, ack 18561, win 501, options [nop,nop,TS val 834891154 ecr
> > 1589769203], length 200: NFS request xid 3614904256 196 getattr fh
> > 0,2/53
> > 11:10:25.233212 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.], seq
> > 14452:14652, ack 18561, win 501, options [nop,nop,TS val 834892818 ecr
> > 1589769203], length 200: NFS request xid 3614904256 196 getattr fh
> > 0,2/53
> > 11:10:28.497282 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.], seq
> > 14452:14652, ack 18561, win 501, options [nop,nop,TS val 834896082 ecr
> > 1589769203], length 200: NFS request xid 3614904256 196 getattr fh
> > 0,2/53
> > 11:10:35.025219 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.], seq
> > 14452:14652, ack 18561, win 501, options [nop,nop,TS val 834902610 ecr
> > 1589769203], length 200: NFS request xid 3614904256 196 getattr fh
> > 0,2/53
> > 11:10:48.337201 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.], seq
> > 14452:14652, ack 18561, win 501, options [nop,nop,TS val 834915922 ecr
> > 1589769203], length 200: NFS request xid 3614904256 196 getattr fh
> > 0,2/53
> > 11:11:14.449303 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.], seq
> > 14452:14652, ack 18561, win 501, options [nop,nop,TS val 834942034 ecr
> > 1589769203], length 200: NFS request xid 3614904256 196 getattr fh
> > 0,2/53
> > 11:12:08.721251 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.], seq
> > 14452:14652, ack 18561, win 501, options [nop,nop,TS val 834996306 ecr
> > 1589769203], length 200: NFS request xid 3614904256 196 getattr fh
> > 0,2/53
> > 11:12:22.545394 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags [S], seq
> > 1375256951, win 64240, options [mss 1460,sackOK,TS val 835010130 ecr
> > 0,nop,wscale 7], length 0
> > 11:12:23.570199 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags [S], seq
> > 1375256951, win 64240, options [mss 1460,sackOK,TS val 835011155 ecr
> > 0,nop,wscale 7], length 0
> > 11:12:25.617284 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags [S], seq
> > 1375256951, win 64240, options [mss 1460,sackOK,TS val 835013202 ecr
> > 0,nop,wscale 7], length 0
> > 11:12:29.649219 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags [S], seq
> > 1375256951, win 64240, options [mss 1460,sackOK,TS val 835017234 ecr
> > 0,nop,wscale 7], length 0
> > 11:12:37.905274 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags [S], seq
> > 1375256951, win 64240, options [mss 1460,sackOK,TS val 835025490 ecr
> > 0,nop,wscale 7], length 0
> > 11:12:54.289212 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags [S], seq
> > 1375256951, win 64240, options [mss 1460,sackOK,TS val 835041874 ecr
> > 0,nop,wscale 7], length 0
> > 11:13:26.545304 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags [S], seq
> > 1375256951, win 64240, options [mss 1460,sackOK,TS val 835074130 ecr
> > 0,nop,wscale 7], length 0
> >
> > I tried changing tcp_retries2 as suggested in another thread from this list:
> >
> > # echo 3 > /proc/sys/net/ipv4/tcp_retries2
> >
> > ...but it made no difference on either kernel.  The 2 minute timeout
> > also doesn't seem to match with what I'd calculate from the initial
> > value of tcp_retries2, which should give a much higher timeout.
> >
> > The only clue I've been able to find is in the retry=n entry in the
> > NFS manpage:
> >
> > "For TCP the default is 3 minutes, but system TCP connection timeouts
> > will sometimes limit the timeout of each retransmission to around 2
> > minutes."
> >
> > What I'm not able to make sense of:
> >  - The retry option says that it applies to mount operations, not
> > read/write operations.  However, in this case I'm seeing the 2 minute
> > delay on read/write operations but *not* mount operations.
> >  - A couple of hours of searching didn't lead me to any kernel
> > settings that would result in a 2 minute timeout.
> >
> > Does anyone have any clues about a) what's happening and b) how to get
> > our desired behaviour of being able to control both mount and
> > read/write timeouts down to a few seconds?
> >
> > Thanks.
> 
> I thought that changing TCP_RTO_MAX in include/net/tcp.h from 120 to
> something smaller and recompiling the kernel would change the 2 minute
> timeout, but it had no effect.  I'm going to keep poking through the kernel
> code to see if there's a knob I can turn to change the 2 minute timeout, so
> that I can at least understand where it's coming from.
> 
> Any hints as to where I should be looking?

I believe I've made some progress with this today:

 - Calls to rpc_create() from fs/nfs/client.c pass an rpc_timeout struct in their args.
 - rpc_create() does *not* pass that timeout on to xprt_create_transport(), so it never reaches xs_setup_tcp().
 - xs_setup_tcp(), having no timeout passed to it, uses xs_tcp_default_timeout instead.
 - Changing xs_tcp_default_timeout changes the "ls" timeout behaviour I described above.
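
To make the selection logic in the list above concrete, here is a minimal userspace sketch, not kernel code. The structs are simplified stand-ins for struct rpc_timeout and struct xprt_create, values are in seconds rather than jiffies, and pick_timeout() and demo_default_timeout are hypothetical names with placeholder values. The sketch assumes struct xprt_create gains a timeout member, and it keeps the default as a fallback, because a caller that leaves the timeout unset would otherwise hand the transport a NULL pointer that gets dereferenced a few lines later.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's struct rpc_timeout.
 * Values are in seconds here, not jiffies. */
struct rpc_timeout {
	unsigned long to_initval;  /* initial timeout */
	unsigned long to_maxval;   /* upper bound on the timeout */
	unsigned int  to_retries;  /* retransmissions before a major timeout */
};

/* Simplified stand-in for struct xprt_create, including the timeout
 * member that the change assumes exists. */
struct xprt_create {
	const struct rpc_timeout *timeout;  /* NULL when the caller sets none */
};

/* Placeholder default for illustration only; these are not claimed to be
 * the values of xs_tcp_default_timeout. */
static const struct rpc_timeout demo_default_timeout = {
	.to_initval = 60,
	.to_maxval  = 60,
	.to_retries = 2,
};

/* Hypothetical helper: prefer the caller's timeout, otherwise fall back
 * to the default so later dereferences are always safe. */
static const struct rpc_timeout *pick_timeout(const struct xprt_create *args)
{
	if (args->timeout != NULL)
		return args->timeout;
	return &demo_default_timeout;
}
```

In the transport setup this would amount to assigning the result of such a check to xprt->timeout before anything reads xprt->timeout->to_maxval.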

In theory all of this means that the timeout simply needs to be passed through and used instead of xs_tcp_default_timeout.  I'm going to give this a try tomorrow.

Here's what I'm going to try first; I'm no C programmer, though, so any advice or corrections you might have would be appreciated.

Thanks.

Andrew

diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 0b0b9f1eed46..1350c1f489f7 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -532,6 +532,7 @@ struct rpc_clnt *rpc_create(struct rpc_create_args *args)
                .addrlen = args->addrsize,
                .servername = args->servername,
                .bc_xprt = args->bc_xprt,
+               .timeout = args->timeout,
        };
        char servername[48];
        struct rpc_clnt *clnt;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index aaa5b2741b79..adc79d94b59e 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -3003,7 +3003,7 @@ static struct rpc_xprt *xs_setup_tcp(struct xprt_create *args)
        xprt->idle_timeout = XS_IDLE_DISC_TO;

        xprt->ops = &xs_tcp_ops;
-       xprt->timeout = &xs_tcp_default_timeout;
+       xprt->timeout = args->timeout;

        xprt->max_reconnect_timeout = xprt->timeout->to_maxval;
        xprt->connect_timeout = xprt->timeout->to_initval *