> On Oct 3, 2023, at 11:28 AM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
> 
> On Tue, Oct 3, 2023 at 11:12 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>> 
>> On Tue, 2023-10-03 at 10:44 -0400, Olga Kornievskaia wrote:
>>> On Sat, Sep 30, 2023 at 7:06 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>> 
>>>> On Sat, 2023-09-30 at 18:36 -0400, Olga Kornievskaia wrote:
>>>>> On Fri, Sep 29, 2023 at 10:57 PM Trond Myklebust
>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>> 
>>>>>> On Thu, 2023-09-28 at 10:58 -0400, Olga Kornievskaia wrote:
>>>>>>> On Wed, Sep 27, 2023 at 3:35 PM <trondmy@xxxxxxxxxx> wrote:
>>>>>>>> 
>>>>>>>> From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
>>>>>>>> 
>>>>>>>> If the TCP connection attempt fails without ever establishing a
>>>>>>>> connection, then assume the problem may be that the server is
>>>>>>>> rejecting us due to port reuse.
>>>>>>> 
>>>>>>> Doesn't this break the 4.0 replay cache? It seems too general to
>>>>>>> assume that any unsuccessful SYN was due to a server reboot and
>>>>>>> that it's OK for the client to change the port.
>>>>>> 
>>>>>> This is where things get interesting. Yes, if we change the port
>>>>>> number, then it will almost certainly break NFSv3 and NFSv4.0
>>>>>> replay caching on the server.
>>>>>> 
>>>>>> However, the problem is that once we get stuck in the situation
>>>>>> where we cannot connect, each new connection attempt just causes
>>>>>> the server's TCP layer to push back and recall that the connection
>>>>>> from this port was closed.
>>>>>> IOW: once we're in this situation, we cannot easily exit without
>>>>>> doing one of the following. Either we have to
>>>>>> 
>>>>>> 1. Change the port number, so that the TCP layer allows us to
>>>>>>    connect.
>>>>>> 2. Or.. wait long enough that the TCP layer has forgotten
>>>>>>    altogether about the previous connection.
>>>>>> 
>>>>>> The problem is that option (2) is subject to livelock, and so has
>>>>>> a potentially infinite timeout. I've seen this livelock in action,
>>>>>> and I'm not seeing a solution that has predictable results.
>>>>>> 
>>>>>> So unless there is a solution for the problems in (2), I don't see
>>>>>> how we can avoid defaulting to option (1) at some point, in which
>>>>>> case the only question is "when do we switch ports?".
>>>>> 
>>>>> I'm not sure how one can justify that the regression that will come
>>>>> out of #1 will be less of a problem than the problem in #2.
>>>>> 
>>>>> I think I'm still not grasping why the NFS server would
>>>>> (legitimately) be closing a connection that is re-using the port.
>>>>> Can you present a sequence of events that would lead to this?
>>>>> 
>>>> 
>>>> Yes. It is essentially the problem described in this blog:
>>>> https://blog.davidvassallo.me/2010/07/13/time_wait-and-port-reuse/
>>>> 
>>>> ...and as you can see, it has nothing to do with NFS. This is the
>>>> TCP protocol working as expected.
>>> 
>>> What I'm seeing are statements that the RFC allows for/provides
>>> guidance on how to transition out of the TIME_WAIT state. I'm also
>>> hearing that the reason the server can't allow port reuse is a
>>> broken client implementation or the use of a (broken?) NAT
>>> implementation.
>>> 
>>> I don't see how any of this justifies allowing a regression in the
>>> Linux client code. I'm clearly missing something. How are you
>>> possibly OK with breaking the reply cache?
>>> 
>> 
>> Is it really breaking things though if you can't connect otherwise?
>> Bear in mind that if you're dealing with a NAT'ed setup, and you wait
>> until the connection is completely forgotten, then the NAT'ing
>> firewall is likely to change your source port anyway.
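
To make the failure mode concrete: the client deliberately re-binds
the same source port on reconnect. Below is roughly the user-space
analogue (an illustrative sketch with a made-up helper name, not the
kernel code path; the addresses are taken from the traces later in
this thread). While the server still holds state for the old 4-tuple,
a second connect() from the same source port is what draws the RSTs:

	/* Sketch: reconnect from a fixed source port. SO_REUSEADDR lets
	 * us re-bind the port locally, but it cannot make the *server*
	 * forget its end of the previous connection, so the new SYN may
	 * be answered with RST until that state expires. */
	#include <arpa/inet.h>
	#include <netinet/in.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <sys/socket.h>
	#include <unistd.h>

	static int connect_from_port(uint16_t srcport)
	{
		struct sockaddr_in src = { .sin_family = AF_INET,
					   .sin_port = htons(srcport) };
		struct sockaddr_in dst = { .sin_family = AF_INET,
					   .sin_port = htons(2049) };
		int one = 1;
		int fd = socket(AF_INET, SOCK_STREAM, 0);

		if (fd < 0)
			return -1;
		inet_pton(AF_INET, "192.168.1.134", &src.sin_addr);
		inet_pton(AF_INET, "192.168.1.106", &dst.sin_addr);
		setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
		if (bind(fd, (struct sockaddr *)&src, sizeof(src)) ||
		    connect(fd, (struct sockaddr *)&dst, sizeof(dst))) {
			perror("reconnect");	/* e.g. ECONNREFUSED on RST */
			close(fd);
			return -1;
		}
		return fd;
	}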
>> 
>> Chuck brought up an interesting question privately: should knfsd's
>> v3/v4.0 DRC start ignoring the source port? We already check these
>> otherwise:
>> 
>> - IP addr
>> - XID
>> - hash of first 256 bytes of the payload
> 
> Calculating a hash of every packet has a significant performance
> impact.

NFSD has done this for years. On modern CPUs, it's less of a
performance hit than walking the DRC hash chain.

> But perhaps if we require v3 traffic to do that, then we can get v3
> and v4.1 performance on the same level, and folks would finally
> switch to v4.1.
> 
> I also forgot to write that while we don't care about the port (not
> being reused) for 4.1, if we switch the port on every connection
> establishment we will run into port exhaustion all the same. Use of
> nconnect is becoming common, so after something like a server reboot,
> a client machine with only about 10 mounts using nconnect=16 and an
> average of 7 SYNs before the server starts again (as in the example
> here) would use 10 x 16 x 7 = 1120, i.e. about 1K source ports?
> 
>> That seems like enough discriminators that we could stop comparing
>> the source port without breaking things.
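
To make that concrete, a rough sketch of a DRC match that drops the
source-port check (hypothetical types and field names, for
illustration only; not copied from nfsd's actual cache code):

	/* Hypothetical DRC key: the source port is deliberately absent.
	 * IP address + XID + a checksum over the first 256 bytes of the
	 * call are the remaining discriminators. */
	#include <stdbool.h>
	#include <stdint.h>

	struct drc_key {
		uint32_t xid;        /* RPC transaction ID */
		uint32_t client_ip;  /* caller's IP address */
		uint32_t csum;       /* hash of first 256 bytes of the call */
		uint32_t prog, vers, proc;
	};

	static bool drc_match(const struct drc_key *a,
			      const struct drc_key *b)
	{
		return a->xid == b->xid &&
		       a->client_ip == b->client_ip &&
		       a->csum == b->csum &&
		       a->prog == b->prog &&
		       a->vers == b->vers &&
		       a->proc == b->proc;
	}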
>>>>> But can't we at least arm ourselves against unnecessarily breaking
>>>>> the reply cache by imposing some timeout/number of retries before
>>>>> resetting? If the client has been retrying unsuccessfully to
>>>>> re-establish the connection for a (fixed) while, then the 4.0
>>>>> client's lease would expire, and switching the port after the lease
>>>>> expires makes no difference. There isn't a solution in v3,
>>>>> unfortunately. But a time-based approach would at least separate
>>>>> these 'peculiar' servers from normal servers. And if this is a 4.1
>>>>> client, we can reset the port without a timeout.
>>>>> 
>>>> 
>>>> This is not a 'peculiar server' vs 'normal server' problem. The
>>>> reuse of ports in this way violates the TCP protocol, and has been
>>>> a problem
>>> 
>>> I disagree here. Even the RFC quoted by the blogger says that reuse
>>> of a port is allowed.
>>> 
>>>> for NFS/TCP since the beginning. However, it was never a problem
>>>> for the older connectionless UDP protocol, which is where the
>>>> practice of tying the replay cache to the source port began in the
>>>> first place.
>>>> 
>>>> NFSv4.1 does not have this problem because it deliberately does not
>>>> reuse TCP ports, and the reason is precisely to avoid the TIME_WAIT
>>>> state problems.
>>>> 
>>>> NFSv3 tries to avoid it by doing an incremental back-off, but we
>>>> recently saw that this does not suffice to avoid livelock, after a
>>>> system got stuck for several hours in this state.
>>>> 
>>>>> Am I correct that every unsuccessful SYN causes a new source port
>>>>> to be taken? If so, then a server reboot where multiple SYNs are
>>>>> sent prior to connection re-establishment (times the number of
>>>>> mounts) might cause source port exhaustion?
>>>>> 
>>>> 
>>>> No. Not every unsuccessful SYN: it is every unsuccessful sequence
>>>> of
>>> 
>>> I disagree. Here's a snippet of the network trace with the proposed
>>> patch. The port is changed on EVERY unsuccessful SYN.
>>> 
>>>  76 2023-10-03 10:17:04.285731 192.168.1.134 → 192.168.1.106 NFS 238 V3 WRITE Call, FH: 0x10bedd7c Offset: 0 Len: 4 FILE_SYNC
>>>  77 2023-10-03 10:17:04.328371 192.168.1.106 → 192.168.1.134 TCP 66 2049 → 909 [ACK] Seq=1113 Ack=1501 Win=31872 Len=0 TSval=3542359002 TSecr=3081600630
>>> 256 2023-10-03 10:18:04.341041 192.168.1.134 → 192.168.1.106 TCP 66 [TCP Keep-Alive] 909 → 2049 [ACK] Seq=1500 Ack=1113 Win=32000 Len=0 TSval=3081660681 TSecr=3542359002
>>> 259 2023-10-03 10:18:04.341500 192.168.1.106 → 192.168.1.134 TCP 54 2049 → 909 [RST] Seq=1113 Win=0 Len=0
>>> 260 2023-10-03 10:18:04.341860 192.168.1.134 → 192.168.1.106 TCP 74 [TCP Port numbers reused] 909 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM TSval=3081660681 TSecr=0 WS=128
>>> 261 2023-10-03 10:18:04.342031 192.168.1.106 → 192.168.1.134 TCP 54 2049 → 909 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
>>> 266 2023-10-03 10:18:07.380801 192.168.1.134 → 192.168.1.106 TCP 74 954 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM TSval=3081663720 TSecr=0 WS=128
>>> 267 2023-10-03 10:18:07.380971 192.168.1.106 → 192.168.1.134 TCP 54 2049 → 954 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
>>> 275 2023-10-03 10:18:10.423352 192.168.1.134 → 192.168.1.106 TCP 74 856 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM TSval=3081666760 TSecr=0 WS=128
>>> 276 2023-10-03 10:18:10.423621 192.168.1.106 → 192.168.1.134 TCP 54 2049 → 856 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
>>> 286 2023-10-03 10:18:13.466277 192.168.1.134 → 192.168.1.106 TCP 74 957 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM TSval=3081669801 TSecr=0 WS=128
>>> 287 2023-10-03 10:18:13.466812 192.168.1.106 → 192.168.1.134 TCP 54 2049 → 957 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
>>> 289 2023-10-03 10:18:16.509229 192.168.1.134 → 192.168.1.106 TCP 74 695 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM TSval=3081672841 TSecr=0 WS=128
>>> 290 2023-10-03 10:18:16.509845 192.168.1.106 → 192.168.1.134 TCP 54 2049 → 695 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
>>> 294 2023-10-03 10:18:19.551062 192.168.1.134 → 192.168.1.106 TCP 74 940 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM TSval=3081675881 TSecr=0 WS=128
>>> 295 2023-10-03 10:18:19.551434 192.168.1.106 → 192.168.1.134 TCP 54 2049 → 940 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
>>> 300 2023-10-03 10:18:22.590380 192.168.1.134 → 192.168.1.106 TCP 74 810 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM TSval=3081678921 TSecr=0 WS=128
>>> 301 2023-10-03 10:18:22.590726 192.168.1.106 → 192.168.1.134 TCP 54 2049 → 810 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
>>> 308 2023-10-03 10:18:25.628256 192.168.1.134 → 192.168.1.106 TCP 74 877 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM TSval=3081681961 TSecr=0 WS=128
>>> 309 2023-10-03 10:18:25.628724 192.168.1.106 → 192.168.1.134 TCP 54 2049 → 877 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
>>> 312 2023-10-03 10:18:28.665682 192.168.1.134 → 192.168.1.106 TCP 74 934 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM TSval=3081685001 TSecr=0 WS=128
>>> 313 2023-10-03 10:18:28.666374 192.168.1.106 → 192.168.1.134 TCP 54 2049 → 934 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
>>> 320 2023-10-03 10:18:31.702236 192.168.1.134 → 192.168.1.106 TCP 74 803 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM TSval=3081688040 TSecr=0 WS=128
>>> 321 2023-10-03 10:18:31.702490 192.168.1.106 → 192.168.1.134 TCP 74 2049 → 803 [SYN, ACK] Seq=0 Ack=1 Win=31856 Len=0 MSS=1460 SACK_PERM TSval=1993141756 TSecr=3081688040 WS=128
>>> 322 2023-10-03 10:18:31.702729 192.168.1.134 → 192.168.1.106 TCP 66 803 → 2049 [ACK] Seq=1 Ack=1 Win=32128 Len=0 TSval=3081688040 TSecr=1993141756
>>> 323 2023-10-03 10:18:31.702737 192.168.1.134 → 192.168.1.106 NFS 238 V3 WRITE Call, FH: 0x10bedd7c Offset: 0 Len: 4 FILE_SYNC
>>> 324 2023-10-03 10:18:31.702893 192.168.1.106 → 192.168.1.134 TCP 66 2049 → 803 [ACK] Seq=1 Ack=173 Win=31872 Len=0 TSval=1993141756 TSecr=3081688040
>>> 749 2023-10-03 10:19:01.880214 192.168.1.106 → 192.168.1.134 NFS 206 V3 WRITE Reply (Call In 323) Len: 4 FILE_SYNC
>>> 
>>> This is the same scenario without the patch. The port is
>>> successfully reused; the replay cache is OK here, not in the trace
>>> above.
>>> 
>>>  76 2023-10-03 10:17:04.285731 192.168.1.134 → 192.168.1.106 NFS 238 V3 WRITE Call, FH: 0x10bedd7c Offset: 0 Len: 4 FILE_SYNC
>>>  77 2023-10-03 10:17:04.328371 192.168.1.106 → 192.168.1.134 TCP 66 2049 → 909 [ACK] Seq=1113 Ack=1501 Win=31872 Len=0 TSval=3542359002 TSecr=3081600630
>>> 256 2023-10-03 10:18:04.341041 192.168.1.134 → 192.168.1.106 TCP 66 [TCP Keep-Alive] 909 → 2049 [ACK] Seq=1500 Ack=1113 Win=32000 Len=0 TSval=3081660681 TSecr=3542359002
>>> 259 2023-10-03 10:18:04.341500 192.168.1.106 → 192.168.1.134 TCP 54 2049 → 909 [RST] Seq=1113 Win=0 Len=0
>>> 260 2023-10-03 10:18:04.341860 192.168.1.134 → 192.168.1.106 TCP 74 [TCP Port numbers reused] 909 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM TSval=3081660681 TSecr=0 WS=128
>>> 261 2023-10-03 10:18:04.342031 192.168.1.106 → 192.168.1.134 TCP 54 2049 → 909 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
>>> 266 2023-10-03 10:18:07.380801 192.168.1.134 → 192.168.1.106 TCP 74 954 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM TSval=3081663720 TSecr=0 WS=128
>>> 267 2023-10-03 10:18:07.380971 192.168.1.106 → 192.168.1.134 TCP 54 2049 → 954 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
>>> 275 2023-10-03 10:18:10.423352 192.168.1.134 → 192.168.1.106 TCP 74 856 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM TSval=3081666760 TSecr=0 WS=128
>>> 276 2023-10-03 10:18:10.423621 192.168.1.106 → 192.168.1.134 TCP 54 2049 → 856 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
>>> 286 2023-10-03 10:18:13.466277 192.168.1.134 → 192.168.1.106 TCP 74 957 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM TSval=3081669801 TSecr=0 WS=128
>>> 287 2023-10-03 10:18:13.466812 192.168.1.106 → 192.168.1.134 TCP 54 2049 → 957 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
>>> 289 2023-10-03 10:18:16.509229 192.168.1.134 → 192.168.1.106 TCP 74 695 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM TSval=3081672841 TSecr=0 WS=128
>>> 290 2023-10-03 10:18:16.509845 192.168.1.106 → 192.168.1.134 TCP 54 2049 → 695 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
>>> 294 2023-10-03 10:18:19.551062 192.168.1.134 → 192.168.1.106 TCP 74 940 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM TSval=3081675881 TSecr=0 WS=128
>>> 295 2023-10-03 10:18:19.551434 192.168.1.106 → 192.168.1.134 TCP 54 2049 → 940 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
>>> 300 2023-10-03 10:18:22.590380 192.168.1.134 → 192.168.1.106 TCP 74 810 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM TSval=3081678921 TSecr=0 WS=128
>>> 301 2023-10-03 10:18:22.590726 192.168.1.106 → 192.168.1.134 TCP 54 2049 → 810 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
>>> 308 2023-10-03 10:18:25.628256 192.168.1.134 → 192.168.1.106 TCP 74 877 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM TSval=3081681961 TSecr=0 WS=128
>>> 309 2023-10-03 10:18:25.628724 192.168.1.106 → 192.168.1.134 TCP 54 2049 → 877 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
>>> 312 2023-10-03 10:18:28.665682 192.168.1.134 → 192.168.1.106 TCP 74 934 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM TSval=3081685001 TSecr=0 WS=128
>>> 313 2023-10-03 10:18:28.666374 192.168.1.106 → 192.168.1.134 TCP 54 2049 → 934 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
>>> 320 2023-10-03 10:18:31.702236 192.168.1.134 → 192.168.1.106 TCP 74 803 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM TSval=3081688040 TSecr=0 WS=128
>>> 321 2023-10-03 10:18:31.702490 192.168.1.106 → 192.168.1.134 TCP 74 2049 → 803 [SYN, ACK] Seq=0 Ack=1 Win=31856 Len=0 MSS=1460 SACK_PERM TSval=1993141756 TSecr=3081688040 WS=128
>>> 322 2023-10-03 10:18:31.702729 192.168.1.134 → 192.168.1.106 TCP 66 803 → 2049 [ACK] Seq=1 Ack=1 Win=32128 Len=0 TSval=3081688040 TSecr=1993141756
>>> 323 2023-10-03 10:18:31.702737 192.168.1.134 → 192.168.1.106 NFS 238 V3 WRITE Call, FH: 0x10bedd7c Offset: 0 Len: 4 FILE_SYNC
>>> 324 2023-10-03 10:18:31.702893 192.168.1.106 → 192.168.1.134 TCP 66 2049 → 803 [ACK] Seq=1 Ack=173 Win=31872 Len=0 TSval=1993141756 TSecr=3081688040
>>> 749 2023-10-03 10:19:01.880214 192.168.1.106 → 192.168.1.134 NFS 206 V3 WRITE Reply (Call In 323) Len: 4 FILE_SYNC
>>> 750 2023-10-03 10:19:01.880616 192.168.1.134 → 192.168.1.106 TCP 66 803 → 2049 [ACK] Seq=173 Ack=141 Win=32000 Len=0 TSval=3081718241 TSecr=1993171927
>>> 
>>> 
>>>> SYNs. If the server is not replying to our SYN packets, then the
>>>> TCP layer will back off and retransmit. So there is already a
>>>> backoff-retry happening at that level.
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
>>>>>>>> ---
>>>>>>>>  net/sunrpc/xprtsock.c | 10 +++++++++-
>>>>>>>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>>>>>>> 
>>>>>>>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>>>>>>>> index 71848ab90d13..1a96777f0ed5 100644
>>>>>>>> --- a/net/sunrpc/xprtsock.c
>>>>>>>> +++ b/net/sunrpc/xprtsock.c
>>>>>>>> @@ -62,6 +62,7 @@
>>>>>>>>  #include "sunrpc.h"
>>>>>>>> 
>>>>>>>>  static void xs_close(struct rpc_xprt *xprt);
>>>>>>>> +static void xs_reset_srcport(struct sock_xprt *transport);
>>>>>>>>  static void xs_set_srcport(struct sock_xprt *transport, struct socket *sock);
>>>>>>>>  static void xs_tcp_set_socket_timeouts(struct rpc_xprt *xprt,
>>>>>>>>  		struct socket *sock);
>>>>>>>> @@ -1565,8 +1566,10 @@ static void xs_tcp_state_change(struct sock *sk)
>>>>>>>>  		break;
>>>>>>>>  	case TCP_CLOSE:
>>>>>>>>  		if (test_and_clear_bit(XPRT_SOCK_CONNECTING,
>>>>>>>> -					&transport->sock_state))
>>>>>>>> +					&transport->sock_state)) {
>>>>>>>> +			xs_reset_srcport(transport);
>>>>>>>>  			xprt_clear_connecting(xprt);
>>>>>>>> +		}
>>>>>>>>  		clear_bit(XPRT_CLOSING, &xprt->state);
>>>>>>>>  		/* Trigger the socket release */
>>>>>>>>  		xs_run_error_worker(transport, XPRT_SOCK_WAKE_DISCONNECT);
>>>>>>>> @@ -1722,6 +1725,11 @@ static void xs_set_port(struct rpc_xprt *xprt, unsigned short port)
>>>>>>>>  	xs_update_peer_port(xprt);
>>>>>>>>  }
>>>>>>>> 
>>>>>>>> +static void xs_reset_srcport(struct sock_xprt *transport)
>>>>>>>> +{
>>>>>>>> +	transport->srcport = 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  static void xs_set_srcport(struct sock_xprt *transport, struct socket *sock)
>>>>>>>>  {
>>>>>>>>  	if (transport->srcport == 0 && transport->xprt.reuseport)
>>>>>>>> -- 
>>>>>>>> 2.41.0
>>>>>>>> 
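
In miniature, the interplay the patch sets up (a toy model with
invented names, condensed from the diff above rather than copied from
xprtsock.c):

	/* Toy model of the srcport remember/reset logic. */
	struct toy_xprt {
		unsigned short srcport;	/* 0 == "no remembered port" */
		int reuseport;
	};

	/* On a connect attempt that failed without ever establishing:
	 * forget the old port so the next bind picks a fresh one. */
	static void toy_reset_srcport(struct toy_xprt *t)
	{
		t->srcport = 0;
	}

	/* Once a socket is bound: remember the port that was chosen so
	 * later reconnects can reuse it and keep the server's replay
	 * cache matching intact. */
	static void toy_set_srcport(struct toy_xprt *t, unsigned short bound)
	{
		if (t->srcport == 0 && t->reuseport)
			t->srcport = bound;
	}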
>>>>>> 
>>>>>> -- 
>>>>>> Trond Myklebust
>>>>>> Linux NFS client maintainer, Hammerspace
>>>>>> trond.myklebust@xxxxxxxxxxxxxxx
>>>> 
>>>> -- 
>>>> Trond Myklebust
>>>> Linux NFS client maintainer, Hammerspace
>>>> trond.myklebust@xxxxxxxxxxxxxxx
>>>> 
>>>> 
>> 
>> -- 
>> Jeff Layton <jlayton@xxxxxxxxxx>

-- 
Chuck Lever