Re: [PATCH] SUNRPC: Don't retry using the same source port if connection failed

On Sat, Sep 30, 2023 at 7:06 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
> On Sat, 2023-09-30 at 18:36 -0400, Olga Kornievskaia wrote:
> > On Fri, Sep 29, 2023 at 10:57 PM Trond Myklebust
> > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > >
> > > On Thu, 2023-09-28 at 10:58 -0400, Olga Kornievskaia wrote:
> > > > On Wed, Sep 27, 2023 at 3:35 PM <trondmy@xxxxxxxxxx> wrote:
> > > > >
> > > > > From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> > > > >
> > > > > If the TCP connection attempt fails without ever establishing a
> > > > > connection, then assume the problem may be the server rejecting
> > > > > us due to port reuse.
> > > >
> > > > Doesn't this break the 4.0 replay cache? It seems too general to
> > > > assume that any unsuccessful SYN was due to a server reboot and
> > > > that it's OK for the client to change the port.
> > >
> > > This is where things get interesting. Yes, if we change the port
> > > number, then it will almost certainly break NFSv3 and NFSv4.0 replay
> > > caching on the server.
> > >
> > > However, the problem is that once we get stuck in the situation where
> > > we cannot connect, each new connection attempt just causes the
> > > server's TCP layer to push back and recall that the connection from
> > > this port was closed.
> > > IOW: the problem is that once we're in this situation, we cannot
> > > easily exit without doing one of the following. Either we have to
> > >
> > >    1. Change the port number, so that the TCP layer allows us to
> > >       connect.
> > >    2. Or.. Wait for long enough that the TCP layer has forgotten
> > >       altogether about the previous connection.
> > >
> > > The problem is that option (2) is subject to livelock, and so has a
> > > potentially infinite timeout. I've seen this livelock in action, and
> > > I'm not seeing a solution that has predictable results.
> > >
> > > So unless there is a solution for the problems in (2), I don't see
> > > how we can avoid defaulting to option (1) at some point, in which
> > > case the only question is "when do we switch ports?".
> >
> > I'm not sure how one can justify that the regression that will come
> > out of #1 will be less of a problem than the problem in #2.
> >
> > I think I'm still not grasping why the NFS server would (legitimately)
> > be closing a connection that is re-using the port. Can you present a
> > sequence of events that would lead to this?
> >
>
> Yes. It is essentially the problem described in this blog:
> https://blog.davidvassallo.me/2010/07/13/time_wait-and-port-reuse/
>
> ...and as you can see, it has nothing to do with NFS. This is the TCP
> protocol working as expected.

What I'm seeing are statements that the RFC allows for, and provides
guidance on, how to transition out of the TIME_WAIT state. I'm also
hearing that the reason the server can't allow port reuse is a broken
client implementation or the use of a (broken?) NAT implementation.

I don't see how any of this justifies allowing a regression in the
Linux client code. I'm clearly missing something. How are you possibly
OK with breaking the reply cache?

> > But can't we at least guard against unnecessarily breaking the reply
> > cache by imposing some timeout/number of retries before resetting? If
> > the client has been unsuccessfully retrying to re-establish the
> > connection for a (fixed) while, then the 4.0 client's lease would
> > expire, and switching the port after the lease expires makes no
> > difference. There isn't a solution in v3, unfortunately. But a
> > time-based approach would at least separate these 'peculiar' servers
> > from normal servers. And if this is a 4.1 client, we can reset the
> > port without a timeout.
> >
>
> This is not a 'peculiar server' vs 'normal server' problem. The reuse
> of ports in this way violates the TCP protocol, and has been a problem

I disagree here. Even the RFC quoted by the blogger says that port
reuse is allowed.

> for NFS/TCP since the beginning. However, it was never a problem for
> the older connectionless UDP protocol, which is where the practice of
> tying the replay cache to the source port began in the first place.
>
> NFSv4.1 does not have this problem because it deliberately does not
> reuse TCP ports, and the reason is precisely to avoid the TIME_WAIT
> state problems.
>
> NFSv3 tries to avoid it by doing an incremental backoff, but we
> recently saw that this does not suffice to avoid livelock, after a
> system got stuck for several hours in this state.
>
> > Am I correct that every unsuccessful SYN causes a new source port to
> > be taken? If so, then a server reboot where multiple SYNs are sent
> > prior to connection re-establishment (times the number of mounts)
> > might cause source port exhaustion?
> >
>
> No. Not every unsuccessful SYN: It is every unsuccessful sequence of

I disagree. Here's a snippet of the network trace with the proposed
patch. The port is changed on EVERY unsuccessful SYN.

   76 2023-10-03 10:17:04.285731 192.168.1.134 → 192.168.1.106 NFS 238
V3 WRITE Call, FH: 0x10bedd7c Offset: 0 Len: 4 FILE_SYNC
   77 2023-10-03 10:17:04.328371 192.168.1.106 → 192.168.1.134 TCP 66
2049 → 909 [ACK] Seq=1113 Ack=1501 Win=31872 Len=0 TSval=3542359002
TSecr=3081600630
  256 2023-10-03 10:18:04.341041 192.168.1.134 → 192.168.1.106 TCP 66
[TCP Keep-Alive] 909 → 2049 [ACK] Seq=1500 Ack=1113 Win=32000 Len=0
TSval=3081660681 TSecr=3542359002
  259 2023-10-03 10:18:04.341500 192.168.1.106 → 192.168.1.134 TCP 54
2049 → 909 [RST] Seq=1113 Win=0 Len=0
  260 2023-10-03 10:18:04.341860 192.168.1.134 → 192.168.1.106 TCP 74
[TCP Port numbers reused] 909 → 2049 [SYN] Seq=0 Win=32120 Len=0
MSS=1460 SACK_PERM TSval=3081660681 TSecr=0 WS=128
  261 2023-10-03 10:18:04.342031 192.168.1.106 → 192.168.1.134 TCP 54
2049 → 909 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
  266 2023-10-03 10:18:07.380801 192.168.1.134 → 192.168.1.106 TCP 74
954 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM
TSval=3081663720 TSecr=0 WS=128
  267 2023-10-03 10:18:07.380971 192.168.1.106 → 192.168.1.134 TCP 54
2049 → 954 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
  275 2023-10-03 10:18:10.423352 192.168.1.134 → 192.168.1.106 TCP 74
856 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM
TSval=3081666760 TSecr=0 WS=128
  276 2023-10-03 10:18:10.423621 192.168.1.106 → 192.168.1.134 TCP 54
2049 → 856 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
  286 2023-10-03 10:18:13.466277 192.168.1.134 → 192.168.1.106 TCP 74
957 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM
TSval=3081669801 TSecr=0 WS=128
  287 2023-10-03 10:18:13.466812 192.168.1.106 → 192.168.1.134 TCP 54
2049 → 957 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
  289 2023-10-03 10:18:16.509229 192.168.1.134 → 192.168.1.106 TCP 74
695 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM
TSval=3081672841 TSecr=0 WS=128
  290 2023-10-03 10:18:16.509845 192.168.1.106 → 192.168.1.134 TCP 54
2049 → 695 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
  294 2023-10-03 10:18:19.551062 192.168.1.134 → 192.168.1.106 TCP 74
940 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM
TSval=3081675881 TSecr=0 WS=128
  295 2023-10-03 10:18:19.551434 192.168.1.106 → 192.168.1.134 TCP 54
2049 → 940 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
  300 2023-10-03 10:18:22.590380 192.168.1.134 → 192.168.1.106 TCP 74
810 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM
TSval=3081678921 TSecr=0
WS=128
  301 2023-10-03 10:18:22.590726 192.168.1.106 → 192.168.1.134 TCP 54
2049 → 810 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
  308 2023-10-03 10:18:25.628256 192.168.1.134 → 192.168.1.106 TCP 74
877 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM
TSval=3081681961 TSecr=0 WS=128
  309 2023-10-03 10:18:25.628724 192.168.1.106 → 192.168.1.134 TCP 54
2049 → 877 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
  312 2023-10-03 10:18:28.665682 192.168.1.134 → 192.168.1.106 TCP 74
934 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM
TSval=3081685001 TSecr=0 WS=128
  313 2023-10-03 10:18:28.666374 192.168.1.106 → 192.168.1.134 TCP 54
2049 → 934 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
  320 2023-10-03 10:18:31.702236 192.168.1.134 → 192.168.1.106 TCP 74
803 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM
TSval=3081688040 TSecr=0 WS=128
  321 2023-10-03 10:18:31.702490 192.168.1.106 → 192.168.1.134 TCP 74
2049 → 803 [SYN, ACK] Seq=0 Ack=1 Win=31856 Len=0 MSS=1460 SACK_PERM
TSval=1993141756 TSecr=3081688040 WS=128
  322 2023-10-03 10:18:31.702729 192.168.1.134 → 192.168.1.106 TCP 66
803 → 2049 [ACK] Seq=1 Ack=1 Win=32128 Len=0 TSval=3081688040
TSecr=1993141756
  323 2023-10-03 10:18:31.702737 192.168.1.134 → 192.168.1.106 NFS 238
V3 WRITE Call, FH: 0x10bedd7c Offset: 0 Len: 4 FILE_SYNC
  324 2023-10-03 10:18:31.702893 192.168.1.106 → 192.168.1.134 TCP 66
2049 → 803 [ACK] Seq=1 Ack=173 Win=31872 Len=0 TSval=1993141756
TSecr=3081688040
  749 2023-10-03 10:19:01.880214 192.168.1.106 → 192.168.1.134 NFS 206
V3 WRITE Reply (Call In 323) Len: 4 FILE_SYNC

This is the same scenario without the patch. The port is successfully
reused, so the replay cache is OK here, unlike above.

   76 2023-10-03 10:17:04.285731 192.168.1.134 → 192.168.1.106 NFS 238
V3 WRITE Call, FH: 0x10bedd7c Offset: 0 Len: 4 FILE_SYNC
   77 2023-10-03 10:17:04.328371 192.168.1.106 → 192.168.1.134 TCP 66
2049 → 909 [ACK] Seq=1113 Ack=1501 Win=31872 Len=0 TSval=3542359002
TSecr=3081600630
  256 2023-10-03 10:18:04.341041 192.168.1.134 → 192.168.1.106 TCP 66
[TCP Keep-Alive] 909 → 2049 [ACK] Seq=1500 Ack=1113 Win=32000 Len=0
TSval=3081660681 TSecr=3542359002
  259 2023-10-03 10:18:04.341500 192.168.1.106 → 192.168.1.134 TCP 54
2049 → 909 [RST] Seq=1113 Win=0 Len=0
  260 2023-10-03 10:18:04.341860 192.168.1.134 → 192.168.1.106 TCP 74
[TCP Port numbers reused] 909 → 2049 [SYN] Seq=0 Win=32120 Len=0
MSS=1460 SACK_PERM TSval=3081660681 TSecr=0 WS=128
  261 2023-10-03 10:18:04.342031 192.168.1.106 → 192.168.1.134 TCP 54
2049 → 909 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
  266 2023-10-03 10:18:07.380801 192.168.1.134 → 192.168.1.106 TCP 74
954 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM
TSval=3081663720 TSecr=0 WS=128
  267 2023-10-03 10:18:07.380971 192.168.1.106 → 192.168.1.134 TCP 54
2049 → 954 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
  275 2023-10-03 10:18:10.423352 192.168.1.134 → 192.168.1.106 TCP 74
856 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM
TSval=3081666760 TSecr=0 WS=128
  276 2023-10-03 10:18:10.423621 192.168.1.106 → 192.168.1.134 TCP 54
2049 → 856 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
  286 2023-10-03 10:18:13.466277 192.168.1.134 → 192.168.1.106 TCP 74
957 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM
TSval=3081669801 TSecr=0 WS=128
  287 2023-10-03 10:18:13.466812 192.168.1.106 → 192.168.1.134 TCP 54
2049 → 957 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
  289 2023-10-03 10:18:16.509229 192.168.1.134 → 192.168.1.106 TCP 74
695 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM
TSval=3081672841 TSecr=0 WS=128
  290 2023-10-03 10:18:16.509845 192.168.1.106 → 192.168.1.134 TCP 54
2049 → 695 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
  294 2023-10-03 10:18:19.551062 192.168.1.134 → 192.168.1.106 TCP 74
940 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM
TSval=3081675881 TSecr=0 WS=128
  295 2023-10-03 10:18:19.551434 192.168.1.106 → 192.168.1.134 TCP 54
2049 → 940 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
  300 2023-10-03 10:18:22.590380 192.168.1.134 → 192.168.1.106 TCP 74
810 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM
TSval=3081678921 TSecr=0 WS=128
  301 2023-10-03 10:18:22.590726 192.168.1.106 → 192.168.1.134 TCP 54
2049 → 810 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
  308 2023-10-03 10:18:25.628256 192.168.1.134 → 192.168.1.106 TCP 74
877 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM
TSval=3081681961 TSecr=0 WS=128
  309 2023-10-03 10:18:25.628724 192.168.1.106 → 192.168.1.134 TCP 54
2049 → 877 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
  312 2023-10-03 10:18:28.665682 192.168.1.134 → 192.168.1.106 TCP 74
934 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM
TSval=3081685001 TSecr=0 WS=128
  313 2023-10-03 10:18:28.666374 192.168.1.106 → 192.168.1.134 TCP 54
2049 → 934 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
  320 2023-10-03 10:18:31.702236 192.168.1.134 → 192.168.1.106 TCP 74
803 → 2049 [SYN] Seq=0 Win=32120 Len=0 MSS=1460 SACK_PERM
TSval=3081688040 TSecr=0 WS=128
  321 2023-10-03 10:18:31.702490 192.168.1.106 → 192.168.1.134 TCP 74
2049 → 803 [SYN, ACK] Seq=0 Ack=1 Win=31856 Len=0 MSS=1460 SACK_PERM
TSval=1993141756 TSecr=3081688040 WS=128
  322 2023-10-03 10:18:31.702729 192.168.1.134 → 192.168.1.106 TCP 66
803 → 2049 [ACK] Seq=1 Ack=1 Win=32128 Len=0 TSval=3081688040
TSecr=1993141756
  323 2023-10-03 10:18:31.702737 192.168.1.134 → 192.168.1.106 NFS 238
V3 WRITE Call, FH: 0x10bedd7c Offset: 0 Len: 4 FILE_SYNC
  324 2023-10-03 10:18:31.702893 192.168.1.106 → 192.168.1.134 TCP 66
2049 → 803 [ACK] Seq=1 Ack=173 Win=31872 Len=0 TSval=1993141756
TSecr=3081688040
  749 2023-10-03 10:19:01.880214 192.168.1.106 → 192.168.1.134 NFS 206
V3 WRITE Reply (Call In 323) Len: 4 FILE_SYNC
  750 2023-10-03 10:19:01.880616 192.168.1.134 → 192.168.1.106 TCP 66
803 → 2049 [ACK] Seq=173 Ack=141 Win=32000 Len=0 TSval=3081718241
TSecr=1993171927


> SYNs. If the server is not replying to our SYN packets, then the TCP
> layer will back off and retransmit. So there is already a backoff-retry
> happening at that level.
>
> >
> > >
> > > >
> > > > >
> > > > > Signed-off-by: Trond Myklebust
> > > > > <trond.myklebust@xxxxxxxxxxxxxxx>
> > > > > ---
> > > > >  net/sunrpc/xprtsock.c | 10 +++++++++-
> > > > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> > > > > index 71848ab90d13..1a96777f0ed5 100644
> > > > > --- a/net/sunrpc/xprtsock.c
> > > > > +++ b/net/sunrpc/xprtsock.c
> > > > > @@ -62,6 +62,7 @@
> > > > >  #include "sunrpc.h"
> > > > >
> > > > >  static void xs_close(struct rpc_xprt *xprt);
> > > > > +static void xs_reset_srcport(struct sock_xprt *transport);
> > > > >  static void xs_set_srcport(struct sock_xprt *transport, struct socket *sock);
> > > > >  static void xs_tcp_set_socket_timeouts(struct rpc_xprt *xprt,
> > > > >                 struct socket *sock);
> > > > > @@ -1565,8 +1566,10 @@ static void xs_tcp_state_change(struct sock *sk)
> > > > >                 break;
> > > > >         case TCP_CLOSE:
> > > > >                 if (test_and_clear_bit(XPRT_SOCK_CONNECTING,
> > > > > -                                       &transport->sock_state))
> > > > > +                                      &transport->sock_state)) {
> > > > > +                       xs_reset_srcport(transport);
> > > > >                         xprt_clear_connecting(xprt);
> > > > > +               }
> > > > >                 clear_bit(XPRT_CLOSING, &xprt->state);
> > > > >                 /* Trigger the socket release */
> > > > >                 xs_run_error_worker(transport, XPRT_SOCK_WAKE_DISCONNECT);
> > > > > @@ -1722,6 +1725,11 @@ static void xs_set_port(struct rpc_xprt *xprt, unsigned short port)
> > > > >         xs_update_peer_port(xprt);
> > > > >  }
> > > > >
> > > > > +static void xs_reset_srcport(struct sock_xprt *transport)
> > > > > +{
> > > > > +       transport->srcport = 0;
> > > > > +}
> > > > > +
> > > > >  static void xs_set_srcport(struct sock_xprt *transport, struct socket *sock)
> > > > >  {
> > > > >         if (transport->srcport == 0 && transport->xprt.reuseport)
> > > > > --
> > > > > 2.41.0
> > > > >
> > >
> > > --
> > > Trond Myklebust
> > > Linux NFS client maintainer, Hammerspace
> > > trond.myklebust@xxxxxxxxxxxxxxx
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@xxxxxxxxxxxxxxx
>
>



