On Tue, Apr 27, 2021 at 2:55 PM Maciej Żenczykowski <zenczykowski@xxxxxxxxx> wrote:
> On Mon, Apr 26, 2021 at 8:47 PM Kuniyuki Iwashima <kuniyu@xxxxxxxxxxxx> wrote:
> > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > accept connections evenly. However, there is a defect in the current
> > implementation [1]. When a SYN packet is received, the connection is tied
> > to a listening socket. Accordingly, when the listener is closed, in-flight
> > requests during the three-way handshake and child sockets in the accept
> > queue are dropped even if other listeners on the same port could accept
> > such connections.
> >
> > This situation can happen when various server management tools restart
> > server (such as nginx) processes. For instance, when we change nginx
> > configurations and restart it, it spins up new workers that respect the new
> > configuration and closes all listeners on the old workers, resulting in the
> > in-flight ACK of 3WHS is responded by RST.
>
> This is IMHO a userspace bug.
>
> You should never be closing or creating new SO_REUSEPORT sockets on a
> running server (listening port).
>
> There's at least 3 ways to accomplish this.
>
> One involves a shim parent process that takes care of creating the
> sockets (without close-on-exec),
> then fork-exec's the actual server process[es] (which will use the
> already opened listening fds),
> and can thus re-fork-exec a new child while using the same set of sockets.
> Here the old server can terminate before the new one starts.
>
> (one could even envision systemd being modified to support this...)
>
> The second involves the old running server fork-execing the new server
> and handing off the non-CLOEXEC sockets that way.

(this doesn't even need to be fork-exec -- can just be exec -- and is
potentially easier)

> The third approach involves unix fd passing of sockets to hand off the
> listening sockets from the old process/thread(s) to the new
> process/thread(s).
> Once handed off the old server can stop accept'ing
> on the listening sockets and close them (the real copies are in the
> child), finish processing any still active connections (or time them

(this doesn't actually need to be a child, it can be an entirely new
parallel instance of the server, potentially running in an entirely
new container/cgroup setup, though in the same network namespace)

> out) and terminate.
>
> Either way you're never creating new SO_REUSEPORT sockets (dup doesn't
> count), nor closing the final copy of a given socket.
>
> This is basically the same thing that was needed not to lose incoming
> connections in a pre-SO_REUSEPORT world.
> (no SO_REUSEADDR by itself doesn't prevent an incoming SYN from
> triggering a RST during the server restart, it just makes the window
> when RSTs happen shorter)
>
> This was from day one (I reported to Tom and worked with him on the
> very initial distribution function) envisioned to work like this,
> and we (Google) have always used it with unix fd handoff to support
> transparent restart.
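
For illustration only, here is a minimal Python sketch of the third
approach (unix fd handoff via SCM_RIGHTS). It is not taken from any real
server: the helper names make_listener, send_listener, recv_listener and
demo are made up for this example, the "old" and "new" servers are
simulated by the two ends of a socketpair inside one process, and it
assumes Linux with SO_REUSEPORT available. A client connects before the
handoff, the old side then closes its copy, and the new side should
still be able to accept the queued connection because the final copy of
the socket was never closed.

```python
import array
import socket


def make_listener(host="127.0.0.1", port=0):
    """Create the one and only SO_REUSEPORT listening socket (hypothetical helper)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind((host, port))
    s.listen(16)
    return s


def send_listener(chan, listener):
    """Old server side: pass the listening fd over a unix socket."""
    fds = array.array("i", [listener.fileno()])
    # One byte of ordinary payload carries the SCM_RIGHTS control message.
    chan.sendmsg([b"L"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_listener(chan):
    """New server side: receive the fd and wrap it in a socket object."""
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = chan.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Drop any trailing partial int before decoding.
            fds.frombytes(data[: len(data) - (len(data) % fds.itemsize)])
    if not fds:
        raise RuntimeError("no file descriptor received")
    # Family/type/proto are detected from the fd itself.
    return socket.socket(fileno=fds[0])


def demo():
    # The two ends stand in for the old and the new server process.
    old_end, new_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    listener = make_listener()
    port = listener.getsockname()[1]

    # A client connects while the old server still owns the listener.
    client = socket.create_connection(("127.0.0.1", port))

    send_listener(old_end, listener)
    inherited = recv_listener(new_end)
    inherited_port = inherited.getsockname()[1]

    # The old server closes its copy; this is not the final copy.
    listener.close()

    # The new server accepts the connection queued before the handoff.
    conn, _peer = inherited.accept()
    conn.sendall(b"ok")
    reply = client.recv(2)

    for s in (conn, client, inherited, old_end, new_end):
        s.close()
    return port, inherited_port, reply


if __name__ == "__main__":
    print(demo())
```

In a real restart the two ends would live in different processes (for
example connected through a named unix socket), and the old server would
keep serving its already accepted connections until they finish or time
out.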