On Mon, Apr 26, 2021 at 8:47 PM Kuniyuki Iwashima <kuniyu@xxxxxxxxxxxx> wrote:
> The SO_REUSEPORT option allows sockets to listen on the same port and to
> accept connections evenly. However, there is a defect in the current
> implementation [1]. When a SYN packet is received, the connection is tied
> to a listening socket. Accordingly, when the listener is closed, in-flight
> requests during the three-way handshake and child sockets in the accept
> queue are dropped even if other listeners on the same port could accept
> such connections.
>
> This situation can happen when various server management tools restart
> server (such as nginx) processes. For instance, when we change nginx
> configurations and restart it, it spins up new workers that respect the new
> configuration and closes all listeners on the old workers, resulting in the
> in-flight ACK of 3WHS is responded by RST.

This is IMHO a userspace bug.

You should never be closing or creating new SO_REUSEPORT sockets on a
running server (listening port).

There's at least 3 ways to accomplish this.

One involves a shim parent process that takes care of creating the
sockets (without close-on-exec), then fork-exec's the actual server
process[es] (which will use the already opened listening fds), and can
thus re-fork-exec a new child while using the same set of sockets.
Here the old server can terminate before the new one starts.
(one could even envision systemd being modified to support this...)

The second involves the old running server fork-execing the new server
and handing off the non-CLOEXEC sockets that way.

The third approach involves unix fd passing of sockets to hand off the
listening sockets from the old process/thread(s) to the new
process/thread(s). Once handed off the old server can stop accept'ing
on the listening sockets and close them (the real copies are in the
child), finish processing any still active connections (or time them
out) and terminate.

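To make the first (shim parent) approach concrete, here is a minimal sketch, assuming Linux and Python 3.7+. The names `WORKER`, `make_listener`, `spawn_worker`, `fetch` and `demo_shim` are hypothetical and only for illustration; a real shim would supervise long-running workers rather than one-shot ones.

```python
import socket
import subprocess
import sys

# Hypothetical worker: adopts an already-listening fd passed on argv,
# serves one connection with its tag, then exits. It never calls
# bind() or listen() itself.
WORKER = r"""
import socket, sys
lsock = socket.socket(fileno=int(sys.argv[1]))
conn, _peer = lsock.accept()
conn.sendall(sys.argv[2].encode())
conn.close()
"""

def make_listener():
    # The shim creates the SO_REUSEPORT listener exactly once.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    s.listen(16)
    return s

def spawn_worker(lsock, tag):
    # pass_fds keeps the fd open (same number) across fork+exec.
    fd = lsock.fileno()
    return subprocess.Popen(
        [sys.executable, "-c", WORKER, str(fd), tag], pass_fds=(fd,)
    )

def fetch(port):
    # Client: connect and read until the worker closes the connection.
    with socket.create_connection(("127.0.0.1", port), timeout=10) as c:
        chunks = []
        while True:
            data = c.recv(64)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode()

def demo_shim():
    lsock = make_listener()
    port = lsock.getsockname()[1]
    replies = []
    # Two worker generations in sequence on the same socket. The shim
    # keeps its copy open throughout, so the final copy is never closed
    # between the old worker exiting and the new one starting.
    for tag in ("old", "new"):
        worker = spawn_worker(lsock, tag)
        replies.append(fetch(port))
        worker.wait(timeout=10)
    lsock.close()
    return replies

if __name__ == "__main__":
    print(demo_shim())
```

Note that the client may connect before the worker has reached accept(); the connection just waits in the accept queue of the socket the shim still holds.
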
Either way you're never creating new SO_REUSEPORT sockets (dup doesn't
count), nor closing the final copy of a given socket.

This is basically the same thing that was needed not to lose incoming
connections in a pre-SO_REUSEPORT world.
(No, SO_REUSEADDR by itself doesn't prevent an incoming SYN from
triggering a RST during the server restart; it just makes the window
when RSTs happen shorter.)

This was from day one (I reported to Tom and worked with him on the
very initial distribution function) envisioned to work like this,
and we (Google) have always used it with unix fd handoff to support
transparent restart.
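
For reference, the unix fd handoff mentioned above can be sketched roughly as follows, assuming Linux and Python 3.9+ (for `socket.send_fds` / `socket.recv_fds`, which wrap SCM_RIGHTS). The function names are hypothetical, and both ends live in one process here only to keep the sketch self-contained; in a real restart `hand_off` would run in the old server and `take_over` in the new one. This is an illustration of the mechanism, not the setup actually used at Google.

```python
import socket

def hand_off(listener, channel):
    """Old server side: send the listening fd, then drop our own copy."""
    socket.send_fds(channel, [b"listener"], [listener.fileno()])
    # Not the final copy: the in-flight SCM_RIGHTS message holds a
    # reference until the receiver picks it up.
    listener.close()

def take_over(channel):
    """New server side: receive the fd and wrap it in a socket object."""
    _msg, fds, _flags, _addr = socket.recv_fds(channel, 1024, 1)
    return socket.socket(fileno=fds[0])

def demo_handoff():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    listener.bind(("127.0.0.1", 0))  # port 0: kernel picks a free port
    listener.listen(16)
    port = listener.getsockname()[1]

    # Unix-domain channel between the "old" and "new" server.
    old_end, new_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

    # A connection that completes before the handoff and is still
    # sitting in the accept queue when the old side closes its copy.
    early = socket.create_connection(("127.0.0.1", port), timeout=5)

    hand_off(listener, old_end)
    adopted = take_over(new_end)

    # A connection that arrives after the handoff.
    late = socket.create_connection(("127.0.0.1", port), timeout=5)

    adopted.settimeout(5)
    accepted = 0
    for _ in range(2):
        conn, _peer = adopted.accept()
        conn.close()
        accepted += 1

    same_port = adopted.getsockname()[1] == port
    for s in (early, late, adopted, old_end, new_end):
        s.close()
    return accepted, same_port

if __name__ == "__main__":
    print(demo_handoff())
```

Both the early and the late connection are accepted on the adopted socket, since the listening socket itself was never closed, only one reference to it.
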