From: Maciej Żenczykowski <zenczykowski@xxxxxxxxx>
Date: Tue, 27 Apr 2021 15:00:12 -0700
> On Tue, Apr 27, 2021 at 2:55 PM Maciej Żenczykowski
> <zenczykowski@xxxxxxxxx> wrote:
> > On Mon, Apr 26, 2021 at 8:47 PM Kuniyuki Iwashima <kuniyu@xxxxxxxxxxxx> wrote:
> > > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > > accept connections evenly. However, there is a defect in the current
> > > implementation [1]. When a SYN packet is received, the connection is tied
> > > to a listening socket. Accordingly, when the listener is closed, in-flight
> > > requests during the three-way handshake and child sockets in the accept
> > > queue are dropped even if other listeners on the same port could accept
> > > such connections.
> > >
> > > This situation can happen when various server management tools restart
> > > server (such as nginx) processes. For instance, when we change nginx
> > > configurations and restart it, it spins up new workers that respect the new
> > > configuration and closes all listeners on the old workers, resulting in the
> > > in-flight ACK of 3WHS is responded by RST.
> > >
> > This is IMHO a userspace bug.

I do not think so.

If the kernel selected another listener for incoming connections, they could
be accept()ed. There is no room for userspace to change the behaviour without
an in-kernel tool, eBPF. A feature that can cause failure stochastically due
to kernel behaviour cannot be a userspace bug.

> >
> > You should never be closing or creating new SO_REUSEPORT sockets on a
> > running server (listening port).
> >
> > There's at least 3 ways to accomplish this.
> >
> > One involves a shim parent process that takes care of creating the
> > sockets (without close-on-exec),
> > then fork-exec's the actual server process[es] (which will use the
> > already opened listening fds),
> > and can thus re-fork-exec a new child while using the same set of sockets.
> > Here the old server can terminate before the new one starts.
> >
> > (one could even envision systemd being modified to support this...)
> >
> > The second involves the old running server fork-execing the new server
> > and handing off the non-CLOEXEC sockets that way.
>
> (this doesn't even need to be fork-exec -- can just be exec -- and is
> potentially easier)
>
> > The third approach involves unix fd passing of sockets to hand off the
> > listening sockets from the old process/thread(s) to the new
> > process/thread(s). Once handed off the old server can stop accept'ing
> > on the listening sockets and close them (the real copies are in the
> > child), finish processing any still active connections (or time them
>
> (this doesn't actually need to be a child, in can be an entirely new
> parallel instance of the server,
> potentially running in an entirely new container/cgroup setup, though
> in the same network namespace)
>
> > out) and terminate.
> >
> > Either way you're never creating new SO_REUSEPORT sockets (dup doesn't
> > count), nor closing the final copy of a given socket.

Indeed, each approach can be an option, but each makes the application more
complicated. Also, if the process holding the listener fds dies, there could
be downtime. I do not think every approach works well everywhere for
everyone.

> >
> > This is basically the same thing that was needed not to lose incoming
> > connections in a pre-SO_REUSEPORT world.
> > (no SO_REUSEADDR by itself doesn't prevent an incoming SYN from
> > triggering a RST during the server restart, it just makes the window
> > when RSTs happen shorter)

SO_REUSEPORT makes each process/listener independent, so we need not pass
fds, which makes the application much simpler. Even with SO_REUSEPORT, one
listener might crash, but that is more tolerable than losing all connections
at once. To enjoy such merits, isn't it natural to improve the existing
feature in this post-SO_REUSEPORT world?
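
To illustrate what "independent" means here, below is a minimal Python sketch.
It is not taken from the patch series; the helper name make_listener and the
use of the loopback address are assumptions made only for this example. Two
sockets bind the same port with SO_REUSEPORT without sharing any fd, and the
kernel decides which of them receives a given connection.

```python
import select
import socket


def make_listener(port: int = 0) -> socket.socket:
    """Create a TCP listener with SO_REUSEPORT set before bind()."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen(16)
    return s


# The first listener lets the kernel choose a free port; the second one
# binds the very same port, as a second worker process would.
first = make_listener()
port = first.getsockname()[1]
second = make_listener(port)

# The connection is tied to whichever listener the kernel selects.
client = socket.create_connection(("127.0.0.1", port))
ready, _, _ = select.select([first, second], [], [], 1.0)
conn, peer = ready[0].accept()
```

If the selected listener were closed before accept(), the connection sitting
in its accept queue would be dropped even though the other listener is still
open, which is the behaviour discussed in this thread.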
> >
> > This was from day one (I reported to Tom and worked with him on the
> > very initial distribution function) envisioned to work like this,
> > and we (Google) have always used it with unix fd handoff to support
> > transparent restart.
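
For reference, the unix fd handoff mentioned above can be sketched roughly as
follows. This is only an illustration under stated assumptions, not how any
particular server implements it: the helper names send_listener and
recv_listener are hypothetical, and a socketpair inside one process stands in
for the channel between the old and the new server.

```python
import array
import socket


def send_listener(channel: socket.socket, listener: socket.socket) -> None:
    """Pass the listening fd over a unix socket using SCM_RIGHTS."""
    fds = array.array("i", [listener.fileno()])
    channel.sendmsg([b"L"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_listener(channel: socket.socket) -> socket.socket:
    """Receive one fd and wrap it in a socket object."""
    fds = array.array("i")
    _, ancdata, _, _ = channel.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any trailing bytes that do not form a whole int.
            fds.frombytes(data[: len(data) - (len(data) % fds.itemsize)])
    if not fds:
        raise RuntimeError("no file descriptor received")
    return socket.socket(fileno=fds[0])


# One process plays both roles here, purely for demonstration.
old_end, new_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
listener = socket.create_server(("127.0.0.1", 0))
port = listener.getsockname()[1]

send_listener(old_end, listener)
inherited = recv_listener(new_end)
listener.close()  # the old server drops its copy; the socket stays open

client = socket.create_connection(("127.0.0.1", port))
conn, peer = inherited.accept()
```

Because the received fd refers to the same underlying socket, closing the old
copy does not close the listener, so connections arriving during the handoff
can still be accepted by the new holder.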