From: Eric Dumazet <eric.dumazet@xxxxxxxxx>
Date: Wed, 28 Apr 2021 16:18:30 +0200

> On 4/28/21 3:27 AM, Martin KaFai Lau wrote:
> > On Tue, Apr 27, 2021 at 12:38:58PM -0400, Jason Baron wrote:
> >>
> >>
> >> On 4/26/21 11:46 PM, Kuniyuki Iwashima wrote:
> >>> The SO_REUSEPORT option allows sockets to listen on the same port and to
> >>> accept connections evenly. However, there is a defect in the current
> >>> implementation [1]. When a SYN packet is received, the connection is tied
> >>> to a listening socket. Accordingly, when the listener is closed, in-flight
> >>> requests during the three-way handshake and child sockets in the accept
> >>> queue are dropped even if other listeners on the same port could accept
> >>> such connections.
> >>>
> >>> This situation can happen when various server management tools restart
> >>> server (such as nginx) processes. For instance, when we change nginx
> >>> configurations and restart it, it spins up new workers that respect the new
> >>> configuration and closes all listeners on the old workers, resulting in the
> >>> in-flight ACK of the 3WHS being answered with an RST.
> >>
> >> Hi Kuniyuki,
> >>
> >> I had implemented a different approach to this that I wanted to get your
> >> thoughts about. The idea is to use unix sockets and SCM_RIGHTS to pass the
> >> listen fd (or any other fd) around. Currently, if you have an 'old' webserver
> >> that you want to replace with a 'new' webserver, you would need a separate
> >> process to receive the listen fd and then have that process send the fd to
> >> the new webserver, if they are not running concurrently. So instead what
> >> I'm proposing is a 'delayed close' for a unix socket.
> >> That is, one could do:
> >>
> >> 1) bind unix socket with path '/sockets'
> >> 2) sendmsg() the listen fd via the unix socket
> >> 3) setsockopt() some 'timeout' on the unix socket (maybe 10 seconds or so)
> >> 4) exit/close the old webserver and the listen socket
> >> 5) start the new webserver
> >> 6) create new unix socket and bind to '/sockets' (if it has MAY_WRITE file permissions)
> >> 7) recvmsg() the listen fd
> >>
> >> So the idea is that we set a timeout on the unix socket. If the new process
> >> does not start and bind to the unix socket, it simply closes, thus releasing
> >> the listen socket. However, if it does bind it can now call recvmsg() and
> >> use the listen fd as normal. It can then simply continue to use the old listen
> >> fds and/or create new ones and drain the old ones.
> >>
> >> Thus, the old and new webservers do not have to run concurrently. This doesn't
> >> involve any changes to the tcp layer and can be used to pass any type of fd.
> >> Not sure if it's actually useful for anything else though.
> > We also used to do tcp-listen(/udp) fd transfer because the new process can not
> > bind to the same IP:PORT in the old kernel without SO_REUSEPORT. Some of the
> > services listen to many different IP:PORT(s). Transferring all of them
> > was ok-ish but the old and new process do not necessarily listen to the same set
> > of IP:PORT(s) (e.g. the config may have changed during restart) and it further
> > complicates the fd transfer logic in the userspace.
> >
> > It was then moved to SO_REUSEPORT. The new process can create its listen fds
> > without depending on the old process. It pretty much starts as if there is
> > no old process. There is no need to transfer the fds, which simplified the
> > userspace logic. The old and new process can work independently. The old and
> > new process still run concurrently for a brief time period to avoid service
> > disruption.
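
For reference, the fd hand-off in the steps quoted above can be sketched with
the SCM_RIGHTS support that exists today. This is only an illustration in
Python, not Jason's patch: the proposed 'delayed close' timeout does not
exist, so a socketpair() stands in for the unix socket bound at '/sockets',
and the helper names (send_listen_fd, recv_listen_fd, demo) are made up for
the example.

```python
import array
import socket


def send_listen_fd(unix_sock, listen_sock):
    """Send one file descriptor as SCM_RIGHTS ancillary data."""
    fds = array.array("i", [listen_sock.fileno()])
    # At least one byte of ordinary data has to accompany the control message.
    unix_sock.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_listen_fd(unix_sock):
    """Receive one descriptor and wrap it in a socket object."""
    fds = array.array("i")
    _data, ancdata, _flags, _addr = unix_sock.recvmsg(
        1, socket.CMSG_LEN(fds.itemsize)
    )
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(cdata[: fds.itemsize])
            # Family and type are detected from the descriptor itself.
            return socket.socket(fileno=fds[0])
    raise RuntimeError("no file descriptor received")


def demo():
    # Assumption: the socketpair replaces the '/sockets' rendezvous path.
    old_end, new_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

    # "Old webserver": create a listener and queue its fd on the unix socket.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(16)
    port = listener.getsockname()[1]
    send_listen_fd(old_end, listener)

    # The old side goes away; the queued fd keeps the listen socket alive.
    listener.close()
    old_end.close()

    # "New webserver": pick up the listener and accept on it.
    inherited = recv_listen_fd(new_end)
    new_end.close()
    inherited_port = inherited.getsockname()[1]

    client = socket.create_connection(("127.0.0.1", port))
    conn, _peer = inherited.accept()
    client.sendall(b"ping")
    data = conn.recv(4)

    for s in (client, conn, inherited):
        s.close()
    return port, inherited_port, data
```

Calling demo() returns the original port, the port seen through the received
descriptor, and the bytes read on the accepted connection. The part that the
proposal would add, keeping the queued fd alive for a bounded time with no
process holding either end, is not modelled here.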
>
> Note that another technique is to force syncookies during the switch of old/new servers.
>
> echo 2 >/proc/sys/net/ipv4/tcp_syncookies
>
> If there is interest, we could add a socket option to override the sysctl on a per-socket basis.

It can be a work-around but syncookies has its own downside. Forcing it
may lose some valuable TCP options. If there is an approach without
syncookies, it is better.
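
For completeness, the sysctl toggle quoted above could be scoped to the
restart window roughly as follows. This is a sketch under assumptions: the
helper name forced_syncookies and its path parameter are invented for the
example, writing the real sysctl needs root, and the downside mentioned above
(lost TCP options) applies for as long as the value stays at 2.

```python
import contextlib

SYNCOOKIES_PATH = "/proc/sys/net/ipv4/tcp_syncookies"


@contextlib.contextmanager
def forced_syncookies(path=SYNCOOKIES_PATH):
    """Write 2 to the sysctl for the duration of the block, then restore it."""
    with open(path) as f:
        previous = f.read().strip()
    # Same effect as: echo 2 >/proc/sys/net/ipv4/tcp_syncookies
    with open(path, "w") as f:
        f.write("2\n")
    try:
        yield previous
    finally:
        # Put the earlier value back even if the restart step failed.
        with open(path, "w") as f:
            f.write(previous + "\n")
```

A restart tool would wrap only the switch itself, e.g.
"with forced_syncookies(): restart_workers()", where restart_workers is a
placeholder for whatever stops the old listeners and starts the new ones.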