On 4/28/21 3:27 AM, Martin KaFai Lau wrote:
> On Tue, Apr 27, 2021 at 12:38:58PM -0400, Jason Baron wrote:
>>
>> On 4/26/21 11:46 PM, Kuniyuki Iwashima wrote:
>>> The SO_REUSEPORT option allows sockets to listen on the same port and to
>>> accept connections evenly. However, there is a defect in the current
>>> implementation [1]. When a SYN packet is received, the connection is tied
>>> to a listening socket. Accordingly, when the listener is closed, in-flight
>>> requests during the three-way handshake and child sockets in the accept
>>> queue are dropped even if other listeners on the same port could accept
>>> such connections.
>>>
>>> This situation can happen when various server management tools restart
>>> server (such as nginx) processes. For instance, when we change nginx
>>> configurations and restart it, it spins up new workers that respect the new
>>> configuration and closes all listeners on the old workers, resulting in the
>>> in-flight ACK of the 3WHS being answered with an RST.
>>
>> Hi Kuniyuki,
>>
>> I had implemented a different approach to this that I wanted to get your
>> thoughts about. The idea is to use unix sockets and SCM_RIGHTS to pass the
>> listen fd (or any other fd) around. Currently, if you have an 'old' webserver
>> that you want to replace with a 'new' webserver, you would need a separate
>> process to receive the listen fd and then have that process send the fd to
>> the new webserver, if they are not running concurrently. So instead what
>> I'm proposing is a 'delayed close' for a unix socket.
>> That is, one could do:
>>
>> 1) bind unix socket with path '/sockets'
>> 2) sendmsg() the listen fd via the unix socket
>> 3) setsockopt() some 'timeout' on the unix socket (maybe 10 seconds or so)
>> 4) exit/close the old webserver and the listen socket
>> 5) start the new webserver
>> 6) create new unix socket and bind to '/sockets' (if it has MAY_WRITE file permissions)
>> 7) recvmsg() the listen fd
>>
>> So the idea is that we set a timeout on the unix socket. If the new process
>> does not start and bind to the unix socket, it simply closes, thus releasing
>> the listen socket. However, if it does bind it can now call recvmsg() and
>> use the listen fd as normal. It can then simply continue to use the old listen
>> fds and/or create new ones and drain the old ones.
>>
>> Thus, the old and new webservers do not have to run concurrently. This doesn't
>> involve any changes to the tcp layer and can be used to pass any type of fd.
>> Not sure if it's actually useful for anything else though.
> We also used to do tcp-listen(/udp) fd transfer because the new process can not
> bind to the same IP:PORT in the old kernel without SO_REUSEPORT. Some of the
> services listen to many different IP:PORT(s). Transferring all of them
> was ok-ish but the old and new process do not necessarily listen to the same set
> of IP:PORT(s) (e.g. the config may have changed during restart) and it further
> complicates the fd transfer logic in the userspace.
>
> It was then moved to SO_REUSEPORT. The new process can create its listen fds
> without depending on the old process. It pretty much starts as if there is
> no old process. There is no need to transfer the fds, which simplified the
> userspace logic. The old and new process can work independently. The old and
> new process still run concurrently for a brief time period to avoid service
> disruption.
>

Note that another technique is to force syncookies during the switch of old/new servers.

    echo 2 >/proc/sys/net/ipv4/tcp_syncookies

If there is interest, we could add a socket option to override the sysctl on a per-socket basis.
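
For readers following the quoted SCM_RIGHTS discussion, here is a minimal sketch of the existing fd hand-off mechanism that the proposal builds on. It is only an illustration under stated assumptions: the helper names (send_fd, recv_fd, handoff_demo) are made up for this example, both ends live in one process and use a socketpair rather than a unix socket bound to a path, and the proposed 'delayed close' timeout is not shown because it is not an existing socket option.

```python
import array
import socket


def send_fd(chan: socket.socket, fd: int) -> None:
    """Send one open file descriptor over a unix socket via SCM_RIGHTS."""
    payload = array.array("i", [fd])
    # Send one byte of regular data alongside the ancillary message,
    # as unix(7) recommends for stream sockets.
    chan.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, payload)])


def recv_fd(chan: socket.socket) -> int:
    """Receive one file descriptor sent with send_fd() (hypothetical helper)."""
    fds = array.array("i")
    _, ancdata, _, _ = chan.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any trailing bytes that do not form a whole int.
            usable = len(data) - (len(data) % fds.itemsize)
            fds.frombytes(data[:usable])
    if not fds:
        raise RuntimeError("no file descriptor received")
    return fds[0]


def handoff_demo() -> bool:
    """Pass a TCP listener through a unix socketpair and accept on the copy."""
    old = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    old.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    old.listen(8)
    addr = old.getsockname()

    # Stand-in for the unix socket shared by the 'old' and 'new' webserver.
    tx, rx = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    send_fd(tx, old.fileno())
    old.close()  # the fd in flight keeps the listening socket alive

    # The "new server" side: rebuild a socket object around the received fd.
    new = socket.socket(fileno=recv_fd(rx))
    new.settimeout(5)
    client = socket.create_connection(addr, timeout=5)
    conn, _ = new.accept()
    same_listener = new.getsockname() == addr

    for s in (conn, client, new, tx, rx):
        s.close()
    return same_listener


if __name__ == "__main__":
    print("listener survived hand-off:", handoff_demo())
```

The point of the sketch is the old.close() call: the listening socket stays open while its fd sits in the unix socket's queue, which is the window the quoted proposal would bound with a timeout.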