On Tue, Nov 17, 2020 at 06:40:18PM +0900, Kuniyuki Iwashima wrote:
> This patch lets reuseport_detach_sock() return a pointer of struct sock,
> which is used only by inet_unhash(). If it is not NULL,
> inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> sockets from the closing listener to the selected one.
>
> Listening sockets hold incoming connections as a linked list of struct
> request_sock in the accept queue, and each request has reference to a full
> socket and its listener. In inet_csk_reqsk_queue_migrate(), we unlink the
> requests from the closing listener's queue and relink them to the head of
> the new listener's queue. We do not process each request, so the migration
> completes in O(1) time complexity. However, in the case of TCP_SYN_RECV
> sockets, we will take special care in the next commit.
>
> By default, we select the last element of socks[] as the new listener.
> This behaviour is based on how the kernel moves sockets in socks[].
>
> For example, we call listen() for four sockets (A, B, C, D), and close the
> first two by turns. The sockets move in socks[] like below. (See also [1])
>
>   socks[0] : A <-.      socks[0] : D          socks[0] : D
>   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
>   socks[2] : C   |      socks[2] : C --'
>   socks[3] : D --'
>
> Then, if C and D have newer settings than A and B, and each socket has a
> request (a, b, c, d) in their accept queue, we can redistribute old
> requests evenly to new listeners.

I don't think it should emphasize/claim that there is a specific way in
which the kernel-pick here can redistribute the requests evenly. It
depends on how the application closes and listens. Userspace cannot
expect the ordering of socks[] to behave in a certain way.

The primary redistribution policy has to depend on BPF, which is the
policy defined by the user based on its application logic (e.g. how its
binary restart works). The application (and bpf) knows which one is a
dying process and can avoid distributing to it.

The kernel-pick could be an optional fallback but not a must. If the bpf
prog is attached, I would even go further to call bpf to redistribute
regardless of the sysctl, so I think the sysctl is not necessary.

>
>   socks[0] : A (a) <-.      socks[0] : D (a + d)      socks[0] : D (a + d)
>   socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b + c)
>   socks[2] : C (c)   |      socks[2] : C (c) --'
>   socks[3] : D (d) --'
>
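
To make the ordering dependence concrete, here is a minimal userspace
sketch of the movement shown in the diagram. It is not kernel code:
struct listener, struct group, add_listener(), detach(), migrate() and
close_listener() are hypothetical names, the accept queue is modelled as
a string with one character per request, and the branch for closing the
last element of socks[] is my assumption, since the quoted text does not
cover that case.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_SOCKS 8
#define QLEN 32

/* Hypothetical stand-in for a listener and its accept queue. */
struct listener {
	char name;
	char queue[QLEN];	/* pending requests, one char each */
};

/* Hypothetical stand-in for the reuseport group's socks[] array. */
struct group {
	struct listener *socks[MAX_SOCKS];
	int num_socks;
};

static void add_listener(struct group *g, struct listener *sk)
{
	assert(g->num_socks < MAX_SOCKS);
	g->socks[g->num_socks++] = sk;
}

/*
 * Remove sk as in the diagram: the last element moves into the vacated
 * slot. Returns the listener picked to inherit sk's requests, which is
 * the last element at the time of the call, or NULL if none is left.
 */
static struct listener *detach(struct group *g, struct listener *sk)
{
	struct listener *last;
	int i;

	for (i = 0; i < g->num_socks; i++)
		if (g->socks[i] == sk)
			break;
	if (i == g->num_socks)
		return NULL;	/* sk is not in this group */

	last = g->socks[g->num_socks - 1];
	g->socks[i] = last;
	g->num_socks--;

	if (last != sk)
		return last;
	/* Assumption: when sk itself was last, pick the new last one. */
	return g->num_socks > 0 ? g->socks[g->num_socks - 1] : NULL;
}

/* Relink all of from's requests to the head of to's queue. */
static void migrate(struct listener *from, struct listener *to)
{
	char merged[QLEN];

	snprintf(merged, sizeof(merged), "%s%s", from->queue, to->queue);
	strcpy(to->queue, merged);
	from->queue[0] = '\0';
}

static void close_listener(struct group *g, struct listener *sk)
{
	struct listener *target = detach(g, sk);

	if (target)
		migrate(sk, target);
}
```

With listeners A, B, C, D added in that order and holding "a", "b", "c",
"d", calling close_listener() on A and then on B leaves socks[] as D, C
with queues "ad" and "bc", which matches the diagram. Closing D and then
C instead moves the requests onto the older listeners, so the even split
is a property of that particular close order and not of the kernel-pick
itself.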