From: Martin KaFai Lau <kafai@xxxxxx>
Date: Wed, 18 Nov 2020 15:50:17 -0800

> On Tue, Nov 17, 2020 at 06:40:18PM +0900, Kuniyuki Iwashima wrote:
> > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > which is used only by inet_unhash(). If it is not NULL,
> > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > sockets from the closing listener to the selected one.
> >
> > Listening sockets hold incoming connections as a linked list of struct
> > request_sock in the accept queue, and each request has reference to a full
> > socket and its listener. In inet_csk_reqsk_queue_migrate(), we unlink the
> > requests from the closing listener's queue and relink them to the head of
> > the new listener's queue. We do not process each request, so the migration
> > completes in O(1) time complexity. However, in the case of TCP_SYN_RECV
> > sockets, we will take special care in the next commit.
> >
> > By default, we select the last element of socks[] as the new listener.
> > This behaviour is based on how the kernel moves sockets in socks[].
> >
> > For example, we call listen() for four sockets (A, B, C, D), and close the
> > first two by turns. The sockets move in socks[] like below. (See also [1])
> >
> >   socks[0] : A <-.      socks[0] : D          socks[0] : D
> >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> >   socks[2] : C   |      socks[2] : C --'
> >   socks[3] : D --'
> >
> > Then, if C and D have newer settings than A and B, and each socket has a
> > request (a, b, c, d) in their accept queue, we can redistribute old
> > requests evenly to new listeners.
> I don't think it should emphasize/claim there is a specific way that
> the kernel-pick here can redistribute the requests evenly. It depends on
> how the application close/listen. The userspace can not expect the
> ordering of socks[] will behave in a certain way.

I've expected replacing listeners by generations as a general use case.
But exactly.
Users should not rely on undocumented kernel internals.

> The primary redistribution policy has to depend on BPF which is the
> policy defined by the user based on its application logic (e.g. how
> its binary restart work). The application (and bpf) knows which one
> is a dying process and can avoid distributing to it.
>
> The kernel-pick could be an optional fallback but not a must. If the bpf
> prog is attached, I would even go further to call bpf to redistribute
> regardless of the sysctl, so I think the sysctl is not necessary.

I also think it is just an optional fallback, but to pick out a different
listener every time, choosing the moved socket was reasonable. So the even
redistribution for a specific use case is a side effect of such socket
selection.

But, users should decide to use either way:

  (1) let the kernel select a new listener randomly
  (2) select a particular listener by eBPF

I will update the commit message like:

  The kernel selects a new listener randomly, but as a side effect, it can
  redistribute packets evenly for a specific case where an application
  replaces listeners by generations.
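
For reference, the socks[] movement in the quoted example can be reproduced
with a small userspace sketch. This is not the kernel code: struct group,
group_add() and group_detach() are hypothetical names used only for
illustration, listeners are modelled as single letters, and the choice made
when the detached socket is itself the last element is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SOCKS 8

/* Hypothetical stand-in for a reuseport group; not the kernel's struct. */
struct group {
	char socks[MAX_SOCKS];	/* one letter per listener */
	size_t num_socks;
};

/* Append a listener, roughly what listen() does for the group.
 * Returns 0 on success, -1 if the group is full.
 */
static int group_add(struct group *g, char sk)
{
	assert(g != NULL);
	if (g->num_socks >= MAX_SOCKS)
		return -1;
	g->socks[g->num_socks++] = sk;
	return 0;
}

/* Detach sk by moving the last element into the vacated slot.
 * Returns the listener chosen as the migration target in this sketch:
 * the moved socket, or (assumption) the new last element when sk was
 * itself the last one.  Returns 0 if sk is not found or no listener is
 * left.
 */
static char group_detach(struct group *g, char sk)
{
	size_t i;

	assert(g != NULL);
	for (i = 0; i < g->num_socks; i++) {
		if (g->socks[i] != sk)
			continue;

		g->num_socks--;
		if (g->num_socks == 0)
			return 0;	/* nobody left to migrate to */

		if (i == g->num_socks)	/* sk was the last element */
			return g->socks[g->num_socks - 1];

		g->socks[i] = g->socks[g->num_socks];
		return g->socks[i];
	}
	return 0;
}
```

Adding A, B, C, D and then detaching A and B in turn leaves socks[] as
D, B, C and then D, C, with D and C returned as the targets. This matches
the diagram, but only for that particular close order, which is why the
ordering should not be treated as a guarantee.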