Jakub Sitnicki wrote: > With the realization that properly cloning listening sockets that have > psock state/callbacks is tricky, comes the second version of patches. > > The spirit of the patch set stays the same - make SOCKMAP a generic > collection for listening and established sockets. This would let us use the > SOCKMAP with reuseport today, and in the future hopefully with BPF programs > that run at socket lookup time [0]. For a bit more context, please see v1 > cover letter [1]. > > The biggest change that happened since v1 is how we deal with clearing > psock state in a copy of parent socket when cloning it (patches 3 & 4). > > As much as I did not want to touch icsk/tcp clone path, it seems > unavoidable. The changes were kept down to a minimum, with attention to not > break existing users. That said, a review from the TCP maintainer would be > invaluable (patches 3 & 4). > > Patches 1 & 2 will conflict with recently posted "Fixes for sockmap/tls > from more complex BPF progs" series [0]. I'll adapt or split them out this > series once sockmap/tls fixes from John land in bpf-next branch. Thanks I just posted a v2 of that series so once that lands we will need to respin this series. > > Some food for thought - is mixing listening and established sockets in the > same BPF map a good idea? I don't know but I couldn't find a good reason to > restrict the user. +1 in general I've been trying to avoid adding arbitrary restriction. In this case I agree I can't think of a good reason to do it for my use cases but lets not stop someone from doing it if their use case wants to for some reason. > > Considering how much the code evolved, I didn't carry over Acks from v1. Sounds good thanks for keeping this series going. > > Thanks, > jkbs > > [0] https://lore.kernel.org/bpf/157851776348.1732.12600714815781177085.stgit@ubuntu3-kvm2/T/#t > [1] https://lore.kernel.org/bpf/20191123110751.6729-1-jakub@xxxxxxxxxxxxxx/ > > v1 -> v2: > > - af_ops->syn_recv_sock callback is no longer overridden and burdened with > restoring sk_prot and clearing sk_user_data in the child socket. As child > socket is already hashed when syn_recv_sock returns, it is too late to > put it in the right state. Instead patches 3 & 4 restore sk_prot and > clear sk_user_data before we hash the child socket. (Pointed out by > Martin Lau) > > - Annotate shared access to sk->sk_prot with READ_ONCE/WRITE_ONCE macros as > we write to it from sk_msg while socket might be getting cloned on > another CPU. (Suggested by John Fastabend) > > - Convert tests for SOCKMAP holding listening sockets to return-on-error > style, and hook them up to test_progs. Also use BPF skeleton for setup. > Add new tests to cover the race scenario discovered during v1 review. > > RFC -> v1: > > - Switch from overriding proto->accept to af_ops->syn_recv_sock, which > happens earlier. Clearing the psock state after accept() does not work > for child sockets that become orphaned (never got accepted). v4-mapped > sockets need special care. > > - Return the socket cookie on SOCKMAP lookup from syscall to be on par with > REUSEPORT_SOCKARRAY. Requires SOCKMAP to take u64 on lookup/update from > syscall. > > - Make bpf_sk_redirect_map (ingress) and bpf_msg_redirect_map (egress) > SOCKMAP helpers fail when target socket is a listening one. > > - Make bpf_sk_select_reuseport helper fail when target is a TCP established > socket. > > - Teach libbpf to recognize SK_REUSEPORT program type from section name. > > - Add a dedicated set of tests for SOCKMAP holding listening sockets, > covering map operations, overridden socket callbacks, and BPF helpers. > > > Jakub Sitnicki (11): > bpf, sk_msg: Don't reset saved sock proto on restore > net, sk_msg: Annotate lockless access to sk_prot on clone > net, sk_msg: Clear sk_user_data pointer on clone if tagged > tcp_bpf: Don't let child socket inherit parent protocol ops on copy > bpf, sockmap: Allow inserting listening TCP sockets into sockmap > bpf, sockmap: Don't set up sockmap progs for listening sockets > bpf, sockmap: Return socket cookie on lookup from syscall > bpf, sockmap: Let all kernel-land lookup values in SOCKMAP > bpf: Allow selecting reuseport socket from a SOCKMAP > selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP > selftests/bpf: Tests for SOCKMAP holding listening sockets > > include/linux/skmsg.h | 14 +- > include/net/sock.h | 37 +- > include/net/tcp.h | 1 + > kernel/bpf/verifier.c | 6 +- > net/core/filter.c | 15 +- > net/core/skmsg.c | 2 +- > net/core/sock.c | 11 +- > net/core/sock_map.c | 120 +- > net/ipv4/tcp_bpf.c | 19 +- > net/ipv4/tcp_minisocks.c | 2 + > net/ipv4/tcp_ulp.c | 2 +- > net/tls/tls_main.c | 2 +- > .../bpf/prog_tests/select_reuseport.c | 60 +- > .../selftests/bpf/prog_tests/sockmap_listen.c | 1378 +++++++++++++++++ > .../selftests/bpf/progs/test_sockmap_listen.c | 76 + > tools/testing/selftests/bpf/test_maps.c | 6 +- > 16 files changed, 1696 insertions(+), 55 deletions(-) > create mode 100644 tools/testing/selftests/bpf/prog_tests/sockmap_listen.c > create mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_listen.c > > -- > 2.24.1 >