Introduce a new helper that allows assigning a previously-found socket to the skb as the packet is received towards the stack, to cause the stack to guide the packet towards that socket subject to local routing configuration. This series is a spiritual successor to previous discussions on-list[0] and in-person at LPC2019[1] to support TProxy use cases more directly from eBPF programs attached at TC ingress, to simplify and streamline Linux stack configuration in scale environments with Cilium. Normally in ip{,6}_rcv_core(), the skb will be orphaned, dropping any existing socket reference associated with the skb. Existing tproxy implementations in netfilter get around this restriction by running the tproxy logic after ip_rcv_core() in the PREROUTING table. However, this is not an option for TC-based logic (including eBPF programs attached at TC ingress). This series proposes to introduce a new metadata destination, dst_sk_prefetch, which communicates from earlier paths in the stack that the socket has been prefetched and ip{,6}_rcv_core() should respect this socket selection and retain the reference on the skb. My initial implementation of the dst_sk_prefetch held no metadata and was simply a unique pointer that could be used to make this determination in the ip receive core. However, throughout the testing phase it became apparent that this minimal implementation was not enough to allow socket redirection for traffic between two local processes. Specifically, if the destination was retained as the dst_sk_prefetch metadata destination, or if the destination was dropped from the skb, then during ip{,6}_rcv_finish_core() the destination would be considered invalid and subject the packet to routing. In this case, loopback traffic from 127.0.0.1 to 127.0.0.1 would be considered martian (both martian source and martian destination) because that layer assumes that any such loopback traffic would already have the valid loopback destination configured on the skb so the routing check would be skipped. To resolve this issue, I extended dst_sk_prefetch to act as a wrapper for any existing destination (such as the loopback destination) by stashing the existing destination into a per-cpu variable for the duration of processing between TC ingress hook and the ip receive core. Since the existing destination may be reference-counted, close attention must be paid to any paths that may cause the packet to be queued for processing on another CPU, to ensure that the reference is not lost. To this end, the TC logic checks if the eBPF program return code may indicate intention to pass the packet anywhere other than up the stack; the error paths in skb cleanup handle the dst_sk_prefetch; and finally after the skb_orphan check in ip receive core the original destination reference (if available) is restored to the skb. The eBPF API extension itself, bpf_sk_assign() is pretty straightforward in that it takes an skb and socket, and associates them together. The helper takes its own reference to the socket to ensure it remains accessible beyond the release of rcu_read_lock, and the socket is associated to the skb; the subsequent release of that reference is handled by existing skb cleanup functions. Additionally, the helper associates the new dst_sk_prefetch destination with the skb to communicate the socket prefetch intention with the ingress path. Finally, tests (courtesy Lorenz Bauer) are added to validate the functionality. In addition to testing with the selftests in the tree, I have validated the runtime behaviour of the new helper by extending Cilium to make use of the functionality in lieu of existing tproxy logic. This series is laid out as follows: * Patches 1-2 prepare the dst_sk_prefetch for use by sk_assign(). * Patch 3 extends the eBPF API for sk_assign and uses dst_sk_prefetch to store the socket reference and retain it through ip receive. * Patch 4 is a minor optimization to prefetch the socket destination for established sockets. * Patches 5-7 add and extend the selftests with examples of the new functionality and validation of correct behaviour. [0] https://www.mail-archive.com/netdev@xxxxxxxxxxxxxxx/msg303645.html [1] https://linuxplumbersconf.org/event/4/contributions/464/ Joe Stringer (6): dst: Move skb_dst_drop to skbuff.c dst: Add socket prefetch metadata destinations bpf: Add socket assign support dst: Prefetch established socket destinations selftests: bpf: Extend sk_assign for address proxy selftests: bpf: Improve debuggability of sk_assign Lorenz Bauer (1): selftests: bpf: add test for sk_assign include/linux/skbuff.h | 1 + include/net/dst.h | 14 -- include/net/dst_metadata.h | 31 +++ include/uapi/linux/bpf.h | 23 +- net/core/dst.c | 44 ++++ net/core/filter.c | 28 +++ net/core/skbuff.c | 18 ++ net/ipv4/ip_input.c | 5 +- net/ipv6/ip6_input.c | 5 +- net/sched/act_bpf.c | 3 + tools/include/uapi/linux/bpf.h | 18 +- tools/testing/selftests/bpf/.gitignore | 1 + tools/testing/selftests/bpf/Makefile | 3 +- .../selftests/bpf/progs/test_sk_assign.c | 127 ++++++++++ tools/testing/selftests/bpf/test_sk_assign.c | 231 ++++++++++++++++++ tools/testing/selftests/bpf/test_sk_assign.sh | 22 ++ 16 files changed, 555 insertions(+), 19 deletions(-) create mode 100644 tools/testing/selftests/bpf/progs/test_sk_assign.c create mode 100644 tools/testing/selftests/bpf/test_sk_assign.c create mode 100755 tools/testing/selftests/bpf/test_sk_assign.sh -- 2.20.1