On Thu, Mar 26, 2020 at 09:25:51PM -0700, Joe Stringer wrote:
> Introduce a new helper that allows assigning a previously-found socket
> to the skb as the packet is received towards the stack, to cause the
> stack to guide the packet towards that socket subject to local routing
> configuration. The intention is to support TProxy use cases more
> directly from eBPF programs attached at TC ingress, to simplify and
> streamline Linux stack configuration in scale environments with Cilium.
>
> Normally in ip{,6}_rcv_core(), the skb will be orphaned, dropping any
> existing socket reference associated with the skb. Existing tproxy
> implementations in netfilter get around this restriction by running the
> tproxy logic after ip_rcv_core() in the PREROUTING table. However, this
> is not an option for TC-based logic (including eBPF programs attached at
> TC ingress).
>
> This series introduces the BPF helper bpf_sk_assign() to associate the
> socket with the skb on the ingress path as the packet is passed up the
> stack. The initial patch in the series simply takes a reference on the
> socket to ensure safety, but later patches relax this for listen
> sockets.
>
> To ensure delivery to the relevant socket, we still consult the routing
> table, for full examples of how to configure see the tests in patch #5;
> the simplest form of the route would look like this:
>
>   $ ip route add local default dev lo
>
> This series is laid out as follows:
> * Patch 1 extends the eBPF API to add sk_assign() and defines a new
>   socket free function to allow the later paths to understand when the
>   socket associated with the skb should be kept through receive.
> * Patches 2-3 optimize the receive path to avoid taking a reference on
>   listener sockets during receive.
> * Patches 4-5 extends the selftests with examples of the new
>   functionality and validation of correct behaviour.
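
For readers following along, here is a rough userspace sketch of the
idea described above: the receive path normally orphans the skb, and a
dedicated socket free function lets it recognise a socket that was
assigned earlier and keep it. This is only a toy model under my own
assumptions, not the code from the series. Every toy_* name is
hypothetical, the structs are stand-ins for struct sock and struct
sk_buff, and the reference counting is simplified to a plain integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for struct sock and struct sk_buff (hypothetical, not kernel types). */
struct toy_sock {
    int refcnt;
};

struct toy_skb {
    struct toy_sock *sk;
    void (*destructor)(struct toy_skb *skb);
};

/* Hypothetical "prefetched socket" free function: drops the reference
 * that was taken when the socket was assigned to the skb. */
static void toy_sock_pfree(struct toy_skb *skb)
{
    skb->sk->refcnt--;
}

/* Model of assigning a socket: take a reference, attach the socket, and
 * mark the skb by installing the dedicated free function. */
static void toy_sk_assign(struct toy_skb *skb, struct toy_sock *sk)
{
    assert(skb != NULL && sk != NULL);
    assert(skb->sk == NULL); /* this sketch only handles skbs with no socket yet */
    sk->refcnt++;
    skb->sk = sk;
    skb->destructor = toy_sock_pfree;
}

/* The destructor pointer doubles as the "was a socket prefetched?" marker. */
static bool toy_skb_sk_is_prefetched(const struct toy_skb *skb)
{
    return skb->destructor == toy_sock_pfree;
}

/* Model of orphaning: run the destructor, then forget the socket. */
static void toy_skb_orphan(struct toy_skb *skb)
{
    if (skb->destructor)
        skb->destructor(skb);
    skb->destructor = NULL;
    skb->sk = NULL;
}

/* Model of the receive-path check: orphan unless a socket was prefetched. */
static void toy_rcv_core(struct toy_skb *skb)
{
    if (!toy_skb_sk_is_prefetched(skb))
        toy_skb_orphan(skb);
}
```

In this model an skb that went through toy_sk_assign() still points at
its socket after toy_rcv_core(), while an skb carrying a socket with
any other (or no) destructor loses it, which matches the orphaning
behaviour the cover letter describes.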
>
> Changes since v2:
> * Add selftests for UDP socket redirection
> * Drop the early demux optimization patch (defer for more testing)
> * Fix check for orphaning after TC act return
> * Tidy up the tests to clean up properly and be less noisy.
>
> Changes since v1:
> * Replace the metadata_dst approach with using the skb->destructor to
>   determine whether the socket has been prefetched. This is much
>   simpler.
> * Avoid taking a reference on listener sockets during receive
> * Restrict assigning sockets across namespaces
> * Restrict assigning SO_REUSEPORT sockets
> * Fix cookie usage for socket dst check
> * Rebase the tests against test_progs infrastructure
> * Tidy up commit messages

lgtm.

Acked-by: Martin KaFai Lau <kafai@xxxxxx>