On Thu, Mar 12, 2020 at 04:36:44PM -0700, Joe Stringer wrote:
> Add support for TPROXY via a new bpf helper, bpf_sk_assign().
>
> This helper requires the BPF program to discover the socket via a call
> to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
> helper takes its own reference to the socket in addition to any existing
> reference that may or may not currently be obtained for the duration of
> BPF processing. For the destination socket to receive the traffic, the
> traffic must be routed towards that socket via local route, the socket

I also could not find the local route check in the patch.
Is it implied by the sk having been found by bpf_sk*_lookup_*()?

> must have the transparent option enabled out-of-band, and the socket
> must not be closing. If all of these conditions hold, the socket will be
> assigned to the skb to allow delivery to the socket.
>
> The recently introduced dst_sk_prefetch is used to communicate from the
> TC layer to the IP receive layer that the socket should be retained
> across the receive. The dst_sk_prefetch destination wraps any existing
> destination (if available) and stores it temporarily in a per-cpu var.
>
> To ensure that no dst references held by the skb prior to sk_assign()
> are lost, they are stored in the per-cpu variable associated with
> dst_sk_prefetch. When the BPF program invocation from the TC action
> completes, we check the return code against TC_ACT_OK and if any other
> return code is used, we restore the dst to avoid unintentionally leaking
> the reference held in the per-CPU variable. If the packet is cloned or
> dropped before reaching ip{,6}_rcv_core(), the original dst will also be
> restored from the per-cpu variable to avoid the leak; if the packet makes
> its way to the receive function for the protocol, then the destination
> (if any) will be restored to the packet at that point.
>

[ ... ]

> diff --git a/net/core/filter.c b/net/core/filter.c
> index cd0a532db4e7..bae0874289d8 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5846,6 +5846,32 @@ static const struct bpf_func_proto bpf_tcp_gen_syncookie_proto = {
>  	.arg5_type = ARG_CONST_SIZE,
>  };
>
> +BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
> +{
> +	if (flags != 0)
> +		return -EINVAL;
> +	if (!skb_at_tc_ingress(skb))
> +		return -EOPNOTSUPP;
> +	if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> +		return -ENOENT;
> +
> +	skb_orphan(skb);
> +	skb->sk = sk;

sk comes from bpf_sk*_lookup_*(), which does not consider the bpf_prog
installed via SO_ATTACH_REUSEPORT_EBPF. So far the use case has been
limited to sk inspection, but this helper now supports selecting a
particular sk to receive traffic. Any plan to support that?

> +	skb->destructor = sock_edemux;
> +	dst_sk_prefetch_store(skb);
> +
> +	return 0;
> +}
> +

[ ... ]

> diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> index aa438c6758a7..9bd4858d20fc 100644
> --- a/net/ipv4/ip_input.c
> +++ b/net/ipv4/ip_input.c
> @@ -509,7 +509,10 @@ static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
>  	IPCB(skb)->iif = skb->skb_iif;
>
>  	/* Must drop socket now because of tproxy. */
> -	skb_orphan(skb);
> +	if (skb_dst_is_sk_prefetch(skb))
> +		dst_sk_prefetch_fetch(skb);
> +	else
> +		skb_orphan(skb);
>
>  	return skb;
>
> diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
> index 7b089d0ac8cd..f7b42adca9d0 100644
> --- a/net/ipv6/ip6_input.c
> +++ b/net/ipv6/ip6_input.c
> @@ -285,7 +285,10 @@ static struct sk_buff *ip6_rcv_core(struct sk_buff *skb, struct net_device *dev,
>  	rcu_read_unlock();
>
>  	/* Must drop socket now because of tproxy. */
> -	skb_orphan(skb);
> +	if (skb_dst_is_sk_prefetch(skb))
> +		dst_sk_prefetch_fetch(skb);
> +	else
> +		skb_orphan(skb);

If I understand it correctly, this new test is there to skip the
skb_orphan() call for a locally routed skb. Do other cases (forward?)
still depend on skb_orphan() being called here?

>
>  	return skb;
>  err:
> diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
> index 46f47e58b3be..b4c557e6158d 100644
> --- a/net/sched/act_bpf.c
> +++ b/net/sched/act_bpf.c
> @@ -11,6 +11,7 @@
>  #include <linux/filter.h>
>  #include <linux/bpf.h>
>
> +#include <net/dst_metadata.h>
>  #include <net/netlink.h>
>  #include <net/pkt_sched.h>
>  #include <net/pkt_cls.h>
> @@ -53,6 +54,8 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act,
>  		bpf_compute_data_pointers(skb);
>  		filter_res = BPF_PROG_RUN(filter, skb);
>  	}
> +	if (filter_res != TC_ACT_OK)
> +		dst_sk_prefetch_reset(skb);
>  	rcu_read_unlock();
>
>  	/* A BPF program may overwrite the default action opcode.
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 40b2d9476268..546e9e1368ff 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -2914,6 +2914,21 @@ union bpf_attr {
>   *		of sizeof(struct perf_branch_entry).
>   *
>   *		**-ENOENT** if architecture does not support branch records.
> + *
> + * int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
> + *	Description
> + *		Assign the *sk* to the *skb*.
> + *
> + *		This operation is only valid from TC ingress path.
> + *
> + *		The *flags* argument must be zero.
> + *	Return
> + *		0 on success, or a negative errno in case of failure.
> + *
> + *		* **-EINVAL**		Unsupported flags specified.
> + *		* **-EOPNOTSUPP**:	Unsupported operation, for example a
> + *					call from outside of TC ingress.
> + *		* **-ENOENT**		The socket cannot be assigned.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -3035,7 +3050,8 @@ union bpf_attr {
>  	FN(tcp_send_ack),		\
>  	FN(send_signal_thread),		\
>  	FN(jiffies64),			\
> -	FN(read_branch_records),
> +	FN(read_branch_records),	\
> +	FN(sk_assign),
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> --
> 2.20.1
>
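
For reference, the check ordering in the quoted bpf_sk_assign() hunk can be
modelled in plain userspace C. This is only a sketch under stated
assumptions: model_sock, model_skb, model_refcount_inc_not_zero(),
model_skb_orphan() and model_sk_assign() are hypothetical stand-ins, not
kernel APIs. The destructor assignment and the dst_sk_prefetch_store() step
are left out, and the orphan step assumes the skb owned exactly one
reference on its previous socket.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for struct sock and struct sk_buff. */
struct model_sock {
	unsigned int refcnt;		/* models sk->sk_refcnt */
};

struct model_skb {
	struct model_sock *sk;		/* models skb->sk */
	bool at_tc_ingress;		/* models skb_at_tc_ingress(skb) */
};

/* Take a reference only if the object is still alive (count > 0). */
static bool model_refcount_inc_not_zero(unsigned int *cnt)
{
	if (*cnt == 0)
		return false;
	(*cnt)++;
	return true;
}

/*
 * Detach the previous socket, if any. Assumption: the skb held one
 * reference on it, which is dropped here.
 */
static void model_skb_orphan(struct model_skb *skb)
{
	if (skb->sk) {
		if (skb->sk->refcnt > 0)
			skb->sk->refcnt--;
		skb->sk = NULL;
	}
}

/* Same order of checks as the quoted helper: flags, hook, then refcount. */
static int model_sk_assign(struct model_skb *skb, struct model_sock *sk,
			   uint64_t flags)
{
	if (flags != 0)
		return -EINVAL;
	if (!skb->at_tc_ingress)
		return -EOPNOTSUPP;
	if (!model_refcount_inc_not_zero(&sk->refcnt))
		return -ENOENT;

	model_skb_orphan(skb);
	skb->sk = sk;

	return 0;
}
```

The point of the ordering is that no state is touched until all three
checks pass: a failed call leaves both the skb and the socket's reference
count as they were, and the old socket is only dropped once the new
reference has been taken.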