On Wed, Mar 18, 2020 at 11:24:11PM -0700, Joe Stringer wrote: > On Wed, Mar 18, 2020 at 11:49 AM Martin KaFai Lau <kafai@xxxxxx> wrote: > > > > On Tue, Mar 17, 2020 at 05:46:58PM -0700, Joe Stringer wrote: > > > On Mon, Mar 16, 2020 at 11:27 PM Martin KaFai Lau <kafai@xxxxxx> wrote: > > > > > > > > On Mon, Mar 16, 2020 at 08:06:38PM -0700, Joe Stringer wrote: > > > > > On Mon, Mar 16, 2020 at 3:58 PM Martin KaFai Lau <kafai@xxxxxx> wrote: > > > > > > > > > > > > On Thu, Mar 12, 2020 at 04:36:44PM -0700, Joe Stringer wrote: > > > > > > > Add support for TPROXY via a new bpf helper, bpf_sk_assign(). > > > > > > > > > > > > > > This helper requires the BPF program to discover the socket via a call > > > > > > > to bpf_sk*_lookup_*(), then pass this socket to the new helper. The > > > > > > > helper takes its own reference to the socket in addition to any existing > > > > > > > reference that may or may not currently be obtained for the duration of > > > > > > > BPF processing. For the destination socket to receive the traffic, the > > > > > > > traffic must be routed towards that socket via local route, the socket > > > > > > I also missed where is the local route check in the patch. > > > > > > Is it implied by a sk can be found in bpf_sk*_lookup_*()? > > > > > > > > > > This is a requirement for traffic redirection, it's not enforced by > > > > > the patch. If the operator does not configure routing for the relevant > > > > > traffic to ensure that the traffic is delivered locally, then after > > > > > the eBPF program terminates, it will pass up through ip_rcv() and > > > > > friends and be subject to the whims of the routing table. (or > > > > > alternatively if the BPF program redirects somewhere else then this > > > > > reference will be dropped). > > > > > > > > > > Maybe there's a path to simplifying this configuration path in future > > > > > to loosen this requirement, but for now I've kept the series as > > > > > minimal as possible on that front. > > > > > > > > > > > [ ... ] > > > > > > > > > > > > > diff --git a/net/core/filter.c b/net/core/filter.c > > > > > > > index cd0a532db4e7..bae0874289d8 100644 > > > > > > > --- a/net/core/filter.c > > > > > > > +++ b/net/core/filter.c > > > > > > > @@ -5846,6 +5846,32 @@ static const struct bpf_func_proto bpf_tcp_gen_syncookie_proto = { > > > > > > > .arg5_type = ARG_CONST_SIZE, > > > > > > > }; > > > > > > > > > > > > > > +BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags) > > > > > > > +{ > > > > > > > + if (flags != 0) > > > > > > > + return -EINVAL; > > > > > > > + if (!skb_at_tc_ingress(skb)) > > > > > > > + return -EOPNOTSUPP; > > > > > > > + if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt))) > > > > > > > + return -ENOENT; > > > > > > > + > > > > > > > + skb_orphan(skb); > > > > > > > + skb->sk = sk; > > > > > > sk is from the bpf_sk*_lookup_*() which does not consider > > > > > > the bpf_prog installed in SO_ATTACH_REUSEPORT_EBPF. > > > > > > However, the use-case is currently limited to sk inspection. > > > > > > > > > > > > It now supports selecting a particular sk to receive traffic. > > > > > > Any plan in supporting that? > > > > > > > > > > I think this is a general bpf_sk*_lookup_*() question, previous > > > > > discussion[0] settled on avoiding that complexity before a use case > > > > > arises, for both TC and XDP versions of these helpers; I still don't > > > > > have a specific use case in mind for such functionality. If we were to > > > > > do it, I would presume that the socket lookup caller would need to > > > > > pass a dedicated flag (supported at TC and likely not at XDP) to > > > > > communicate that SO_ATTACH_REUSEPORT_EBPF progs should be respected > > > > > and used to select the reuseport socket. > > > > It is more about the expectation on the existing SO_ATTACH_REUSEPORT_EBPF > > > > usecase. It has been fine because SO_ATTACH_REUSEPORT_EBPF's bpf prog > > > > will still be run later (e.g. from tcp_v4_rcv) to decide which sk to > > > > recieve the skb. > > > > > > > > If the bpf@tc assigns a TCP_LISTEN sk in bpf_sk_assign(), > > > > will the SO_ATTACH_REUSEPORT_EBPF's bpf still be run later > > > > to make the final sk decision? > > > > > > I don't believe so, no: > > > > > > ip_local_deliver() > > > -> ... > > > -> ip_protocol_deliver_rcu() > > > -> tcp_v4_rcv() > > > -> __inet_lookup_skb() > > > -> skb_steal_sock(skb) > > > > > > But this will only affect you if you are running both the bpf@tc > > > program with sk_assign() and the reuseport BPF sock programs at the > > > same time. > > I don't think it is the right answer to ask the user to be careful and > > only use either bpf_sk_assign()@tc or bpf_prog@so_reuseport. > > Applying a restriction on reuseport sockets until we sort this out per > my other email should resolve this concern. > > > > This is why I link it back to the bpf_sk*_lookup_*() > > > functions: If the socket lookup in the initial step respects reuseport > > > BPF prog logic and returns the socket using the same logic, then the > > > packet will be directed to the socket you expect. Just like how > > > non-BPF reuseport would work with this series today. > > Changing bpf_sk*_lookup_*() is a way to solve it but I don't know what it > > may run into when recurring bpf_prog, i.e. running bpf@so-reuseport inside > > bpf@tc. That may need a closer look. > > Right, that's my initial concern as well. > > One alternative might be something like: in the helper implementation, > store some bit somewhere to say "we need to resolve the reuseport > later" and then when the TC BPF program returns, check this bit and if > reuseport is necessary, trigger the BPF program for it and fix up the > socket after-the-fact. skb_dst_is_sk_prefetch() could be that bit. One major thing is that bpf@so_reuseport is currently run at the transport layer and expecting skb->data pointing to udp/tcp hdr. The ideal place is to run it there. However, the skb_dst_is_sk_prefetch() bit is currently lost at ip[6]_rcv_core. > A bit uglier though, also not sure how socket > refcounting would work there; maybe we can avoid the refcount in the > socket lookup and then fix it up in the later execution. That should not be an issue if refcnt is not taken for SOCK_RCU_FREE (e.g. TCP_LISTEN) in the first place. > > > [...] > > It is another question that I have. The TCP_LISTEN sk will suffer > > from this extra refcnt, e.g. SYNFLOOD. Can something smarter > > be done in skb->destructor? > > Can you elaborate a bit more on the idea you have here? I am thinking can skb->destructor do something like bpf_sk_release()? This patch reuses tcp sock_edemux which currently only lookups the established sk. > > Looking at the BPF API, it seems like the writer of the program can > use bpf_tcp_gen_syncookie() / bpf_tcp_check_syncookie() to generate > and check syn cookies to mitigate this kind of attack. This at least > provides an option beyond what existing tproxy implementations > provide. When the SYNACK comes back, it will still be served by a TCP_LISTEN sk. I know refcnt sucks on synflood test. I don't know what the effect may be on serving those valid synack since there is no need to measure after SOCK_RCU_FREE is in ;) UDP is also in SOCK_RCU_FREE. I think only early_demux, which seems to be for connected only, takes a refnct. btw, it may be a good idea to add a udp test. I am fine to push them to optimize/support later bucket It is still good to explore a little more such that we don't regret later. > > > In general, it took me a while to wrap my head around thinking > > how a skb->_skb_refdst is related to assigning a sk to skb->sk. > > My understanding is it is a way to tell when not to call > > skb_orphan() here. Have you considered other options (e.g. > > using a bit in skb->sk)? It will be useful to explain > > them in the commit message. > > Good point, I did briefly explore that initially and it looked a lot > more invasive. With that approach, any time we do some kind of socket > handling (assign, release, steal, etc.) we have this extra bit we have to > deal with and decide whether we need to specially handle it. > skb->_skb_refdst already has this ugliness (see skb_dst() and friends) > so on a practical note it seemed less invasive to me to reuse that > infrastructure. > > Conceptually I was looking at this as a metadata destination similar > to the referred patches in one of the earlier commit messages. We > associate this special socket destination initially, to tell ip_rcv() > that we really do need to retain this socket and not just orphan > it/continue with the regular destination selection logic. > > I can roll this explanation into the series header and/or commit > messages as well.