On 2021-06-18 6:42 p.m., Daniel Borkmann wrote:
> On 6/18/21 1:40 PM, Jamal Hadi Salim wrote:
> [..]
>
> From a user interface PoV it's odd since you need to go and parse that anyway, at least the programs typically start out with a switch/case on either reading the skb->protocol or getting it via eth->h_proto. But then once you extend that same program to also cover IPv6, you don't need to do anything with the ETH_P_ALL from the loader application, but now you'd also need to additionally remember to downgrade ETH_P_IP to ETH_P_ALL and rebuild the loader to get v6 traffic.
>
> But even if you were to split things in the main/entry program to separate v4/v6 processing into two different ones, I expect this to be faster via tail calls (given direct absolute jump) instead of walking a list of tcf_proto objects, comparing the tp->protocol and going into a different cls_bpf instance.
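For concreteness, here is a minimal sketch of the pattern Daniel describes above: an entry program attached with ETH_P_ALL that switches on eth->h_proto and tail-calls into per-protocol programs. This is illustrative only (the map, slot and section names are made up), not code from this thread:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct {
        __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
        __uint(max_entries, 2);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(__u32));
} jmp_table SEC(".maps");

enum { SLOT_V4 = 0, SLOT_V6 = 1 };      /* loader installs v4/v6 progs here */

SEC("classifier")
int cls_entry(struct __sk_buff *skb)
{
        void *data     = (void *)(long)skb->data;
        void *data_end = (void *)(long)skb->data_end;
        struct ethhdr *eth = data;

        if (data + sizeof(*eth) > data_end)
                return TC_ACT_OK;

        switch (bpf_ntohs(eth->h_proto)) {
        case ETH_P_IP:
                bpf_tail_call(skb, &jmp_table, SLOT_V4);
                break;
        case ETH_P_IPV6:
                bpf_tail_call(skb, &jmp_table, SLOT_V6);
                break;
        }
        /* Unhandled protocols, or a missing tail-call target, pass through. */
        return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";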
Good point on being more future-proof with ETH_P_ALL.

Note: in our case we were only interested in IPv4, and I don't see that changing for the specific prog we have. From a compute perspective, all I am saving by not using ETH_P_ALL is one if statement (checking whether the proto is IPv4). If you feel strongly about it we can change our code. My worry now is that if we used this approach, someone else in the wild has likely used something similar.

I think it boils down again to: if it doesn't confuse the API or add extra complexity, why not allow it and default to ETH_P_ALL?
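To make that cost concrete, the one extra check an IPv4-only program needs once it is attached with ETH_P_ALL would look roughly like this (a sketch, not our actual code):

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("classifier")
int cls_ipv4_only(struct __sk_buff *skb)
{
        /* skb->protocol is in network byte order. */
        if (skb->protocol != bpf_htons(ETH_P_IP))
                return TC_ACT_OK;       /* leave non-IPv4 traffic alone */

        /* ... IPv4-only processing ... */
        return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";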
On your comment that a BPF-based proto comparison would be faster: the issue is that the tp->protocol comparison always happens regardless, and eBPF, depending on your program, may not fit all your code. For example, with the current mechanism I may actually decide to have separate programs for v4 and v6 if I wanted to - at different tc ruleset prios - just so as to work around code-size/complexity issues.
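As a sketch of what that split could look like (illustrative only - the object/section names and exact tc syntax are not from this thread), two small classifier programs loaded as separate cls_bpf instances behind the existing tp->protocol match:

/* e.g. (illustrative):
 *
 *   tc filter add dev eth0 ingress protocol ip   prio 10 bpf da obj cls.o sec classifier/v4
 *   tc filter add dev eth0 ingress protocol ipv6 prio 20 bpf da obj cls.o sec classifier/v6
 */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

SEC("classifier/v4")
int cls_v4(struct __sk_buff *skb)
{
        /* IPv4-only logic; the protocol match guarantees ETH_P_IP here. */
        return TC_ACT_OK;
}

SEC("classifier/v6")
int cls_v6(struct __sk_buff *skb)
{
        /* IPv6-only logic lives in its own program, keeping each one small. */
        return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";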
BTW: the tail call limit of 32 provides an upper bound which affects the depth of (generic) parsing. Does it make sense to allow increasing that limit (maybe on a per-boot basis)? The fact that things run on the stack may be the restricting factor.
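To illustrate why the limit matters for parse depth (again a sketch, not from this thread): a generic parser can peel one encapsulation layer per program and tail-call back into itself for the next one, carrying its offset in skb->cb[]; the chain then stops at the tail-call limit no matter how deep the headers nest. The map name and the elided per-layer parsing are assumptions.

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

/* The loader installs parse_one_layer itself at slot 0. */
struct {
        __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
        __uint(max_entries, 1);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(__u32));
} self SEC(".maps");

SEC("classifier")
int parse_one_layer(struct __sk_buff *skb)
{
        __u32 off = skb->cb[0];   /* offset of the layer to look at */
        int more = 0;

        /* ... bounds-check and parse exactly one header at 'off';
         * set 'more' and advance 'off' if another layer follows ... */

        if (more) {
                skb->cb[0] = off;
                bpf_tail_call(skb, &self, 0);
                /* Only reached once the tail-call limit is exhausted:
                 * deeper layers simply go unparsed. */
        }
        return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";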
> It may be more tricky but not impossible either, in recent years some (imho) very interesting and exciting use cases have been implemented and talked about e.g. [0-2], and with the recent linker work there could also be a [e.g. in-kernel] collection with library code that can be pulled in by others aside from using them as BPF selftests as one option. The gain you have with the flexibility [as you know] is that it allows easy integration/orchestration into user space applications and thus suitable for more dynamic envs as with old-style actions.
>
> The issue I have with the latter is that they're not scalable enough from a SW datapath / tc fast-path perspective given you then need to fallback to old-style list processing of cls+act combinations which is also not covered / in scope for the libbpf API in terms of their setup, and additionally not all of the BPF features can be used this way either, so it'll be very hard for users to debug why their BPF programs don't work as they're expected to. But also aside from those blockers, the case with this clean slate tc BPF API is that we have a unique chance to overcome the cmdline usability struggles, and make it as straight forward as possible for new generation of users.
>
> [0] https://linuxplumbersconf.org/event/7/contributions/677/
> [1] https://linuxplumbersconf.org/event/2/contributions/121/
> [2] https://netdevconf.info/0x14/session.html?talk-replacing-HTB-with-EDT-and-BPF
I took a quick glance at the refs. IIUC, your message is "do more with less", i.e. restrict choices now so we can focus on optimizing for speed. Here's my experience.
We have two pragmatic challenges:
1) In a deployment, like some enterprise-class data centers, we are often limited by the kernel and often even by the distro we are on. You can't just upgrade to the latest and greatest without risking voiding the distro vendor's support contract. Big shops with a lot of geniuses like FB and Google don't have these problems of course - but the majority out there do.
So even our little program must use supported interfaces to be accepted (ex: you can't expect support on RH8.3 for an XDP issue without using the supplied XDP lib).
So building in support to use existing infra is useful.
2) Challenges with eBPF code space and code complexity: depending on the complexity, even a program with fewer than 4K instructions may be rejected by the verifier. IOW, I just can't add all the features I need _even if I wanted to_.
For this reason, working cooperatively with other existing kernel and user infra makes sense (ref [2] is doing that, for example). You don't want to rewrite the kernel using eBPF; extending the kernel with eBPF makes sense. And of course I don't want to lose performance, but there may sometimes be a trade-off where a little loss in performance is justified by the gain of a feature (the non-da example applies).
Perhaps adding more helpers to interface with the existing actions and classifiers is one way forward.
cheers,
jamal
PS: I didn't understand the kernel linker point with the BPF selftests. Pointer?