Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs

Toke Høiland-Jørgensen <toke@xxxxxxxxxx> · Fri, 07 Oct 2022 16:32:57 +0200

Daniel Borkmann <daniel@xxxxxxxxxxxxx> writes:

> On 10/7/22 1:28 AM, Alexei Starovoitov wrote:
>> On Thu, Oct 6, 2022 at 2:29 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>>> On 10/6/22 7:00 AM, Alexei Starovoitov wrote:
>>>> On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote:
>>> [...]
>>>>
>>>> I cannot help but feel that prio logic copy-paste from old tc, netfilter and friends
>>>> is done because "that's how things were done in the past".
>>>> imo it was a well intentioned mistake and all networking things (tc, netfilter, etc)
>>>> copy-pasted that cumbersome and hard to use concept.
>>>> Let's throw away that baggage?
>>>> In a good set of cases the bpf prog inserter cares whether the prog is first or not.
>>>> Since the first prog returning anything but TC_NEXT will be final.
>>>> I think prog insertion flags: 'I want to run first' vs 'I don't care about order'
>>>> is good enough in practice. Any complex scheme should probably be programmable
>>>> as any policy should. For example in Meta we have 'xdp chainer' logic that is similar
>>>> to libxdp chaining, but we added a feature that allows a prog to jump over another
>>>> prog and continue the chain. Priority concept cannot express that.
>>>> Since we'd have to add some "policy program" anyway for use cases like this
>>>> let's keep things as simple as possible?
>>>> Then maybe we can adopt this "as-simple-as-possible" to XDP hooks ?
>>>> And allow bpf progs chaining in the kernel with "run_me_first" vs "run_me_anywhere"
>>>> in both tcx and xdp ?
>>>> Naturally "run_me_first" prog will be the only one. No need for F_REPLACE flags, etc.
>>>> The owner of "run_me_first" will update its prog through bpf_link_update.
>>>> "run_me_anywhere" will add to the end of the chain.
>>>> In XDP for compatibility reasons "run_me_first" will be the default.
>>>> Since only one prog can be enqueued with such flag it will match existing single prog behavior.
>>>> Well behaving progs (like xdp-tcpdump or monitoring progs) will use "run_me_anywhere".
>>>> I know it's far from covering plenty of cases that we've discussed for long time,
>>>> but prio concept isn't really covering them either.
>>>> We've struggled enough with single xdp prog, so certainly not advocating for that.
>>>> Another alternative is to do: "queue_at_head" vs "queue_at_tail". Just as simple.
>>>> Both simple versions have their pros and cons and don't cover everything,
>>>> but imo both are better than prio.
>>>
>>> Yeah, it's kind of tricky, imho. The 'run_me_first' vs 'run_me_anywhere' are two
>>> use cases that should be covered (and actually we kind of do this in this set, too,
>>> with the prios via prio=x vs prio=0). Given users will only be consuming the APIs
>>> via libs like libbpf, this can also be abstracted this way w/o users having to be
>>> aware of prios.
>> 
>> but the patchset tells different story.
>> Prio gets exposed everywhere in uapi all the way to bpftool
>> when it's right there for users to understand.
>> And that's the main problem with it.
>> Users don't want to and don't need to be aware of it,
>> but uapi forces them to pick the priority.
>> 
>>> Anyway, where it gets tricky would be when things depend on ordering,
>>> e.g. you have BPF progs doing: policy, monitoring, lb, monitoring, encryption, which
>>> would be sth you can build today via tc BPF: so policy one acts as a prefilter for
>>> various cidr ranges that should be blocked no matter what, then monitoring to sample
>>> what goes into the lb, then lb itself which does snat/dnat, then monitoring to see what
>>> the corresponding pkt looks that goes to backend, and maybe encryption to e.g. send
>>> the result to wireguard dev, so it's encrypted from lb node to backend.
>> 
>> That's all theory. Your cover letter example proves that in
>> real life different services pick the same priority.
>> They simply don't know any better.
>> prio is an unnecessary magic that apps _have_ to pick,
>> so they just copy-paste and everyone ends up using the same.
>> 
>>> For such
>>> example, you'd need prios as the 'run_me_anywhere' doesn't guarantee order, so there's
>>> a case for both scenarios (concrete layout vs loose one), and for latter we could
>>> start off with an internal prio around x (e.g. 16k), so there's room to attach in
>>> front via fixed prio, but also append to end for 'don't care', and that could be
>>> from lib pov the default/main API whereas prio would be some kind of extended one.
>>> Thoughts?
>> 
>> If prio was not part of uapi, like kernel internal somehow,
>> and there was a user space daemon, systemd, or another bpf prog,
>> module, whatever that users would interface to then
>> the proposed implementation of prio would totally make sense.
>> prio as uapi is not that.
>
> A good analogy to this issue might be systemd's unit files.. you specify dependencies
> for your own <unit> file via 'Wants=<unitA>', and ordering via 'Before=<unitB>' and
> 'After=<unitC>' and they refer to other unit files. I think that is generally okay,
> you don't deal with prio numbers, but rather some kind of textual representation. However
> user/operator will have to deal with dependencies/ordering one way or another, the
> problem here is that we deal with kernel and loader talks to kernel directly so it
> has no awareness of what else is running or could be running, so apps need to deal
> with it somehow (and it cannot without external help).

I was thinking a little about how this might work; i.e., how can the
kernel expose the required knobs to allow a system policy to be
implemented without program loading having to talk to anything other
than the syscall API?

How about we only expose prepend/append in the prog attach UAPI, and
then have a kernel function that does the sorting like:

int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs, struct bpf_prog *new_prog, bool append)

where the default implementation just appends/prepends to the array in
progs depending on the value of 'append'.
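
To make the default concrete, here is a rough user-space sketch of the
append/prepend behaviour. It is not kernel code: struct bpf_prog is a
stand-in, MAX_TCX_PROGS and the name add_new_tcx_prog_default are made
up for illustration, and I am assuming 'progs' is an array of pointers
with spare capacity and that the function returns the new count.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the kernel's struct bpf_prog so the sketch compiles in user space. */
struct bpf_prog {
	int id;
};

#define MAX_TCX_PROGS 8 /* hypothetical capacity of the progs array */

/*
 * Default ordering: place new_prog at the tail (append) or the head
 * (prepend) of progs, which currently holds num_progs entries.
 * Returns the new number of entries, or -ENOSPC if the array is full.
 */
int add_new_tcx_prog_default(struct bpf_prog **progs, size_t num_progs,
			     struct bpf_prog *new_prog, bool append)
{
	if (num_progs >= MAX_TCX_PROGS)
		return -ENOSPC;

	if (append) {
		progs[num_progs] = new_prog;
	} else {
		/* Shift existing entries up by one slot, then take slot 0. */
		memmove(&progs[1], &progs[0], num_progs * sizeof(*progs));
		progs[0] = new_prog;
	}

	return (int)(num_progs + 1);
}
```

With this default, appending A, appending B and then prepending C
yields the run order C, A, B.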

And then use the __weak linking trick (or maybe struct_ops with a member
for TCX, another for XDP, etc?) to allow BPF to override the function
wholesale and implement whatever ordering it wants? I.e., allow it to
just shift around the order of progs in the 'progs' array whenever a
program is loaded/unloaded?
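
As a rough illustration of the struct_ops-style variant, here is a
user-space sketch with a plain table of function pointers, one member
per hook, where a policy replaces the default and reorders the array.
Everything here is hypothetical (struct prog_order_ops, the function
names, the "pin prog id 1 to the front" policy); it only models the
idea and is not how struct_ops is actually wired up in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the kernel's struct bpf_prog. */
struct bpf_prog {
	int id;
};

typedef int (*add_prog_fn)(struct bpf_prog **progs, size_t num_progs,
			   struct bpf_prog *new_prog, bool append);

/* Hypothetical ops table: one overridable ordering hook per attach point. */
struct prog_order_ops {
	add_prog_fn add_tcx;
	add_prog_fn add_xdp;
};

/* Default: plain append/prepend. The caller guarantees one free slot. */
int default_add(struct bpf_prog **progs, size_t num_progs,
		struct bpf_prog *new_prog, bool append)
{
	if (append) {
		progs[num_progs] = new_prog;
	} else {
		memmove(&progs[1], &progs[0], num_progs * sizeof(*progs));
		progs[0] = new_prog;
	}
	return (int)(num_progs + 1);
}

#define PINNED_FIRST_ID 1 /* hypothetical: this prog must always run first */

/* Example policy: insert as usual, then move the pinned prog back to slot 0. */
int pinned_first_add(struct bpf_prog **progs, size_t num_progs,
		     struct bpf_prog *new_prog, bool append)
{
	int n = default_add(progs, num_progs, new_prog, append);

	for (int i = 1; i < n; i++) {
		if (progs[i]->id == PINNED_FIRST_ID) {
			struct bpf_prog *pinned = progs[i];

			/* Slide entries 0..i-1 up by one and reinsert at the head. */
			memmove(&progs[1], &progs[0], (size_t)i * sizeof(*progs));
			progs[0] = pinned;
			break;
		}
	}
	return n;
}

/* The table starts out with the default for both hooks. */
struct prog_order_ops order_ops = {
	.add_tcx = default_add,
	.add_xdp = default_add,
};

/* What a policy daemon would do, conceptually: override only the TCX hook. */
void install_pinned_first_policy(void)
{
	order_ops.add_tcx = pinned_first_add;
}
```

The attach path would then always call order_ops.add_tcx() and never
needs to know which policy, if any, is installed.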

This way, a userspace daemon can implement any policy it wants by just
attaching to that hook, and things like how to express dependencies
remain a userspace concern?

-Toke