Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs

sdf@xxxxxxxxxx · Fri, 7 Oct 2022 09:55:33 -0700

On 10/07, Toke H�iland-J�rgensen wrote:
Daniel Borkmann <daniel@xxxxxxxxxxxxx> writes:

> On 10/7/22 1:28 AM, Alexei Starovoitov wrote:
>> On Thu, Oct 6, 2022 at 2:29 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx>  
wrote:
>>> On 10/6/22 7:00 AM, Alexei Starovoitov wrote:
>>>> On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote:
>>> [...]
>>>>
>>>> I cannot help but feel that prio logic copy-paste from old tc,  
netfilter and friends
>>>> is done because "that's how things were done in the past".
>>>> imo it was a well intentioned mistake and all networking things (tc,  
netfilter, etc)
>>>> copy-pasted that cumbersome and hard to use concept.
>>>> Let's throw away that baggage?
>>>> In good set of cases the bpf prog inserter cares whether the prog is  
first or not.
>>>> Since the first prog returning anything but TC_NEXT will be final.
>>>> I think prog insertion flags: 'I want to run first' vs 'I don't care  
about order'
>>>> is good enough in practice. Any complex scheme should probably be  
programmable
>>>> as any policy should. For example in Meta we have 'xdp chainer'  
logic that is similar
>>>> to libxdp chaining, but we added a feature that allows a prog to  
jump over another
>>>> prog and continue the chain. Priority concept cannot express that.
>>>> Since we'd have to add some "policy program" anyway for use cases  
like this
>>>> let's keep things as simple as possible?
>>>> Then maybe we can adopt this "as-simple-as-possible" to XDP hooks ?
>>>> And allow bpf progs chaining in the kernel with "run_me_first"  
vs "run_me_anywhere"
>>>> in both tcx and xdp ?
>>>> Naturally "run_me_first" prog will be the only one. No need for  
F_REPLACE flags, etc.
>>>> The owner of "run_me_first" will update its prog through  
bpf_link_update.
>>>> "run_me_anywhere" will add to the end of the chain.
>>>> In XDP for compatibility reasons "run_me_first" will be the default.
>>>> Since only one prog can be enqueued with such flag it will match  
existing single prog behavior.
>>>> Well behaving progs will use (like xdp-tcpdump or monitoring progs)  
will use "run_me_anywhere".
>>>> I know it's far from covering plenty of cases that we've discussed  
for long time,
>>>> but prio concept isn't really covering them either.
>>>> We've struggled enough with single xdp prog, so certainly not  
advocating for that.
>>>> Another alternative is to do: "queue_at_head" vs "queue_at_tail".  
Just as simple.
>>>> Both simple versions have their pros and cons and don't cover  
everything,
>>>> but imo both are better than prio.
>>>
>>> Yeah, it's kind of tricky, imho. The 'run_me_first'  
vs 'run_me_anywhere' are two
>>> use cases that should be covered (and actually we kind of do this in  
this set, too,
>>> with the prios via prio=x vs prio=0). Given users will only be  
consuming the APIs
>>> via libs like libbpf, this can also be abstracted this way w/o users  
having to be
>>> aware of prios.
>>
>> but the patchset tells different story.
>> Prio gets exposed everywhere in uapi all the way to bpftool
>> when it's right there for users to understand.
>> And that's the main problem with it.
>> The user don't want to and don't need to be aware of it,
>> but uapi forces them to pick the priority.
>>
>>> Anyway, where it gets tricky would be when things depend on ordering,
>>> e.g. you have BPF progs doing: policy, monitoring, lb, monitoring,  
encryption, which
>>> would be sth you can build today via tc BPF: so policy one acts as a  
prefilter for
>>> various cidr ranges that should be blocked no matter what, then  
monitoring to sample
>>> what goes into the lb, then lb itself which does snat/dnat, then  
monitoring to see what
>>> the corresponding pkt looks that goes to backend, and maybe  
encryption to e.g. send
>>> the result to wireguard dev, so it's encrypted from lb node to  
backend.
>>
>> That's all theory. Your cover letter example proves that in
>> real life different service pick the same priority.
>> They simply don't know any better.
>> prio is an unnecessary magic that apps _have_ to pick,
>> so they just copy-paste and everyone ends up using the same.
>>
>>> For such
>>> example, you'd need prios as the 'run_me_anywhere' doesn't guarantee  
order, so there's
>>> a case for both scenarios (concrete layout vs loose one), and for  
latter we could
>>> start off with and internal prio around x (e.g. 16k), so there's room  
to attach in
>>> front via fixed prio, but also append to end for 'don't care', and  
that could be
>>> from lib pov the default/main API whereas prio would be some kind of  
extended one.
>>> Thoughts?
>>
>> If prio was not part of uapi, like kernel internal somehow,
>> and there was a user space daemon, systemd, or another bpf prog,
>> module, whatever that users would interface to then
>> the proposed implementation of prio would totally make sense.
>> prio as uapi is not that.
>
> A good analogy to this issue might be systemd's unit files.. you  
specify dependencies
> for your own <unit> file via 'Wants=<unitA>', and ordering  
via 'Before=<unitB>' and
> 'After=<unitC>' and they refer to other unit files. I think that is  
generally okay,
> you don't deal with prio numbers, but rather some kind textual  
representation. However
> user/operator will have to deal with dependencies/ordering one way or  
another, the
> problem here is that we deal with kernel and loader talks to kernel  
directly so it
> has no awareness of what else is running or could be running, so apps  
needs to deal
> with it somehow (and it cannot without external help).

I was thinking a little about how this might work; i.e., how can the
kernel expose the required knobs to allow a system policy to be
implemented without program loading having to talk to anything other
than the syscall API?

How about we only expose prepend/append in the prog attach UAPI, and
then have a kernel function that does the sorting like:

int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs, struct  
bpf_prog *new_prog, bool append)

where the default implementation just appends/prepends to the array in
progs depending on the value of 'appen'.

And then use the __weak linking trick (or maybe struct_ops with a member
for TXC, another for XDP, etc?) to allow BPF to override the function
wholesale and implement whatever ordering it wants? I.e., allow it can
to just shift around the order of progs in the 'progs' array whenever a
program is loaded/unloaded?

This way, a userspace daemon can implement any policy it wants by just
attaching to that hook, and keeping things like how to express
dependencies as a userspace concern?

What if we do the above, but instead of simple global 'attach first/last',
the default api would be:

- attach before <target_fd>
- attach after <target_fd>
- attach before target_fd=-1 == first
- attach after target_fd=-1 == last

?

That might be flexible enough by default to allow users to
append/prepend to any existing program in the chain (say, for
monitoring). Flexible enough for some central daemons to do
systemd-style policy. And, with bpf_add_new_tcx_prog, flexible
enough to implement any policy?