sdf@xxxxxxxxxx writes:

> On 10/07, Toke Høiland-Jørgensen wrote:
>> Daniel Borkmann <daniel@xxxxxxxxxxxxx> writes:
>
>> > On 10/7/22 1:28 AM, Alexei Starovoitov wrote:
>> >> On Thu, Oct 6, 2022 at 2:29 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>> >>> On 10/6/22 7:00 AM, Alexei Starovoitov wrote:
>> >>>> On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote:
>> >>> [...]
>> >>>>
>> >>>> I cannot help but feel that the prio logic copy-paste from old tc, netfilter and friends
>> >>>> is done because "that's how things were done in the past".
>> >>>> imo it was a well-intentioned mistake and all networking things (tc, netfilter, etc)
>> >>>> copy-pasted that cumbersome and hard to use concept.
>> >>>> Let's throw away that baggage?
>> >>>> In a good set of cases the bpf prog inserter cares whether the prog is first or not,
>> >>>> since the first prog returning anything but TC_NEXT will be final.
>> >>>> I think prog insertion flags: 'I want to run first' vs 'I don't care about order'
>> >>>> is good enough in practice. Any complex scheme should probably be programmable,
>> >>>> as any policy should. For example, in Meta we have 'xdp chainer' logic that is similar
>> >>>> to libxdp chaining, but we added a feature that allows a prog to jump over another
>> >>>> prog and continue the chain. The priority concept cannot express that.
>> >>>> Since we'd have to add some "policy program" anyway for use cases like this,
>> >>>> let's keep things as simple as possible?
>> >>>> Then maybe we can adopt this "as-simple-as-possible" for XDP hooks too?
>> >>>> And allow bpf prog chaining in the kernel with "run_me_first" vs "run_me_anywhere"
>> >>>> in both tcx and xdp?
>> >>>> Naturally the "run_me_first" prog will be the only one. No need for F_REPLACE flags, etc.
>> >>>> The owner of "run_me_first" will update its prog through bpf_link_update.
>> >>>> "run_me_anywhere" will add to the end of the chain.
>> >>>> In XDP, for compatibility reasons, "run_me_first" will be the default.
>> >>>> Since only one prog can be enqueued with such a flag, it will match the existing single-prog behavior.
>> >>>> Well-behaved progs (like xdp-tcpdump or monitoring progs) will use "run_me_anywhere".
>> >>>> I know it's far from covering plenty of the cases that we've discussed for a long time,
>> >>>> but the prio concept isn't really covering them either.
>> >>>> We've struggled enough with a single xdp prog, so certainly not advocating for that.
>> >>>> Another alternative is to do "queue_at_head" vs "queue_at_tail". Just as simple.
>> >>>> Both simple versions have their pros and cons and don't cover everything,
>> >>>> but imo both are better than prio.
>> >>>
>> >>> Yeah, it's kind of tricky, imho. The 'run_me_first' vs 'run_me_anywhere' are two
>> >>> use cases that should be covered (and actually we kind of do this in this set, too,
>> >>> with the prios via prio=x vs prio=0). Given users will only be consuming the APIs
>> >>> via libs like libbpf, this can also be abstracted this way w/o users having to be
>> >>> aware of prios.
>> >>
>> >> but the patchset tells a different story.
>> >> Prio gets exposed everywhere in uapi, all the way to bpftool,
>> >> when it's right there for users to understand.
>> >> And that's the main problem with it.
>> >> The user doesn't want to and doesn't need to be aware of it,
>> >> but the uapi forces them to pick a priority.
>> >>
>> >>> Anyway, where it gets tricky would be when things depend on ordering, e.g.
>> >>> you have BPF progs doing: policy, monitoring, lb, monitoring, encryption, which
>> >>> would be sth you can build today via tc BPF: so the policy one acts as a prefilter for
>> >>> various cidr ranges that should be blocked no matter what, then monitoring to sample
>> >>> what goes into the lb, then the lb itself which does snat/dnat, then monitoring to see
>> >>> what the corresponding pkt that goes to the backend looks like, and maybe encryption to
>> >>> e.g. send the result to a wireguard dev, so it's encrypted from the lb node to the backend.
>> >>
>> >> That's all theory. Your cover letter example proves that in
>> >> real life different services pick the same priority.
>> >> They simply don't know any better.
>> >> prio is an unnecessary magic that apps _have_ to pick,
>> >> so they just copy-paste and everyone ends up using the same.
>> >>
>> >>> For such an
>> >>> example, you'd need prios, as 'run_me_anywhere' doesn't guarantee order, so there's
>> >>> a case for both scenarios (a concrete layout vs a loose one), and for the latter we could
>> >>> start off with an internal prio around x (e.g. 16k), so there's room to attach in
>> >>> front via fixed prio, but also append to the end for 'don't care', and that could be,
>> >>> from a lib pov, the default/main API whereas prio would be some kind of extended one.
>> >>> Thoughts?
>> >>
>> >> If prio was not part of uapi, but kernel-internal somehow,
>> >> and there was a user space daemon, systemd, or another bpf prog,
>> >> module, whatever, that users would interface with, then
>> >> the proposed implementation of prio would totally make sense.
>> >> prio as uapi is not that.
>> >
>> > A good analogy to this issue might be systemd's unit files.. you specify dependencies
>> > for your own <unit> file via 'Wants=<unitA>', and ordering via 'Before=<unitB>' and
>> > 'After=<unitC>', and they refer to other unit files. I think that is generally okay,
>> > you don't deal with prio numbers, but rather some kind of textual representation. However,
>> > the user/operator will have to deal with dependencies/ordering one way or another; the
>> > problem here is that we deal with the kernel, and the loader talks to the kernel directly,
>> > so it has no awareness of what else is running or could be running, so apps need to deal
>> > with it somehow (and they cannot without external help).
>
>> I was thinking a little about how this might work; i.e., how can the
>> kernel expose the required knobs to allow a system policy to be
>> implemented without program loading having to talk to anything other
>> than the syscall API?
>
>> How about we only expose prepend/append in the prog attach UAPI, and
>> then have a kernel function that does the sorting, like:
>
>> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs,
>>                          struct bpf_prog *new_prog, bool append)
>
>> where the default implementation just appends/prepends to the array in
>> progs depending on the value of 'append'.
>
>> And then use the __weak linking trick (or maybe struct_ops with a member
>> for TCX, another for XDP, etc?) to allow BPF to override the function
>> wholesale and implement whatever ordering it wants? I.e., allow it
>> to just shift around the order of progs in the 'progs' array whenever a
>> program is loaded/unloaded?
>
>> This way, a userspace daemon can implement any policy it wants by just
>> attaching to that hook, and keeping things like how to express
>> dependencies as a userspace concern?
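
(To make that last part a bit more concrete, here is a rough sketch of what
the default, overridable implementation could look like. It's purely an
illustration: the array-of-prog-pointers layout and the assumption that the
caller has already reserved room for one more entry are mine, not anything
from the patchset.)

#include <linux/filter.h>   /* struct bpf_prog */
#include <linux/string.h>   /* memmove() */

/* Default policy: a new prog either goes to the head or the tail of the
 * chain. A policy agent (a BPF prog via struct_ops, a module, ...) could
 * override this __weak default and reorder the array however it likes.
 */
__weak int bpf_add_new_tcx_prog(struct bpf_prog **progs, size_t num_progs,
				struct bpf_prog *new_prog, bool append)
{
	if (append) {
		/* "don't care about order": append to the end of the chain */
		progs[num_progs] = new_prog;
	} else {
		/* "run me first": shift the existing chain down one slot */
		memmove(&progs[1], &progs[0], num_progs * sizeof(*progs));
		progs[0] = new_prog;
	}
	return 0;
}

The point being that the attach UAPI itself never grows beyond
prepend/append; anything smarter lives entirely in whatever overrides this
function.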
>
> What if we do the above, but instead of a simple global 'attach first/last',
> the default api would be:
>
> - attach before <target_fd>
> - attach after <target_fd>
> - attach before target_fd=-1 == first
> - attach after target_fd=-1 == last
>
> ?

Hmm, the problem with that is that applications don't generally have an fd
to another application's BPF programs; and obtaining one from an ID is a
privileged operation (CAP_SYS_ADMIN).

We could have it be "attach before target *ID*" instead, which could work,
I guess? But then the problem becomes that it's racy: the ID you're
targeting could get detached before you attach, so you'll need to be
prepared to check for that and retry; and I'm almost certain that
applications won't test for this, so it'll just lead to hard-to-debug
heisenbugs.

Or am I being too pessimistic here?

-Toke
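
P.S. To make the raciness concrete, this is roughly the retry dance every
application would need to get right with an "attach before this program ID"
style API. Both helpers below are made up, purely stand-ins for whatever the
eventual syscall/libbpf wrappers would be:

#include <errno.h>
#include <linux/types.h>

/* Hypothetical wrappers, nothing like this exists today: look up the prog
 * ID we want to attach in front of (0 meaning "no target, just append"),
 * and attach relative to that ID.
 */
extern __u32 find_target_prog_id(int ifindex);
extern int tcx_attach_before_id(int prog_fd, int ifindex, __u32 target_id);

static int attach_with_retry(int prog_fd, int ifindex)
{
	int err;

	do {
		__u32 target_id = find_target_prog_id(ifindex);

		err = tcx_attach_before_id(prog_fd, ifindex, target_id);
		/* The target can be detached between the lookup and the
		 * attach, so -ENOENT has to mean "look again", not "give
		 * up"; this is exactly the part applications will forget.
		 */
	} while (err == -ENOENT);

	return err;
}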