On Fri, Oct 7, 2022 at 10:20 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>
> sdf@xxxxxxxxxx writes:
>
> > On 10/07, Toke Høiland-Jørgensen wrote:
> >> Daniel Borkmann <daniel@xxxxxxxxxxxxx> writes:
> >
> >> > On 10/7/22 1:28 AM, Alexei Starovoitov wrote:
> >> >> On Thu, Oct 6, 2022 at 2:29 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
> >> >>> On 10/6/22 7:00 AM, Alexei Starovoitov wrote:
> >> >>>> On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote:
> >> >>> [...]
> >> >>>>
> >> >>>> I cannot help but feel that the prio logic copy-paste from old tc,
> >> >>>> netfilter and friends is done because "that's how things were done
> >> >>>> in the past". imo it was a well intentioned mistake and all
> >> >>>> networking things (tc, netfilter, etc) copy-pasted that cumbersome
> >> >>>> and hard to use concept. Let's throw away that baggage?
> >> >>>> In a good set of cases the bpf prog inserter cares whether the
> >> >>>> prog is first or not, since the first prog returning anything but
> >> >>>> TC_NEXT will be final. I think prog insertion flags: 'I want to
> >> >>>> run first' vs 'I don't care about order' is good enough in
> >> >>>> practice. Any complex scheme should probably be programmable, as
> >> >>>> any policy should. For example in Meta we have 'xdp chainer' logic
> >> >>>> that is similar to libxdp chaining, but we added a feature that
> >> >>>> allows a prog to jump over another prog and continue the chain.
> >> >>>> The priority concept cannot express that. Since we'd have to add
> >> >>>> some "policy program" anyway for use cases like this, let's keep
> >> >>>> things as simple as possible? Then maybe we can adopt this
> >> >>>> "as-simple-as-possible" for XDP hooks too, and allow bpf prog
> >> >>>> chaining in the kernel with "run_me_first" vs "run_me_anywhere"
> >> >>>> in both tcx and xdp? Naturally the "run_me_first" prog will be
> >> >>>> the only one; no need for F_REPLACE flags, etc. The owner of
> >> >>>> "run_me_first" will update its prog through bpf_link_update.
> >> >>>> "run_me_anywhere" will add to the end of the chain. In XDP, for
> >> >>>> compatibility reasons, "run_me_first" will be the default; since
> >> >>>> only one prog can be enqueued with such a flag, it will match the
> >> >>>> existing single-prog behavior. Well-behaving progs (like
> >> >>>> xdp-tcpdump or monitoring progs) will use "run_me_anywhere".
> >> >>>> I know it's far from covering plenty of the cases that we've
> >> >>>> discussed for a long time, but the prio concept isn't really
> >> >>>> covering them either. We've struggled enough with a single xdp
> >> >>>> prog, so I'm certainly not advocating for that. Another
> >> >>>> alternative is to do "queue_at_head" vs "queue_at_tail". Just as
> >> >>>> simple. Both simple versions have their pros and cons and don't
> >> >>>> cover everything, but imo both are better than prio.
> >> >>>
> >> >>> Yeah, it's kind of tricky, imho. The 'run_me_first' vs
> >> >>> 'run_me_anywhere' are two use cases that should be covered (and we
> >> >>> actually kind of do this in this set, too, with the prios via
> >> >>> prio=x vs prio=0). Given users will only be consuming the APIs via
> >> >>> libs like libbpf, this can also be abstracted this way w/o users
> >> >>> having to be aware of prios.
> >> >>
> >> >> but the patchset tells a different story.
> >> >> Prio gets exposed everywhere in uapi, all the way to bpftool,
> >> >> when it's right there for users to understand.
> >> >> And that's the main problem with it.
> >> >> The user doesn't want to and doesn't need to be aware of it,
> >> >> but uapi forces them to pick the priority.
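[Aside, to make the quoted "run_me_first" / "run_me_anywhere" idea above
concrete from the loader side: the sketch below is illustration only,
nothing here is existing UAPI, and all flag names are made up.]

    /* Hypothetical attach flags -- invented names, not real UAPI: */
    #define BPF_F_RUN_FIRST     (1U << 0)  /* at most one per hook */
    #define BPF_F_RUN_ANYWHERE  (1U << 1)  /* appended at the tail */

    /* A well-behaved monitoring prog would then attach roughly like
     * (BPF_TCX_INGRESS is likewise assumed here, it doesn't exist yet):
     */
    LIBBPF_OPTS(bpf_link_create_opts, opts,
            .flags = BPF_F_RUN_ANYWHERE);
    int link_fd = bpf_link_create(prog_fd, ifindex,
                                  BPF_TCX_INGRESS, &opts);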
> >> >>>
> >> >>> Anyway, where it gets tricky would be when things depend on
> >> >>> ordering, e.g. you have BPF progs doing: policy, monitoring, lb,
> >> >>> monitoring, encryption, which would be sth you can build today via
> >> >>> tc BPF: so the policy one acts as a prefilter for various cidr
> >> >>> ranges that should be blocked no matter what, then monitoring to
> >> >>> sample what goes into the lb, then the lb itself which does
> >> >>> snat/dnat, then monitoring to see what the corresponding pkt that
> >> >>> goes to the backend looks like, and maybe encryption to e.g. send
> >> >>> the result to a wireguard dev, so it's encrypted from the lb node
> >> >>> to the backend.
> >> >>
> >> >> That's all theory. Your cover letter example proves that in real
> >> >> life different services pick the same priority. They simply don't
> >> >> know any better. prio is an unnecessary magic that apps _have_ to
> >> >> pick, so they just copy-paste and everyone ends up using the same.
> >> >>
> >> >>> For such an example, you'd need prios, as 'run_me_anywhere'
> >> >>> doesn't guarantee order, so there's a case for both scenarios (a
> >> >>> concrete layout vs a loose one), and for the latter we could start
> >> >>> off with an internal prio around x (e.g. 16k), so there's room to
> >> >>> attach in front via fixed prio, but also to append to the end for
> >> >>> 'don't care', and that could be, from the lib pov, the
> >> >>> default/main API whereas prio would be some kind of extended one.
> >> >>> Thoughts?
> >> >>
> >> >> If prio was not part of uapi, but kernel internal somehow, and
> >> >> there was a user space daemon, systemd, or another bpf prog,
> >> >> module, whatever, that users would interface with, then the
> >> >> proposed implementation of prio would totally make sense.
> >> >> prio as uapi is not that.
> >> >
> >> > A good analogy to this issue might be systemd's unit files... you
> >> > specify dependencies for your own <unit> file via 'Wants=<unitA>',
> >> > and ordering via 'Before=<unitB>' and 'After=<unitC>', and they
> >> > refer to other unit files. I think that is generally okay; you don't
> >> > deal with prio numbers, but rather some kind of textual
> >> > representation. However, the user/operator will have to deal with
> >> > dependencies/ordering one way or another. The problem here is that
> >> > we deal with the kernel, and the loader talks to the kernel
> >> > directly, so it has no awareness of what else is running or could be
> >> > running, so apps need to deal with it somehow (and they cannot
> >> > without external help).
> >
> >> I was thinking a little about how this might work; i.e., how can the
> >> kernel expose the required knobs to allow a system policy to be
> >> implemented without program loading having to talk to anything other
> >> than the syscall API?
> >
> >> How about we only expose prepend/append in the prog attach UAPI, and
> >> then have a kernel function that does the sorting like:
> >
> >> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs,
> >>                          struct bpf_prog *new_prog, bool append)
> >
> >> where the default implementation just appends/prepends to the array
> >> in progs depending on the value of 'append'.
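[For concreteness, here is a guess at what that weak default could look
like; the signature is the one quoted above (taken as an array of
pointers), while the body, the limit constant, and the error handling are
invented for illustration:]

    #include <linux/bpf.h>
    #include <linux/string.h>

    #define TCX_MAX_PROGS 64  /* hypothetical chain limit */

    __weak int bpf_add_new_tcx_prog(struct bpf_prog **progs,
                                    size_t num_progs,
                                    struct bpf_prog *new_prog, bool append)
    {
            if (num_progs >= TCX_MAX_PROGS)
                    return -ENOSPC;
            if (append) {
                    /* 'don't care about order': run at the end */
                    progs[num_progs] = new_prog;
            } else {
                    /* 'run first': shift existing progs down by one */
                    memmove(&progs[1], &progs[0],
                            num_progs * sizeof(*progs));
                    progs[0] = new_prog;
            }
            return 0;
    }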
> >> And then use the __weak linking trick (or maybe struct_ops with a
> >> member for tcx, another for XDP, etc?) to allow BPF to override the
> >> function wholesale and implement whatever ordering it wants? I.e.,
> >> allow it to just shift around the order of progs in the 'progs' array
> >> whenever a program is loaded/unloaded?
> >
> >> This way, a userspace daemon can implement any policy it wants by
> >> just attaching to that hook, and keeping things like how to express
> >> dependencies as a userspace concern?
> >
> > What if we do the above, but instead of a simple global 'attach
> > first/last', the default api would be:
> >
> > - attach before <target_fd>
> > - attach after <target_fd>
> > - attach before target_fd=-1 == first
> > - attach after target_fd=-1 == last
> >
> > ?
>
> Hmm, the problem with that is that applications don't generally have an
> fd to another application's BPF programs; and obtaining them from an ID
> is a privileged operation (CAP_SYS_ADMIN). We could have it be "attach
> before target *ID*" instead, which could work I guess? But then the
> problem becomes that it's racy: the ID you're targeting could get
> detached before you attach, so you'll need to be prepared to check that
> and retry; and I'm almost certain that applications won't test for
> this, so it'll just lead to hard-to-debug heisenbugs. Or am I being too
> pessimistic here?

I like Stan's proposal and don't see any issue with FD.
It's good to gate specific sequencing with cap_sys_admin.
Also, for consistency, the FD is better than the ID.

I also like the systemd analogy with Before=, After=.
systemd has a ton more ways to specify deps between units,
but none of them have absolute numbers (which is what priority is).

The only bit I'd tweak in Stan's proposal is:
- attach before <target_fd>
- attach after <target_fd>
- attach before target_fd=0 == first
- attach after target_fd=0 == last

The attach operation needs to be CAP_NET_ADMIN,
just like we do for BPF_PROG_TYPE_CGROUP_SKB.
And we can do the same logic for XDP attaching.

Eventually we can add a __weak "orchestrator prog", but it would need to
not only order progs, it should also interpret enum tc_action_base
return codes at run-time between progs.
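To illustrate the tweaked semantics (nothing below is existing UAPI; the
flag, the relative_fd field, and BPF_TCX_INGRESS are invented names for
what such an API might look like), attaching "after prog X, else at the
tail" from the loader side could be roughly:

    #include <bpf/bpf.h>
    #include <bpf/libbpf.h>

    /* Sketch only: attach prog_fd to the tcx chain of ifindex, ordered
     * after target_fd; target_fd == 0 means "append at the very end". */
    static int attach_after(int prog_fd, int ifindex, int target_fd)
    {
            LIBBPF_OPTS(bpf_link_create_opts, opts,
                    .flags = BPF_F_AFTER,           /* or BPF_F_BEFORE */
                    .tcx.relative_fd = target_fd);  /* 0 == last/first */

            return bpf_link_create(prog_fd, ifindex,
                                   BPF_TCX_INGRESS, &opts);
    }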