Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Fri, 7 Oct 2022 16:34:58 -0700

On Fri, Oct 7, 2022 at 12:37 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>
> On 10/7/22 8:59 PM, Alexei Starovoitov wrote:
> > On Fri, Oct 7, 2022 at 10:20 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> [...]
> >>>> I was thinking a little about how this might work; i.e., how can the
> >>>> kernel expose the required knobs to allow a system policy to be
> >>>> implemented without program loading having to talk to anything other
> >>>> than the syscall API?
> >>>
> >>>> How about we only expose prepend/append in the prog attach UAPI, and
> >>>> then have a kernel function that does the sorting like:
> >>>
> >>>> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs, struct
> >>>> bpf_prog *new_prog, bool append)
> >>>
> >>>> where the default implementation just appends/prepends to the array in
> >>>> progs depending on the value of 'appen'.
> >>>
> >>>> And then use the __weak linking trick (or maybe struct_ops with a member
> >>>> for TXC, another for XDP, etc?) to allow BPF to override the function
> >>>> wholesale and implement whatever ordering it wants? I.e., allow it can
> >>>> to just shift around the order of progs in the 'progs' array whenever a
> >>>> program is loaded/unloaded?
> >>>
> >>>> This way, a userspace daemon can implement any policy it wants by just
> >>>> attaching to that hook, and keeping things like how to express
> >>>> dependencies as a userspace concern?
> >>>
> >>> What if we do the above, but instead of simple global 'attach first/last',
> >>> the default api would be:
> >>>
> >>> - attach before <target_fd>
> >>> - attach after <target_fd>
> >>> - attach before target_fd=-1 == first
> >>> - attach after target_fd=-1 == last
> >>>
> >>> ?
> >>
> >> Hmm, the problem with that is that applications don't generally have an
> >> fd to another application's BPF programs; and obtaining them from an ID
> >> is a privileged operation (CAP_SYS_ADMIN). We could have it be "attach
> >> before target *ID*" instead, which could work I guess? But then the
> >> problem becomes that it's racy: the ID you're targeting could get
> >> detached before you attach, so you'll need to be prepared to check that
> >> and retry; and I'm almost certain that applications won't test for this,
> >> so it'll just lead to hard-to-debug heisenbugs. Or am I being too
> >> pessimistic here?
> >
> > I like Stan's proposal and don't see any issue with FD.
> > It's good to gate specific sequencing with cap_sys_admin.
> > Also for consistency the FD is better than ID.
> >
> > I also like systemd analogy with Before=, After=.
> > systemd has a ton more ways to specify deps between Units,
> > but none of them have absolute numbers (which is what priority is).
> > The only bit I'd tweak in Stan's proposal is:
> > - attach before <target_fd>
> > - attach after <target_fd>
> > - attach before target_fd=0 == first
> > - attach after target_fd=0 == last
>
> I think the before(), after() could work, but the target_fd I have my doubts
> that it will be practical. Maybe lets walk through a concrete real example. app_a
> and app_b shipped via container_a resp container_b. Both want to install tc BPF
> and we (operator/user) want to say that prog from app_b should only be inserted
> after the one from app_a, never run before; if no prog_a is installed, we ofc just
> run prog_b, but if prog_a is inserted, it must be before prog_b given the latter
> can only run after the former. How would we get to one anothers target fd? One
> could use the 0, but not if more programs sit before/after.

I read your desired use case several times and probably still didn't get it.
Sounds like prog_b can just do after(fd=0) to become last.
And prog_a can do before(fd=0).
Whichever the order of attaching (a or b) these two will always
be in a->b order.
Are you saying that there should be no progs between them?
Sure, the daemon could iterate the hook progs, discover prog_id,
get its FD and do before(prog_fd).
The use case sounds hypothetical.
Since the first and any prog returning !TC_NEXT will abort
the chain we'd need __weak nop orchestrator prog to interpret
retval for anything to be useful.
With cgroup-skb we did fancy none/override/multi and what for?
As far as I can see everyone is using 'multi' and all progs are run.
If we did only 'multi' for cgroup it would be just as fine
and we would have avoided all the complexity in the kernel.
Hence I'm advocating for the simplest approach for tcx and xdp.