Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Sat, 8 Oct 2022 13:38:32 -0700

On Sat, Oct 08, 2022 at 01:38:54PM +0200, Toke Høiland-Jørgensen wrote:
> Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> writes:
> 
> > On Fri, Oct 7, 2022 at 12:37 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
> >>
> >> On 10/7/22 8:59 PM, Alexei Starovoitov wrote:
> >> > On Fri, Oct 7, 2022 at 10:20 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >> [...]
> >> >>>> I was thinking a little about how this might work; i.e., how can the
> >> >>>> kernel expose the required knobs to allow a system policy to be
> >> >>>> implemented without program loading having to talk to anything other
> >> >>>> than the syscall API?
> >> >>>
> >> >>>> How about we only expose prepend/append in the prog attach UAPI, and
> >> >>>> then have a kernel function that does the sorting like:
> >> >>>
> >> >>>> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs, struct
> >> >>>> bpf_prog *new_prog, bool append)
> >> >>>
> >> >>>> where the default implementation just appends/prepends to the array in
> >> >>>> progs depending on the value of 'appen'.
> >> >>>
> >> >>>> And then use the __weak linking trick (or maybe struct_ops with a member
> >> >>>> for TXC, another for XDP, etc?) to allow BPF to override the function
> >> >>>> wholesale and implement whatever ordering it wants? I.e., allow it can
> >> >>>> to just shift around the order of progs in the 'progs' array whenever a
> >> >>>> program is loaded/unloaded?
> >> >>>
> >> >>>> This way, a userspace daemon can implement any policy it wants by just
> >> >>>> attaching to that hook, and keeping things like how to express
> >> >>>> dependencies as a userspace concern?
> >> >>>
> >> >>> What if we do the above, but instead of simple global 'attach first/last',
> >> >>> the default api would be:
> >> >>>
> >> >>> - attach before <target_fd>
> >> >>> - attach after <target_fd>
> >> >>> - attach before target_fd=-1 == first
> >> >>> - attach after target_fd=-1 == last
> >> >>>
> >> >>> ?
> >> >>
> >> >> Hmm, the problem with that is that applications don't generally have an
> >> >> fd to another application's BPF programs; and obtaining them from an ID
> >> >> is a privileged operation (CAP_SYS_ADMIN). We could have it be "attach
> >> >> before target *ID*" instead, which could work I guess? But then the
> >> >> problem becomes that it's racy: the ID you're targeting could get
> >> >> detached before you attach, so you'll need to be prepared to check that
> >> >> and retry; and I'm almost certain that applications won't test for this,
> >> >> so it'll just lead to hard-to-debug heisenbugs. Or am I being too
> >> >> pessimistic here?
> >> >
> >> > I like Stan's proposal and don't see any issue with FD.
> >> > It's good to gate specific sequencing with cap_sys_admin.
> >> > Also for consistency the FD is better than ID.
> >> >
> >> > I also like systemd analogy with Before=, After=.
> >> > systemd has a ton more ways to specify deps between Units,
> >> > but none of them have absolute numbers (which is what priority is).
> >> > The only bit I'd tweak in Stan's proposal is:
> >> > - attach before <target_fd>
> >> > - attach after <target_fd>
> >> > - attach before target_fd=0 == first
> >> > - attach after target_fd=0 == last
> >>
> >> I think the before(), after() could work, but the target_fd I have my doubts
> >> that it will be practical. Maybe lets walk through a concrete real example. app_a
> >> and app_b shipped via container_a resp container_b. Both want to install tc BPF
> >> and we (operator/user) want to say that prog from app_b should only be inserted
> >> after the one from app_a, never run before; if no prog_a is installed, we ofc just
> >> run prog_b, but if prog_a is inserted, it must be before prog_b given the latter
> >> can only run after the former. How would we get to one anothers target fd? One
> >> could use the 0, but not if more programs sit before/after.
> >
> > I read your desired use case several times and probably still didn't get it.
> > Sounds like prog_b can just do after(fd=0) to become last.
> > And prog_a can do before(fd=0).
> > Whichever the order of attaching (a or b) these two will always
> > be in a->b order.
> 
> I agree that it's probably not feasible to have programs themselves
> coordinate between themselves except for "install me last/first" type
> semantics.
> 
> I.e., the "before/after target_fd" is useful for a single application
> that wants to install two programs in a certain order. Or for bpftool
> for manual/debugging work.

yep

> System-wide policy (which includes "two containers both using BPF") is
> going to need some kind of policy agent/daemon anyway. And the in-kernel
> function override is the only feasible way to do that.

yep

> > Since the first and any prog returning !TC_NEXT will abort
> > the chain we'd need __weak nop orchestrator prog to interpret
> > retval for anything to be useful.
> 
> If we also want the orchestrator to interpret return codes, that
> probably implies generating a BPF program that does the dispatching,
> right? (since the attachment is per-interface we can't reuse the same
> one). So maybe we do need to go the route of the (overridable) usermode
> helper that gets all the program FDs and generates a BPF dispatcher
> program? Or can we do this with a __weak function that emits bytecode
> inside the kernel without being unsafe?

hid-bpf, cgroup-rstat, netfilter-bpf are facing similar issue.
The __weak override with one prog is certainly limiting.
And every case needs different demux.
I think we need to generalize xdp dispatcher to address this.
For example, for the case:
__weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
                                     struct cgroup *parent, int cpu)
{
}

we can say that 1st argument to nop function will be used as
'demuxing entity'.
Sort of like if we had added a 'prog' pointer to 'struct cgroup',
but instead of burning 8 byte in every struct cgroup we can generate
'dispatcher asm' only for specific pointers.
In case of fuse-bpf that pointer will be a pointer to hid device and
demux will be done based on device. It can be an integer too.
The subsystem that defines __weak func can pick whatever int or pointer
as a first argument and dispatcher routine will generate code:
if (arg1 == constA) progA(arg1, arg2, ...);
else if (arg1 == constB) progB(arg1, arg2, ...);
...
else nop();

This way the 'nop' property of __weak is preserved until user space
passes (constA, progA) tuple to the kernel to generate dispatcher
for that __weak hook.

> Anyway, I'm OK with deferring the orchestrator mechanism and going with
> Stanislav's proposal as an initial API.

Great. Looks like we're converging :) Hope Daniel is ok with this direction.