On Sat, Oct 08, 2022 at 01:38:54PM +0200, Toke Høiland-Jørgensen wrote:
> Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> writes:
>
> > On Fri, Oct 7, 2022 at 12:37 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
> >>
> >> On 10/7/22 8:59 PM, Alexei Starovoitov wrote:
> >> > On Fri, Oct 7, 2022 at 10:20 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >> [...]
> >> >>>> I was thinking a little about how this might work; i.e., how can the
> >> >>>> kernel expose the required knobs to allow a system policy to be
> >> >>>> implemented without program loading having to talk to anything other
> >> >>>> than the syscall API?
> >> >>>
> >> >>>> How about we only expose prepend/append in the prog attach UAPI, and
> >> >>>> then have a kernel function that does the sorting like:
> >> >>>
> >> >>>> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs, struct
> >> >>>> bpf_prog *new_prog, bool append)
> >> >>>
> >> >>>> where the default implementation just appends/prepends to the array in
> >> >>>> progs depending on the value of 'append'.
> >> >>>
> >> >>>> And then use the __weak linking trick (or maybe struct_ops with a member
> >> >>>> for TCX, another for XDP, etc?) to allow BPF to override the function
> >> >>>> wholesale and implement whatever ordering it wants? I.e., allow it
> >> >>>> to just shift around the order of progs in the 'progs' array whenever a
> >> >>>> program is loaded/unloaded?
> >> >>>
> >> >>>> This way, a userspace daemon can implement any policy it wants by just
> >> >>>> attaching to that hook, and keeping things like how to express
> >> >>>> dependencies as a userspace concern?
> >> >>>
> >> >>> What if we do the above, but instead of simple global 'attach first/last',
> >> >>> the default api would be:
> >> >>>
> >> >>> - attach before <target_fd>
> >> >>> - attach after <target_fd>
> >> >>> - attach before target_fd=-1 == first
> >> >>> - attach after target_fd=-1 == last
> >> >>>
> >> >>> ?
> >> >>
> >> >> Hmm, the problem with that is that applications don't generally have an
> >> >> fd to another application's BPF programs; and obtaining them from an ID
> >> >> is a privileged operation (CAP_SYS_ADMIN). We could have it be "attach
> >> >> before target *ID*" instead, which could work I guess? But then the
> >> >> problem becomes that it's racy: the ID you're targeting could get
> >> >> detached before you attach, so you'll need to be prepared to check that
> >> >> and retry; and I'm almost certain that applications won't test for this,
> >> >> so it'll just lead to hard-to-debug heisenbugs. Or am I being too
> >> >> pessimistic here?
> >> >
> >> > I like Stan's proposal and don't see any issue with FD.
> >> > It's good to gate specific sequencing with cap_sys_admin.
> >> > Also for consistency the FD is better than ID.
> >> >
> >> > I also like systemd analogy with Before=, After=.
> >> > systemd has a ton more ways to specify deps between Units,
> >> > but none of them have absolute numbers (which is what priority is).
> >> > The only bit I'd tweak in Stan's proposal is:
> >> > - attach before <target_fd>
> >> > - attach after <target_fd>
> >> > - attach before target_fd=0 == first
> >> > - attach after target_fd=0 == last
> >>
> >> I think the before(), after() could work, but the target_fd I have my doubts
> >> that it will be practical. Maybe lets walk through a concrete real example. app_a
> >> and app_b shipped via container_a resp container_b. Both want to install tc BPF
> >> and we (operator/user) want to say that prog from app_b should only be inserted
> >> after the one from app_a, never run before; if no prog_a is installed, we ofc just
> >> run prog_b, but if prog_a is inserted, it must be before prog_b given the latter
> >> can only run after the former. How would we get to one anothers target fd? One
> >> could use the 0, but not if more programs sit before/after.
> >
> > I read your desired use case several times and probably still didn't get it.
> > Sounds like prog_b can just do after(fd=0) to become last.
> > And prog_a can do before(fd=0).
> > Whichever the order of attaching (a or b) these two will always
> > be in a->b order.
>
> I agree that it's probably not feasible to have programs themselves
> coordinate between themselves except for "install me last/first" type
> semantics.
>
> I.e., the "before/after target_fd" is useful for a single application
> that wants to install two programs in a certain order. Or for bpftool
> for manual/debugging work.

yep

> System-wide policy (which includes "two containers both using BPF") is
> going to need some kind of policy agent/daemon anyway. And the in-kernel
> function override is the only feasible way to do that.

yep

> > Since the first and any prog returning !TC_NEXT will abort
> > the chain we'd need __weak nop orchestrator prog to interpret
> > retval for anything to be useful.
>
> If we also want the orchestrator to interpret return codes, that
> probably implies generating a BPF program that does the dispatching,
> right? (since the attachment is per-interface we can't reuse the same
> one). So maybe we do need to go the route of the (overridable) usermode
> helper that gets all the program FDs and generates a BPF dispatcher
> program? Or can we do this with a __weak function that emits bytecode
> inside the kernel without being unsafe?

hid-bpf, cgroup-rstat, netfilter-bpf are facing a similar issue.
The __weak override with one prog is certainly limiting.
And every case needs different demux.
I think we need to generalize xdp dispatcher to address this.
For example, for the case:
__weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
                                     struct cgroup *parent, int cpu)
{
}
we can say that 1st argument to nop function will be used as
'demuxing entity'.
Sort of like if we had added a 'prog' pointer to 'struct cgroup',
but instead of burning 8 bytes in every struct cgroup we can generate
'dispatcher asm' only for specific pointers.
In case of hid-bpf that pointer will be a pointer to hid device and
demux will be done based on device. It can be an integer too.
The subsystem that defines __weak func can pick whatever int or pointer
as a first argument and dispatcher routine will generate code:
if (arg1 == constA) progA(arg1, arg2, ...);
else if (arg1 == constB) progB(arg1, arg2, ...);
...
else nop();

This way the 'nop' property of __weak is preserved until user space
passes (constA, progA) tuple to the kernel to generate dispatcher
for that __weak hook.

> Anyway, I'm OK with deferring the orchestrator mechanism and going with
> Stanislav's proposal as an initial API.

Great. Looks like we're converging :)
Hope Daniel is ok with this direction.
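
To make the dispatcher idea above a bit more concrete, here is a rough
userspace model in plain C. It is only a sketch under assumptions: every
name in it (struct dispatcher, dispatcher_register, dispatcher_run,
prog_a, prog_b, nop_hook) is hypothetical and not an existing kernel API,
and a table scan stands in for the generated 'dispatcher asm'. The point
is only the semantics: the hook stays a nop until a (const, prog) tuple is
registered, and the first argument selects which prog runs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model only: none of these names are kernel APIs. */
typedef int (*hook_fn)(const void *arg1, int arg2);

struct dispatch_entry {
	const void *key;	/* the 'demuxing entity' compared against arg1 */
	hook_fn prog;		/* prog to run when arg1 == key */
};

#define MAX_ENTRIES 8

struct dispatcher {
	struct dispatch_entry entries[MAX_ENTRIES];
	size_t count;
};

/* Default behaviour of the hook, standing in for the empty __weak function. */
static int nop_hook(const void *arg1, int arg2)
{
	(void)arg1;
	(void)arg2;
	return 0;
}

/* Two illustrative progs with easily distinguishable return values. */
static int prog_a(const void *arg1, int arg2)
{
	(void)arg1;
	return arg2 + 1;
}

static int prog_b(const void *arg1, int arg2)
{
	(void)arg1;
	return arg2 * 2;
}

/* Models user space passing a (const, prog) tuple; returns -1 when full. */
static int dispatcher_register(struct dispatcher *d, const void *key,
			       hook_fn prog)
{
	assert(d != NULL && prog != NULL);
	if (d->count >= MAX_ENTRIES)
		return -1;
	d->entries[d->count].key = key;
	d->entries[d->count].prog = prog;
	d->count++;
	return 0;
}

/*
 * Models the generated if/else-if chain: the first entry whose key equals
 * arg1 runs; with no match the call falls through to the nop.
 */
static int dispatcher_run(const struct dispatcher *d, const void *arg1,
			  int arg2)
{
	size_t i;

	for (i = 0; i < d->count; i++)
		if (d->entries[i].key == arg1)
			return d->entries[i].prog(arg1, arg2);
	return nop_hook(arg1, arg2);
}
```

With an empty table, dispatcher_run() behaves exactly like the nop. After
registering (&dev_a, prog_a) and (&dev_b, prog_b), calls with &dev_a reach
prog_a, calls with &dev_b reach prog_b, and any other pointer still hits
the nop. A real implementation would presumably emit the compare-and-call
sequence as code instead of walking a table.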