Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> writes: > On Fri, Oct 7, 2022 at 12:37 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote: >> >> On 10/7/22 8:59 PM, Alexei Starovoitov wrote: >> > On Fri, Oct 7, 2022 at 10:20 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote: >> [...] >> >>>> I was thinking a little about how this might work; i.e., how can the >> >>>> kernel expose the required knobs to allow a system policy to be >> >>>> implemented without program loading having to talk to anything other >> >>>> than the syscall API? >> >>> >> >>>> How about we only expose prepend/append in the prog attach UAPI, and >> >>>> then have a kernel function that does the sorting like: >> >>> >> >>>> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs, struct >> >>>> bpf_prog *new_prog, bool append) >> >>> >> >>>> where the default implementation just appends/prepends to the array in >> >>>> progs depending on the value of 'appen'. >> >>> >> >>>> And then use the __weak linking trick (or maybe struct_ops with a member >> >>>> for TXC, another for XDP, etc?) to allow BPF to override the function >> >>>> wholesale and implement whatever ordering it wants? I.e., allow it can >> >>>> to just shift around the order of progs in the 'progs' array whenever a >> >>>> program is loaded/unloaded? >> >>> >> >>>> This way, a userspace daemon can implement any policy it wants by just >> >>>> attaching to that hook, and keeping things like how to express >> >>>> dependencies as a userspace concern? >> >>> >> >>> What if we do the above, but instead of simple global 'attach first/last', >> >>> the default api would be: >> >>> >> >>> - attach before <target_fd> >> >>> - attach after <target_fd> >> >>> - attach before target_fd=-1 == first >> >>> - attach after target_fd=-1 == last >> >>> >> >>> ? >> >> >> >> Hmm, the problem with that is that applications don't generally have an >> >> fd to another application's BPF programs; and obtaining them from an ID >> >> is a privileged operation (CAP_SYS_ADMIN). We could have it be "attach >> >> before target *ID*" instead, which could work I guess? But then the >> >> problem becomes that it's racy: the ID you're targeting could get >> >> detached before you attach, so you'll need to be prepared to check that >> >> and retry; and I'm almost certain that applications won't test for this, >> >> so it'll just lead to hard-to-debug heisenbugs. Or am I being too >> >> pessimistic here? >> > >> > I like Stan's proposal and don't see any issue with FD. >> > It's good to gate specific sequencing with cap_sys_admin. >> > Also for consistency the FD is better than ID. >> > >> > I also like systemd analogy with Before=, After=. >> > systemd has a ton more ways to specify deps between Units, >> > but none of them have absolute numbers (which is what priority is). >> > The only bit I'd tweak in Stan's proposal is: >> > - attach before <target_fd> >> > - attach after <target_fd> >> > - attach before target_fd=0 == first >> > - attach after target_fd=0 == last >> >> I think the before(), after() could work, but the target_fd I have my doubts >> that it will be practical. Maybe lets walk through a concrete real example. app_a >> and app_b shipped via container_a resp container_b. Both want to install tc BPF >> and we (operator/user) want to say that prog from app_b should only be inserted >> after the one from app_a, never run before; if no prog_a is installed, we ofc just >> run prog_b, but if prog_a is inserted, it must be before prog_b given the latter >> can only run after the former. How would we get to one anothers target fd? One >> could use the 0, but not if more programs sit before/after. > > I read your desired use case several times and probably still didn't get it. > Sounds like prog_b can just do after(fd=0) to become last. > And prog_a can do before(fd=0). > Whichever the order of attaching (a or b) these two will always > be in a->b order. I agree that it's probably not feasible to have programs themselves coordinate between themselves except for "install me last/first" type semantics. I.e., the "before/after target_fd" is useful for a single application that wants to install two programs in a certain order. Or for bpftool for manual/debugging work. System-wide policy (which includes "two containers both using BPF") is going to need some kind of policy agent/daemon anyway. And the in-kernel function override is the only feasible way to do that. > Since the first and any prog returning !TC_NEXT will abort > the chain we'd need __weak nop orchestrator prog to interpret > retval for anything to be useful. If we also want the orchestrator to interpret return codes, that probably implies generating a BPF program that does the dispatching, right? (since the attachment is per-interface we can't reuse the same one). So maybe we do need to go the route of the (overridable) usermode helper that gets all the program FDs and generates a BPF dispatcher program? Or can we do this with a __weak function that emits bytecode inside the kernel without being unsafe? Anyway, I'm OK with deferring the orchestrator mechanism and going with Stanislav's proposal as an initial API. -Toke