On Fri, Oct 7, 2022 at 10:20 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>
> sdf@xxxxxxxxxx writes:
>
> > On 10/07, Toke Høiland-Jørgensen wrote:
> >> Daniel Borkmann <daniel@xxxxxxxxxxxxx> writes:
> >
> >> > On 10/7/22 1:28 AM, Alexei Starovoitov wrote:
> >> >> On Thu, Oct 6, 2022 at 2:29 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
> >> >>> On 10/6/22 7:00 AM, Alexei Starovoitov wrote:
> >> >>>> On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote:
> >> >>> [...]
> >> >>>>
> >> >>>> I cannot help but feel that the prio logic copy-paste from old tc,
> >> >>>> netfilter and friends is done because "that's how things were done
> >> >>>> in the past". imo it was a well intentioned mistake and all
> >> >>>> networking things (tc, netfilter, etc) copy-pasted that cumbersome
> >> >>>> and hard to use concept. Let's throw away that baggage?
> >> >>>> In a good set of cases the bpf prog inserter cares whether the
> >> >>>> prog is first or not, since the first prog returning anything but
> >> >>>> TC_NEXT will be final. I think prog insertion flags: 'I want to
> >> >>>> run first' vs 'I don't care about order' is good enough in
> >> >>>> practice. Any complex scheme should probably be programmable, as
> >> >>>> any policy should. For example in Meta we have 'xdp chainer' logic
> >> >>>> that is similar to libxdp chaining, but we added a feature that
> >> >>>> allows a prog to jump over another prog and continue the chain.
> >> >>>> The priority concept cannot express that. Since we'd have to add
> >> >>>> some "policy program" anyway for use cases like this, let's keep
> >> >>>> things as simple as possible? Then maybe we can adopt this
> >> >>>> "as-simple-as-possible" for XDP hooks too, and allow bpf prog
> >> >>>> chaining in the kernel with "run_me_first" vs "run_me_anywhere"
> >> >>>> in both tcx and xdp? Naturally the "run_me_first" prog will be
> >> >>>> the only one; no need for F_REPLACE flags, etc. The owner of
> >> >>>> "run_me_first" will update its prog through bpf_link_update.
> >> >>>> "run_me_anywhere" will add to the end of the chain. In XDP, for
> >> >>>> compatibility reasons, "run_me_first" will be the default; since
> >> >>>> only one prog can be enqueued with such a flag, it will match the
> >> >>>> existing single-prog behavior. Well-behaving progs (like
> >> >>>> xdp-tcpdump or monitoring progs) will use "run_me_anywhere".
> >> >>>> I know it's far from covering plenty of the cases that we've
> >> >>>> discussed for a long time, but the prio concept isn't really
> >> >>>> covering them either. We've struggled enough with a single xdp
> >> >>>> prog, so I'm certainly not advocating for that. Another
> >> >>>> alternative is to do "queue_at_head" vs "queue_at_tail". Just as
> >> >>>> simple. Both simple versions have their pros and cons and don't
> >> >>>> cover everything, but imo both are better than prio.
> >> >>>
> >> >>> Yeah, it's kind of tricky, imho. The 'run_me_first' vs
> >> >>> 'run_me_anywhere' are two use cases that should be covered (and we
> >> >>> actually kind of do this in this set, too, with the prios via
> >> >>> prio=x vs prio=0). Given users will only be consuming the APIs via
> >> >>> libs like libbpf, this can also be abstracted this way w/o users
> >> >>> having to be aware of prios.
> >> >>
> >> >> but the patchset tells a different story.
> >> >> Prio gets exposed everywhere in uapi, all the way to bpftool,
> >> >> when it's right there for users to understand.
> >> >> And that's the main problem with it.
> >> >> The user doesn't want to and doesn't need to be aware of it,
> >> >> but uapi forces them to pick the priority.
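[Aside, to make the quoted "run_me_first" / "run_me_anywhere" idea above
concrete from the loader side: the sketch below is illustration only,
nothing here is existing UAPI, and all flag names are made up.]

    /* Hypothetical attach flags -- invented names, not real UAPI: */
    #define BPF_F_RUN_FIRST     (1U << 0)  /* at most one per hook */
    #define BPF_F_RUN_ANYWHERE  (1U << 1)  /* appended at the tail */

    /* A well-behaved monitoring prog would then attach roughly like
     * (BPF_TCX_INGRESS is likewise assumed here, it doesn't exist yet):
     */
    LIBBPF_OPTS(bpf_link_create_opts, opts,
            .flags = BPF_F_RUN_ANYWHERE);
    int link_fd = bpf_link_create(prog_fd, ifindex,
                                  BPF_TCX_INGRESS, &opts);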
> >> >>>
> >> >>> Anyway, where it gets tricky would be when things depend on
> >> >>> ordering, e.g. you have BPF progs doing: policy, monitoring, lb,
> >> >>> monitoring, encryption, which would be sth you can build today via
> >> >>> tc BPF: so the policy one acts as a prefilter for various cidr
> >> >>> ranges that should be blocked no matter what, then monitoring to
> >> >>> sample what goes into the lb, then the lb itself which does
> >> >>> snat/dnat, then monitoring to see what the corresponding pkt that
> >> >>> goes to the backend looks like, and maybe encryption to e.g. send
> >> >>> the result to a wireguard dev, so it's encrypted from the lb node
> >> >>> to the backend.
> >> >>
> >> >> That's all theory. Your cover letter example proves that in real
> >> >> life different services pick the same priority. They simply don't
> >> >> know any better. prio is an unnecessary magic that apps _have_ to
> >> >> pick, so they just copy-paste and everyone ends up using the same.
> >> >>
> >> >>> For such an example, you'd need prios, as 'run_me_anywhere'
> >> >>> doesn't guarantee order, so there's a case for both scenarios (a
> >> >>> concrete layout vs a loose one), and for the latter we could start
> >> >>> off with an internal prio around x (e.g. 16k), so there's room to
> >> >>> attach in front via fixed prio, but also to append to the end for
> >> >>> 'don't care', and that could be, from the lib pov, the
> >> >>> default/main API whereas prio would be some kind of extended one.
> >> >>> Thoughts?
> >> >>
> >> >> If prio was not part of uapi, but kernel internal somehow, and
> >> >> there was a user space daemon, systemd, or another bpf prog,
> >> >> module, whatever, that users would interface with, then the
> >> >> proposed implementation of prio would totally make sense.
> >> >> prio as uapi is not that.
> >> >
> >> > A good analogy to this issue might be systemd's unit files... you
> >> > specify dependencies for your own <unit> file via 'Wants=<unitA>',
> >> > and ordering via 'Before=<unitB>' and 'After=<unitC>', and they
> >> > refer to other unit files. I think that is generally okay; you don't
> >> > deal with prio numbers, but rather some kind of textual
> >> > representation. However, the user/operator will have to deal with
> >> > dependencies/ordering one way or another. The problem here is that
> >> > we deal with the kernel, and the loader talks to the kernel
> >> > directly, so it has no awareness of what else is running or could be
> >> > running, so apps need to deal with it somehow (and they cannot
> >> > without external help).
> >
> >> I was thinking a little about how this might work; i.e., how can the
> >> kernel expose the required knobs to allow a system policy to be
> >> implemented without program loading having to talk to anything other
> >> than the syscall API?
> >
> >> How about we only expose prepend/append in the prog attach UAPI, and
> >> then have a kernel function that does the sorting like:
> >
> >> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs,
> >>                          struct bpf_prog *new_prog, bool append)
> >
> >> where the default implementation just appends/prepends to the array
> >> in progs depending on the value of 'append'.
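[For concreteness, here is a guess at what that weak default could look
like; the signature is the one quoted above (taken as an array of
pointers), while the body, the limit constant, and the error handling are
invented for illustration:]

    #include <linux/bpf.h>
    #include <linux/string.h>

    #define TCX_MAX_PROGS 64  /* hypothetical chain limit */

    __weak int bpf_add_new_tcx_prog(struct bpf_prog **progs,
                                    size_t num_progs,
                                    struct bpf_prog *new_prog, bool append)
    {
            if (num_progs >= TCX_MAX_PROGS)
                    return -ENOSPC;
            if (append) {
                    /* 'don't care about order': run at the end */
                    progs[num_progs] = new_prog;
            } else {
                    /* 'run first': shift existing progs down by one */
                    memmove(&progs[1], &progs[0],
                            num_progs * sizeof(*progs));
                    progs[0] = new_prog;
            }
            return 0;
    }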
> >> And then use the __weak linking trick (or maybe struct_ops with a
> >> member for tcx, another for XDP, etc?) to allow BPF to override the
> >> function wholesale and implement whatever ordering it wants? I.e.,
> >> allow it to just shift around the order of progs in the 'progs' array
> >> whenever a program is loaded/unloaded?
> >
> >> This way, a userspace daemon can implement any policy it wants by
> >> just attaching to that hook, and keeping things like how to express
> >> dependencies as a userspace concern?
> >
> > What if we do the above, but instead of a simple global 'attach
> > first/last', the default api would be:
> >
> > - attach before <target_fd>
> > - attach after <target_fd>
> > - attach before target_fd=-1 == first
> > - attach after target_fd=-1 == last
> >
> > ?
>
> Hmm, the problem with that is that applications don't generally have an
> fd to another application's BPF programs; and obtaining them from an ID
> is a privileged operation (CAP_SYS_ADMIN). We could have it be "attach
> before target *ID*" instead, which could work I guess? But then the
> problem becomes that it's racy: the ID you're targeting could get
> detached before you attach, so you'll need to be prepared to check that
> and retry; and I'm almost certain that applications won't test for
> this, so it'll just lead to hard-to-debug heisenbugs. Or am I being too
> pessimistic here?

I like Stan's proposal and don't see any issue with FD.
It's good to gate specific sequencing with cap_sys_admin.
Also, for consistency, the FD is better than the ID.

I also like the systemd analogy with Before=, After=.
systemd has a ton more ways to specify deps between units,
but none of them have absolute numbers (which is what priority is).

The only bit I'd tweak in Stan's proposal is:
- attach before <target_fd>
- attach after <target_fd>
- attach before target_fd=0 == first
- attach after target_fd=0 == last

The attach operation needs to be CAP_NET_ADMIN,
just like we do for BPF_PROG_TYPE_CGROUP_SKB.
And we can do the same logic for XDP attaching.

Eventually we can add a __weak "orchestrator prog", but it would need to
not only order progs, it should also interpret enum tc_action_base
return codes at run-time between progs.
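To illustrate the tweaked semantics (nothing below is existing UAPI; the
flag, the relative_fd field, and BPF_TCX_INGRESS are invented names for
what such an API might look like), attaching "after prog X, else at the
tail" from the loader side could be roughly:

    #include <bpf/bpf.h>
    #include <bpf/libbpf.h>

    /* Sketch only: attach prog_fd to the tcx chain of ifindex, ordered
     * after target_fd; target_fd == 0 means "append at the very end". */
    static int attach_after(int prog_fd, int ifindex, int target_fd)
    {
            LIBBPF_OPTS(bpf_link_create_opts, opts,
                    .flags = BPF_F_AFTER,           /* or BPF_F_BEFORE */
                    .tcx.relative_fd = target_fd);  /* 0 == last/first */

            return bpf_link_create(prog_fd, ifindex,
                                   BPF_TCX_INGRESS, &opts);
    }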