sdf@xxxxxxxxxx writes:

> On 10/07, Toke Høiland-Jørgensen wrote:
>> Daniel Borkmann <daniel@xxxxxxxxxxxxx> writes:
>
>> > On 10/7/22 1:28 AM, Alexei Starovoitov wrote:
>> >> On Thu, Oct 6, 2022 at 2:29 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>> >>> On 10/6/22 7:00 AM, Alexei Starovoitov wrote:
>> >>>> On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote:
>> >>> [...]
>> >>>>
>> >>>> I cannot help but feel that the prio logic copy-paste from old tc, netfilter and friends
>> >>>> is done because "that's how things were done in the past".
>> >>>> imo it was a well-intentioned mistake and all networking things (tc, netfilter, etc)
>> >>>> copy-pasted that cumbersome and hard to use concept.
>> >>>> Let's throw away that baggage?
>> >>>> In a good set of cases the bpf prog inserter cares whether the prog is first or not,
>> >>>> since the first prog returning anything but TC_NEXT will be final.
>> >>>> I think prog insertion flags: 'I want to run first' vs 'I don't care about order'
>> >>>> is good enough in practice. Any complex scheme should probably be programmable,
>> >>>> as any policy should. For example, in Meta we have 'xdp chainer' logic that is similar
>> >>>> to libxdp chaining, but we added a feature that allows a prog to jump over another
>> >>>> prog and continue the chain. The priority concept cannot express that.
>> >>>> Since we'd have to add some "policy program" anyway for use cases like this,
>> >>>> let's keep things as simple as possible?
>> >>>> Then maybe we can adopt this "as-simple-as-possible" for XDP hooks too?
>> >>>> And allow bpf prog chaining in the kernel with "run_me_first" vs "run_me_anywhere"
>> >>>> in both tcx and xdp?
>> >>>> Naturally the "run_me_first" prog will be the only one. No need for F_REPLACE flags, etc.
>> >>>> The owner of "run_me_first" will update its prog through bpf_link_update.
>> >>>> "run_me_anywhere" will add to the end of the chain.
>> >>>> In XDP, for compatibility reasons, "run_me_first" will be the default.
>> >>>> Since only one prog can be enqueued with such a flag, it will match the existing single-prog behavior.
>> >>>> Well-behaved progs (like xdp-tcpdump or monitoring progs) will use "run_me_anywhere".
>> >>>> I know it's far from covering plenty of the cases that we've discussed for a long time,
>> >>>> but the prio concept isn't really covering them either.
>> >>>> We've struggled enough with a single xdp prog, so certainly not advocating for that.
>> >>>> Another alternative is to do "queue_at_head" vs "queue_at_tail". Just as simple.
>> >>>> Both simple versions have their pros and cons and don't cover everything,
>> >>>> but imo both are better than prio.
>> >>>
>> >>> Yeah, it's kind of tricky, imho. The 'run_me_first' vs 'run_me_anywhere' are two
>> >>> use cases that should be covered (and actually we kind of do this in this set, too,
>> >>> with the prios via prio=x vs prio=0). Given users will only be consuming the APIs
>> >>> via libs like libbpf, this can also be abstracted this way w/o users having to be
>> >>> aware of prios.
>> >>
>> >> but the patchset tells a different story.
>> >> Prio gets exposed everywhere in uapi, all the way to bpftool,
>> >> when it's right there for users to understand.
>> >> And that's the main problem with it.
>> >> The user doesn't want to and doesn't need to be aware of it,
>> >> but the uapi forces them to pick a priority.
>> >>
>> >>> Anyway, where it gets tricky would be when things depend on ordering, e.g.
>> >>> you have BPF progs doing: policy, monitoring, lb, monitoring, encryption, which
>> >>> would be sth you can build today via tc BPF: so the policy one acts as a prefilter for
>> >>> various cidr ranges that should be blocked no matter what, then monitoring to sample
>> >>> what goes into the lb, then the lb itself which does snat/dnat, then monitoring to see
>> >>> what the corresponding pkt that goes to the backend looks like, and maybe encryption to
>> >>> e.g. send the result to a wireguard dev, so it's encrypted from the lb node to the backend.
>> >>
>> >> That's all theory. Your cover letter example proves that in
>> >> real life different services pick the same priority.
>> >> They simply don't know any better.
>> >> prio is an unnecessary magic that apps _have_ to pick,
>> >> so they just copy-paste and everyone ends up using the same.
>> >>
>> >>> For such an
>> >>> example, you'd need prios, as 'run_me_anywhere' doesn't guarantee order, so there's
>> >>> a case for both scenarios (a concrete layout vs a loose one), and for the latter we could
>> >>> start off with an internal prio around x (e.g. 16k), so there's room to attach in
>> >>> front via fixed prio, but also append to the end for 'don't care', and that could be,
>> >>> from a lib pov, the default/main API whereas prio would be some kind of extended one.
>> >>> Thoughts?
>> >>
>> >> If prio was not part of uapi, but kernel-internal somehow,
>> >> and there was a user space daemon, systemd, or another bpf prog,
>> >> module, whatever, that users would interface with, then
>> >> the proposed implementation of prio would totally make sense.
>> >> prio as uapi is not that.
>> >
>> > A good analogy to this issue might be systemd's unit files.. you specify dependencies
>> > for your own <unit> file via 'Wants=<unitA>', and ordering via 'Before=<unitB>' and
>> > 'After=<unitC>', and they refer to other unit files. I think that is generally okay,
>> > you don't deal with prio numbers, but rather some kind of textual representation. However,
>> > the user/operator will have to deal with dependencies/ordering one way or another; the
>> > problem here is that we deal with the kernel, and the loader talks to the kernel directly,
>> > so it has no awareness of what else is running or could be running, so apps need to deal
>> > with it somehow (and they cannot without external help).
>
>> I was thinking a little about how this might work; i.e., how can the
>> kernel expose the required knobs to allow a system policy to be
>> implemented without program loading having to talk to anything other
>> than the syscall API?
>
>> How about we only expose prepend/append in the prog attach UAPI, and
>> then have a kernel function that does the sorting, like:
>
>> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs,
>>                          struct bpf_prog *new_prog, bool append)
>
>> where the default implementation just appends/prepends to the array in
>> progs depending on the value of 'append'.
>
>> And then use the __weak linking trick (or maybe struct_ops with a member
>> for TCX, another for XDP, etc?) to allow BPF to override the function
>> wholesale and implement whatever ordering it wants? I.e., allow it
>> to just shift around the order of progs in the 'progs' array whenever a
>> program is loaded/unloaded?
>
>> This way, a userspace daemon can implement any policy it wants by just
>> attaching to that hook, and keeping things like how to express
>> dependencies as a userspace concern?
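
(To make that last part a bit more concrete, here is a rough sketch of what
the default, overridable implementation could look like. It's purely an
illustration: the array-of-prog-pointers layout and the assumption that the
caller has already reserved room for one more entry are mine, not anything
from the patchset.)

#include <linux/filter.h>   /* struct bpf_prog */
#include <linux/string.h>   /* memmove() */

/* Default policy: a new prog either goes to the head or the tail of the
 * chain. A policy agent (a BPF prog via struct_ops, a module, ...) could
 * override this __weak default and reorder the array however it likes.
 */
__weak int bpf_add_new_tcx_prog(struct bpf_prog **progs, size_t num_progs,
				struct bpf_prog *new_prog, bool append)
{
	if (append) {
		/* "don't care about order": append to the end of the chain */
		progs[num_progs] = new_prog;
	} else {
		/* "run me first": shift the existing chain down one slot */
		memmove(&progs[1], &progs[0], num_progs * sizeof(*progs));
		progs[0] = new_prog;
	}
	return 0;
}

The point being that the attach UAPI itself never grows beyond
prepend/append; anything smarter lives entirely in whatever overrides this
function.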
>
> What if we do the above, but instead of a simple global 'attach first/last',
> the default api would be:
>
> - attach before <target_fd>
> - attach after <target_fd>
> - attach before target_fd=-1 == first
> - attach after target_fd=-1 == last
>
> ?

Hmm, the problem with that is that applications don't generally have an fd
to another application's BPF programs; and obtaining one from an ID is a
privileged operation (CAP_SYS_ADMIN).

We could have it be "attach before target *ID*" instead, which could work,
I guess? But then the problem becomes that it's racy: the ID you're
targeting could get detached before you attach, so you'll need to be
prepared to check for that and retry; and I'm almost certain that
applications won't test for this, so it'll just lead to hard-to-debug
heisenbugs.

Or am I being too pessimistic here?

-Toke
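
P.S. To make the raciness concrete, this is roughly the retry dance every
application would need to get right with an "attach before this program ID"
style API. Both helpers below are made up, purely stand-ins for whatever the
eventual syscall/libbpf wrappers would be:

#include <errno.h>
#include <linux/types.h>

/* Hypothetical wrappers, nothing like this exists today: look up the prog
 * ID we want to attach in front of (0 meaning "no target, just append"),
 * and attach relative to that ID.
 */
extern __u32 find_target_prog_id(int ifindex);
extern int tcx_attach_before_id(int prog_fd, int ifindex, __u32 target_id);

static int attach_with_retry(int prog_fd, int ifindex)
{
	int err;

	do {
		__u32 target_id = find_target_prog_id(ifindex);

		err = tcx_attach_before_id(prog_fd, ifindex, target_id);
		/* The target can be detached between the lookup and the
		 * attach, so -ENOENT has to mean "look again", not "give
		 * up"; this is exactly the part applications will forget.
		 */
	} while (err == -ENOENT);

	return err;
}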