On 10/6/22 7:00 AM, Alexei Starovoitov wrote:
On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote:
[...]
I cannot help but feel that prio logic copy-paste from old tc, netfilter and friends is done because "that's how things were done in the past". imo it was a well-intentioned mistake and all networking things (tc, netfilter, etc) copy-pasted that cumbersome and hard-to-use concept. Let's throw away that baggage?

In a good set of cases the bpf prog inserter cares whether the prog is first or not, since the first prog returning anything but TC_NEXT will be final. I think prog insertion flags: 'I want to run first' vs 'I don't care about order' is good enough in practice. Any complex scheme should probably be programmable, as any policy should. For example, in Meta we have 'xdp chainer' logic that is similar to libxdp chaining, but we added a feature that allows a prog to jump over another prog and continue the chain. Priority concept cannot express that.

Since we'd have to add some "policy program" anyway for use cases like this, let's keep things as simple as possible? Then maybe we can adopt this "as-simple-as-possible" to XDP hooks? And allow bpf progs chaining in the kernel with "run_me_first" vs "run_me_anywhere" in both tcx and xdp?

Naturally the "run_me_first" prog will be the only one. No need for F_REPLACE flags, etc. The owner of "run_me_first" will update its prog through bpf_link_update. "run_me_anywhere" will add to the end of the chain.

In XDP for compatibility reasons "run_me_first" will be the default. Since only one prog can be enqueued with such a flag it will match existing single prog behavior. Well-behaving progs (like xdp-tcpdump or monitoring progs) will use "run_me_anywhere".

I know it's far from covering plenty of cases that we've discussed for a long time, but the prio concept isn't really covering them either. We've struggled enough with single xdp prog, so certainly not advocating for that.

Another alternative is to do: "queue_at_head" vs "queue_at_tail". Just as simple. Both simple versions have their pros and cons and don't cover everything, but imo both are better than prio.
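
For illustration only, the "run_me_first" vs "run_me_anywhere" semantics described above could be sketched in plain userspace C roughly as follows. This is not kernel code and not an existing API: struct chain, chain_attach(), chain_run(), the enum names, the verdict values, the CHAIN_MAX limit and the error codes are all assumptions made up for the sketch.

```c
#include <stddef.h>
#include <errno.h>

#define CHAIN_MAX 8 /* arbitrary limit for this sketch */

/* Hypothetical attach modes, named after the flags floated above. */
enum attach_mode { RUN_ME_FIRST, RUN_ME_ANYWHERE };

/* Illustrative verdicts: TC_NEXT hands over to the next prog,
 * anything else ends the chain. The numeric values are assumed. */
enum verdict { TC_NEXT = -1, TC_PASS = 0, TC_DROP = 2 };

typedef int (*prog_fn)(void *pkt);

struct chain {
	prog_fn progs[CHAIN_MAX];
	size_t  count;
	int     has_first; /* set once a RUN_ME_FIRST prog owns slot 0 */
};

/* Attach a prog: RUN_ME_FIRST takes slot 0 and may exist only once,
 * RUN_ME_ANYWHERE is appended to the end of the chain. */
static int chain_attach(struct chain *c, prog_fn fn, enum attach_mode mode)
{
	if (c->count == CHAIN_MAX)
		return -ENOSPC;
	if (mode == RUN_ME_FIRST) {
		if (c->has_first)
			return -EBUSY; /* only one "first" prog allowed */
		for (size_t i = c->count; i > 0; i--)
			c->progs[i] = c->progs[i - 1];
		c->progs[0] = fn;
		c->has_first = 1;
	} else {
		c->progs[c->count] = fn;
	}
	c->count++;
	return 0;
}

/* Run progs in order; the first verdict other than TC_NEXT is final. */
static int chain_run(const struct chain *c, void *pkt)
{
	for (size_t i = 0; i < c->count; i++) {
		int ret = c->progs[i](pkt);

		if (ret != TC_NEXT)
			return ret;
	}
	return TC_NEXT;
}

/* Two toy progs used to exercise the chain. */
static int monitor(void *pkt)     { (void)pkt; return TC_NEXT; }
static int policy_drop(void *pkt) { (void)pkt; return TC_DROP; }
```

With this model, attaching monitor() as RUN_ME_ANYWHERE and then policy_drop() as RUN_ME_FIRST puts policy_drop() in slot 0, and a second RUN_ME_FIRST attach is rejected. Updating the "first" prog in place (the bpf_link_update case) is left out of the sketch.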
Yeah, it's kind of tricky, imho. The 'run_me_first' vs 'run_me_anywhere' are two use cases that should be covered (and actually we kind of do this in this set, too, with the prios via prio=x vs prio=0). Given users will only be consuming the APIs via libs like libbpf, this can also be abstracted this way w/o users having to be aware of prios.

Anyway, where it gets tricky would be when things depend on ordering, e.g. you have BPF progs doing: policy, monitoring, lb, monitoring, encryption, which would be sth you can build today via tc BPF: the policy one acts as a prefilter for various cidr ranges that should be blocked no matter what, then monitoring to sample what goes into the lb, then the lb itself which does snat/dnat, then monitoring to see what the corresponding pkt that goes to the backend looks like, and maybe encryption to e.g. send the result to a wireguard dev, so it's encrypted from lb node to backend.

For such an example, you'd need prios as the 'run_me_anywhere' doesn't guarantee order, so there's a case for both scenarios (concrete layout vs loose one), and for the latter we could start off with an internal prio around x (e.g. 16k), so there's room to attach in front via fixed prio, but also append to end for 'don't care', and that could be from lib pov the default/main API whereas prio would be some kind of extended one.

Thoughts?

Thanks,
Daniel
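
To make the prio=x vs prio=0 idea above a bit more concrete, here is a rough userspace sketch in C. It is an assumption-laden illustration, not the code from the patch set: struct prio_chain, prio_attach(), the exact PRIO_DEFAULT value, the rule for picking a tail prio, and the choice to reject duplicate prios are all invented for the example.

```c
#include <stddef.h>
#include <errno.h>

#define CHAIN_MAX    8      /* arbitrary limit for this sketch */
#define PRIO_DEFAULT 16384u /* "around 16k"; exact value is assumed */

struct entry {
	unsigned int prio;
	const char  *name;
};

struct prio_chain {
	struct entry e[CHAIN_MAX]; /* kept sorted by ascending prio */
	size_t       count;
};

/* Attach with a fixed prio (prio > 0), or with prio == 0 meaning
 * "don't care": the entry is then appended at the end, starting from
 * PRIO_DEFAULT so fixed prios below it can still go in front.
 * Overflow of the auto-assigned prio is not handled here. */
static int prio_attach(struct prio_chain *c, const char *name,
		       unsigned int prio)
{
	size_t pos;

	if (c->count == CHAIN_MAX)
		return -ENOSPC;
	if (prio == 0) {
		unsigned int tail = c->count ? c->e[c->count - 1].prio : 0;

		prio = tail >= PRIO_DEFAULT ? tail + 1 : PRIO_DEFAULT;
	}
	/* Find the first slot whose prio is not lower than ours. */
	for (pos = 0; pos < c->count && c->e[pos].prio < prio; pos++)
		;
	if (pos < c->count && c->e[pos].prio == prio)
		return -EEXIST; /* assumed policy: prios are unique */
	for (size_t i = c->count; i > pos; i--)
		c->e[i] = c->e[i - 1];
	c->e[pos].prio = prio;
	c->e[pos].name = name;
	c->count++;
	return 0;
}
```

In this sketch, a monitoring prog attached with prio 0 ends up at 16384, a later policy prog attached with prio 1 still lands in front of it, and another prio 0 attach is placed at 16385, i.e. at the end. A lib could expose the prio 0 path as its default attach call and keep the explicit prio as the extended variant.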