On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote:
> +
> +static int __xtc_prog_attach(struct net_device *dev, bool ingress, u32 limit,
> +			     struct bpf_prog *nprog, u32 prio, u32 flags)
> +{
> +	struct bpf_prog_array_item *item, *tmp;
> +	struct xtc_entry *entry, *peer;
> +	struct bpf_prog *oprog;
> +	bool created;
> +	int i, j;
> +
> +	ASSERT_RTNL();
> +
> +	entry = dev_xtc_entry_fetch(dev, ingress, &created);
> +	if (!entry)
> +		return -ENOMEM;
> +	for (i = 0; i < limit; i++) {
> +		item = &entry->items[i];
> +		oprog = item->prog;
> +		if (!oprog)
> +			break;
> +		if (item->bpf_priority == prio) {
> +			if (flags & BPF_F_REPLACE) {
> +				/* Pairs with READ_ONCE() in xtc_run_progs(). */
> +				WRITE_ONCE(item->prog, nprog);
> +				bpf_prog_put(oprog);
> +				dev_xtc_entry_prio_set(entry, prio, nprog);
> +				return prio;
> +			}
> +			return -EBUSY;
> +		}
> +	}
> +	if (dev_xtc_entry_total(entry) >= limit)
> +		return -ENOSPC;
> +	prio = dev_xtc_entry_prio_new(entry, prio, nprog);
> +	if (prio < 0) {
> +		if (created)
> +			dev_xtc_entry_free(entry);
> +		return -ENOMEM;
> +	}
> +	peer = dev_xtc_entry_peer(entry);
> +	dev_xtc_entry_clear(peer);
> +	for (i = 0, j = 0; i < limit; i++, j++) {
> +		item = &entry->items[i];
> +		tmp = &peer->items[j];
> +		oprog = item->prog;
> +		if (!oprog) {
> +			if (i == j) {
> +				tmp->prog = nprog;
> +				tmp->bpf_priority = prio;
> +			}
> +			break;
> +		} else if (item->bpf_priority < prio) {
> +			tmp->prog = oprog;
> +			tmp->bpf_priority = item->bpf_priority;
> +		} else if (item->bpf_priority > prio) {
> +			if (i == j) {
> +				tmp->prog = nprog;
> +				tmp->bpf_priority = prio;
> +				tmp = &peer->items[++j];
> +			}
> +			tmp->prog = oprog;
> +			tmp->bpf_priority = item->bpf_priority;
> +		}
> +	}
> +	dev_xtc_entry_update(dev, peer, ingress);
> +	if (ingress)
> +		net_inc_ingress_queue();
> +	else
> +		net_inc_egress_queue();
> +	xtc_inc();
> +	return prio;
> +}
...
> +static __always_inline enum tc_action_base
> +xtc_run(const struct xtc_entry *entry, struct sk_buff *skb,
> +	const bool needs_mac)
> +{
> +	const struct bpf_prog_array_item *item;
> +	const struct bpf_prog *prog;
> +	int ret = TC_NEXT;
> +
> +	if (needs_mac)
> +		__skb_push(skb, skb->mac_len);
> +	item = &entry->items[0];
> +	while ((prog = READ_ONCE(item->prog))) {
> +		bpf_compute_data_pointers(skb);
> +		ret = bpf_prog_run(prog, skb);
> +		if (ret != TC_NEXT)
> +			break;
> +		item++;
> +	}
> +	if (needs_mac)
> +		__skb_pull(skb, skb->mac_len);
> +	return xtc_action_code(skb, ret);
> +}

I cannot help but feel that the prio logic copy-pasted from old tc, netfilter
and friends was done because "that's how things were done in the past".
imo it was a well-intentioned mistake, and all networking things (tc,
netfilter, etc) copy-pasted that cumbersome and hard-to-use concept.
Let's throw away that baggage?

In a good set of cases the bpf prog inserter cares whether the prog is
first or not, since the first prog returning anything but TC_NEXT will be
final. I think prog insertion flags, 'I want to run first' vs 'I don't
care about order', are good enough in practice. Any complex scheme should
probably be programmable, as any policy should be. For example, in Meta we
have 'xdp chainer' logic that is similar to libxdp chaining, but we added
a feature that allows a prog to jump over another prog and continue the
chain. The priority concept cannot express that. Since we'd have to add
some "policy program" anyway for use cases like this, let's keep things as
simple as possible?
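To make the "jump over" point concrete, here is a minimal standalone sketch
(plain C model, not kernel code; TC_NEXT_SKIP and the prog_fn plumbing are
made up for illustration) of a chain runner where the first prog returning
anything other than TC_NEXT is final, and a hypothetical skip verdict hops
over the immediate successor -- something a flat prio list cannot express:

  #include <stdio.h>

  /* Hypothetical verdicts: TC_NEXT continues the chain; the made-up
   * TC_NEXT_SKIP hops over the immediate successor; anything else is final.
   */
  enum verdict { TC_NEXT, TC_NEXT_SKIP, TC_PASS, TC_DROP };

  typedef enum verdict (*prog_fn)(void *ctx);

  static enum verdict run_chain(prog_fn *progs, int n, void *ctx)
  {
  	int i = 0;

  	while (i < n) {
  		enum verdict v = progs[i](ctx);

  		if (v == TC_NEXT)
  			i += 1;		/* fall through to the next prog */
  		else if (v == TC_NEXT_SKIP)
  			i += 2;		/* jump over the next prog */
  		else
  			return v;	/* first real verdict is final */
  	}
  	return TC_PASS;			/* chain exhausted: default pass */
  }

  static enum verdict monitor(void *ctx) { (void)ctx; puts("monitor"); return TC_NEXT; }
  static enum verdict chainer(void *ctx) { (void)ctx; puts("chainer"); return TC_NEXT_SKIP; }
  static enum verdict legacy(void *ctx)  { (void)ctx; puts("legacy");  return TC_DROP; }

  int main(void)
  {
  	prog_fn chain[] = { monitor, chainer, legacy };

  	/* chainer skips legacy, so the chain ends with the default TC_PASS */
  	return run_chain(chain, 3, NULL) == TC_PASS ? 0 : 1;
  }

The point is only that such "jump" semantics live in the run loop (i.e. in a
policy program), not in a static priority number assigned at attach time.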
Then maybe we can adopt this "as-simple-as-possible" to XDP hooks? And allow
bpf progs chaining in the kernel with "run_me_first" vs "run_me_anywhere" in
both tcx and xdp? Naturally the "run_me_first" prog will be the only one. No
need for F_REPLACE flags, etc. The owner of "run_me_first" will update its
prog through bpf_link_update. "run_me_anywhere" will add to the end of the
chain.

In XDP, for compatibility reasons, "run_me_first" will be the default. Since
only one prog can be enqueued with such a flag, it will match the existing
single-prog behavior. Well-behaving progs (like xdp-tcpdump or monitoring
progs) will use "run_me_anywhere".

I know it's far from covering plenty of cases that we've discussed for a
long time, but the prio concept isn't really covering them either. We've
struggled enough with a single xdp prog, so certainly not advocating for
that.

Another alternative is to do: "queue_at_head" vs "queue_at_tail". Just as
simple. Both simple versions have their pros and cons and don't cover
everything, but imo both are better than prio.
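For illustration, a minimal standalone model of the attach-side semantics
(plain C, not a proposed kernel API; RUN_FIRST/RUN_ANYWHERE and the xchain_*
names are made up): RUN_FIRST can be taken only once and pins slot 0,
RUN_ANYWHERE appends at the tail, and replacing the head prog goes through
the owner's link update rather than an F_REPLACE-style flag:

  #include <errno.h>

  #define MAX_PROGS 16

  /* Hypothetical attach flags, not a real kernel API. */
  enum attach_flag { RUN_FIRST, RUN_ANYWHERE };

  struct xchain {
  	void *progs[MAX_PROGS];	/* slot 0 reserved for the RUN_FIRST prog */
  	int tail;		/* next free tail slot, starts at 1 */
  	int has_first;
  };

  static int xchain_attach(struct xchain *c, void *prog, enum attach_flag flag)
  {
  	if (flag == RUN_FIRST) {
  		/* Only one head prog; its owner replaces it via link
  		 * update, so no F_REPLACE-style flag is needed here. */
  		if (c->has_first)
  			return -EBUSY;
  		c->progs[0] = prog;
  		c->has_first = 1;
  		return 0;
  	}
  	/* RUN_ANYWHERE: append at the tail of the chain. */
  	if (c->tail >= MAX_PROGS)
  		return -ENOSPC;
  	c->progs[c->tail++] = prog;
  	return 0;
  }

Usage would be `struct xchain c = { .tail = 1 };` followed by attach calls;
the run loop just skips a NULL slot 0 when no head prog is attached. The
"queue_at_head" variant would differ only in the RUN_FIRST branch: instead
of returning -EBUSY it would shift the existing progs right and insert at
slot 0.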