On Tue, Oct 4, 2022 at 4:12 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote: > > This work adds BPF links for tc. As a recap, a BPF link represents the attachment > of a BPF program to a BPF hook point. The BPF link holds a single reference to > keep BPF program alive. Moreover, hook points do not reference a BPF link, only > the application's fd or pinning does. A BPF link holds meta-data specific to > attachment and implements operations for link creation, (atomic) BPF program > update, detachment and introspection. > > The motivation for BPF links for tc BPF programs is multi-fold, for example: > > - "It's especially important for applications that are deployed fleet-wide > and that don't "control" hosts they are deployed to. If such application > crashes and no one notices and does anything about that, BPF program will > keep running draining resources or even just, say, dropping packets. We > at FB had outages due to such permanent BPF attachment semantics. With > fd-based BPF link we are getting a framework, which allows safe, auto- > detachable behavior by default, unless application explicitly opts in by > pinning the BPF link." [0] > > - From Cilium-side the tc BPF programs we attach to host-facing veth devices > and phys devices build the core datapath for Kubernetes Pods, and they > implement forwarding, load-balancing, policy, EDT-management, etc, within > BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently > experienced hard-to-debug issues in a user's staging environment where > another Kubernetes application using tc BPF attached to the same prio/handle > of cls_bpf, wiping all Cilium-based BPF programs from underneath it. The > goal is to establish a clear/safe ownership model via links which cannot > accidentally be overridden. [1] > > BPF links for tc can co-exist with non-link attachments, and the semantics are > in line also with XDP links: BPF links cannot replace other BPF links, BPF > links cannot replace non-BPF links, non-BPF links cannot replace BPF links and > lastly only non-BPF links can replace non-BPF links. In case of Cilium, this > would solve mentioned issue of safe ownership model as 3rd party applications > would not be able to accidentally wipe Cilium programs, even if they are not > BPF link aware. > > Earlier attempts [2] have tried to integrate BPF links into core tc machinery > to solve cls_bpf, which has been intrusive to the generic tc kernel API with > extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could > be wiped from the qdisc also. Locking a tc BPF program in place this way, is > getting into layering hacks given the two object models are vastly different. > We chose to implement a prerequisite of the fd-based tc BPF attach API, so > that the BPF link implementation fits in naturally similar to other link types > which are fd-based and without the need for changing core tc internal APIs. > > BPF programs for tc can then be successively migrated from cls_bpf to the new > tc BPF link without needing to change the program's source code, just the BPF > loader mechanics for attaching. > > [0] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@xxxxxxxxxxxxxx/ > [1] https://lpc.events/event/16/contributions/1353/ > [2] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@xxxxxxxxx/ > > Co-developed-by: Nikolay Aleksandrov <razor@xxxxxxxxxxxxx> > Signed-off-by: Nikolay Aleksandrov <razor@xxxxxxxxxxxxx> > Signed-off-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx> > --- have you considered supporting BPF cookie from the outset? It should be trivial if you remove union from bpf_prog_array_item. If not, then we should reject LINK_CREATE if bpf_cookie is non-zero. > include/linux/bpf.h | 5 +- > include/net/xtc.h | 14 ++++ > include/uapi/linux/bpf.h | 5 ++ > kernel/bpf/net.c | 116 ++++++++++++++++++++++++++++++--- > kernel/bpf/syscall.c | 3 + > tools/include/uapi/linux/bpf.h | 5 ++ > 6 files changed, 139 insertions(+), 9 deletions(-) > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > index 71e5f43db378..226a74f65704 100644 > --- a/include/linux/bpf.h > +++ b/include/linux/bpf.h > @@ -1473,7 +1473,10 @@ struct bpf_prog_array_item { > union { > struct bpf_cgroup_storage *cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE]; > u64 bpf_cookie; > - u32 bpf_priority; > + struct { > + u32 bpf_priority; > + u32 bpf_id; this is link_id, is that right? should we name it as such? > + }; > }; > }; > [...] > diff --git a/kernel/bpf/net.c b/kernel/bpf/net.c > index ab9a9dee615b..22b7a9b05483 100644 > --- a/kernel/bpf/net.c > +++ b/kernel/bpf/net.c > @@ -8,7 +8,7 @@ > #include <net/xtc.h> > > static int __xtc_prog_attach(struct net_device *dev, bool ingress, u32 limit, > - struct bpf_prog *nprog, u32 prio, u32 flags) > + u32 id, struct bpf_prog *nprog, u32 prio, u32 flags) similarly here, id -> link_id or something like that, it's quite confusing what kind of ID it is otherwise > { > struct bpf_prog_array_item *item, *tmp; > struct xtc_entry *entry, *peer; > @@ -27,10 +27,13 @@ static int __xtc_prog_attach(struct net_device *dev, bool ingress, u32 limit, > if (!oprog) > break; > if (item->bpf_priority == prio) { > - if (flags & BPF_F_REPLACE) { > + if (item->bpf_id == id && > + (flags & BPF_F_REPLACE)) { [...]