On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote:
>
> This commit introduces a bpf_link based kernel API for creating tc
> filters and using the cls_bpf classifier. Only a subset of what the
> netlink API offers is supported; things like TCA_BPF_POLICE, TCA_RATE,
> and embedded actions are unsupported.
>
> The kernel API and the libbpf wrapper added in a subsequent patch are
> more opinionated and mirror the semantics of the low-level netlink
> based TC-BPF API, i.e. always setting direct action mode, always
> setting protocol to ETH_P_ALL, and only exposing handle and priority
> as the variables the user can control. We add an additional gen_flags
> parameter, though, to allow for offloading use cases. It would be
> trivial to extend the current API to support specifying other
> attributes in the future, but for now I'm sticking to how we want to
> push usage.
>
> The semantics around bpf_link support are as follows:
>
> A user can create a classifier attached to a filter using the bpf_link
> API, after which changing it and deleting it only happens through the
> bpf_link API. It is not possible to bind the bpf_link to an existing
> filter, and any such attempt will fail with EEXIST. Hence EEXIST can
> be returned in two cases: when an existing bpf_link owned filter
> exists, or when an existing netlink owned filter exists.
>
> Removing a bpf_link owned filter from netlink returns EPERM, denoting
> that netlink is locked out from filter manipulation when bpf_link is
> involved.
>
> Whenever a filter is detached due to chain removal, qdisc teardown,
> or net_device shutdown, the bpf_link becomes automatically detached.
>
> In this way, the netlink API and the bpf_link creation path are
> exclusive and don't stomp over one another. Filters created using the
> bpf_link API cannot be replaced by the netlink API, and filters
> created by the netlink API are never replaced by bpf_link. Netlink
> also cannot detach bpf_link owned filters.
>
> We serialize all changes over rtnl_lock as cls_bpf doesn't support
> the unlocked classifier API.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx>
> ---
>  include/linux/bpf_types.h |   3 +
>  include/net/pkt_cls.h     |  13 ++
>  include/net/sch_generic.h |   6 +-
>  include/uapi/linux/bpf.h  |  15 +++
>  kernel/bpf/syscall.c      |  10 +-
>  net/sched/cls_api.c       | 138 ++++++++++++++++++++-
>  net/sched/cls_bpf.c       | 247 +++++++++++++++++++++++++++++++++++++-
>  7 files changed, 426 insertions(+), 6 deletions(-)
>

[...]

> +static int cls_bpf_link_change(struct net *net, struct tcf_proto *tp,
> +                               struct bpf_prog *filter, void **arg,
> +                               u32 handle, u32 gen_flags)
> +{
> +        struct cls_bpf_head *head = rtnl_dereference(tp->root);
> +        struct cls_bpf_prog *oldprog = *arg, *prog;
> +        struct bpf_link_primer primer;
> +        struct cls_bpf_link *link;
> +        int ret;
> +
> +        if (gen_flags & ~CLS_BPF_SUPPORTED_GEN_FLAGS)
> +                return -EINVAL;
> +
> +        if (oldprog)
> +                return -EEXIST;
> +
> +        prog = kzalloc(sizeof(*prog), GFP_KERNEL);
> +        if (!prog)
> +                return -ENOMEM;
> +
> +        link = kzalloc(sizeof(*link), GFP_KERNEL);
> +        if (!link) {
> +                ret = -ENOMEM;
> +                goto err_prog;
> +        }
> +
> +        bpf_link_init(&link->link, BPF_LINK_TYPE_TC, &cls_bpf_link_ops,
> +                      filter);
> +
> +        ret = bpf_link_prime(&link->link, &primer);
> +        if (ret < 0)
> +                goto err_link;
> +
> +        /* We don't init exts to save on memory, but we still need to store the
> +         * net_ns pointer, as during delete whether the deletion work will be
> +         * queued or executed inline depends on the refcount of net_ns. In
> +         * __cls_bpf_delete the reference is taken to keep the action IDR alive
> +         * (which we don't require), but its maybe_get_net also allows us to
> +         * detect whether we are being invoked in the netns destruction path or
> +         * not. In the former case deletion will have to be done synchronously.
> +         *
> +         * Leaving it NULL would prevent us from doing the deletion work
> +         * asynchronously, so set it here.
> +         *
> +         * On the tcf_classify side, exts->actions are not touched for
> +         * exts_integrated progs, so we should be good.
> +         */
> +        prog->exts.net = net;
> +
> +        ret = __cls_bpf_alloc_idr(head, handle, prog, oldprog);
> +        if (ret < 0)
> +                goto err_primer;
> +
> +        prog->exts_integrated = true;
> +        prog->bpf_link = link;
> +        prog->filter = filter;
> +        prog->tp = tp;
> +        link->prog = prog;
> +
> +        prog->bpf_name = cls_bpf_link_name(filter->aux->id, filter->aux->name);
> +        if (!prog->bpf_name) {
> +                ret = -ENOMEM;
> +                goto err_idr;
> +        }
> +
> +        ret = __cls_bpf_change(head, tp, prog, oldprog, NULL);
> +        if (ret < 0)
> +                goto err_name;
> +
> +        bpf_prog_inc(filter);
> +
> +        if (filter->dst_needed)
> +                tcf_block_netif_keep_dst(tp->chain->block);
> +
> +        return bpf_link_settle(&primer);
> +
> +err_name:
> +        kfree(prog->bpf_name);
> +err_idr:
> +        idr_remove(&head->handle_idr, prog->handle);
> +err_primer:
> +        bpf_link_cleanup(&primer);

Once you prime the link, you can't kfree() it anymore; you only call
bpf_link_cleanup() and it will take care of eventually freeing the
link. So if you look at other places doing bpf_link, they set
link = NULL after bpf_link_cleanup() to avoid the direct kfree(). See
the sketch below the quoted patch.

> +err_link:
> +        kfree(link);
> +err_prog:
> +        kfree(prog);
> +        return ret;
> +}
> +
>  static struct tcf_proto_ops cls_bpf_ops __read_mostly = {
>          .kind           =       "bpf",
>          .owner          =       THIS_MODULE,
> @@ -729,6 +973,7 @@ static struct tcf_proto_ops cls_bpf_ops __read_mostly = {
>          .reoffload      =       cls_bpf_reoffload,
>          .dump           =       cls_bpf_dump,
>          .bind_class     =       cls_bpf_bind_class,
> +        .bpf_link_change =      cls_bpf_link_change,
>  };
>
>  static int __init cls_bpf_init_mod(void)
> --
> 2.31.1
>
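
To make it concrete, the tail of cls_bpf_link_change() could look
roughly like this (completely untested sketch, just to illustrate the
link = NULL pattern; it assumes cls_bpf_link_ops has a dealloc callback
that kfree()s the link, since a primed link is released through the
deferred bpf_link_free() -> link->ops->dealloc() path):

err_name:
        kfree(prog->bpf_name);
err_idr:
        idr_remove(&head->handle_idr, prog->handle);
err_primer:
        bpf_link_cleanup(&primer);
        /* a primed link is freed by the bpf_link infra through
         * link->ops->dealloc(), so it must not be kfree()'d directly
         */
        link = NULL;
err_link:
        kfree(link);    /* no-op once the link has been primed */
err_prog:
        kfree(prog);
        return ret;

This keeps the single fall-through unwinding, while kfree(link) only
ever runs for a link that was never primed.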