On Fri, Jun 10, 2022 at 5:58 AM Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote:
>
> On Fri, Jun 10, 2022 at 05:54:27AM IST, Joanne Koong wrote:
> > On Thu, Jun 3, 2021 at 11:31 PM Kumar Kartikeya Dwivedi
> > <memxor@xxxxxxxxx> wrote:
> > >
> > > This is the second (non-RFC) version.
> > >
> > > This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and
> > > introduces fd based ownership for such TC filters. Netlink cannot delete or
> > > replace such filters, but the bpf_link is severed on indirect destruction of the
> > > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure
> > > that filters remain attached beyond process lifetime, the usual bpf_link fd
> > > pinning approach can be used.
> > >
> > > The individual patches contain more details and comments, but the overall kernel
> > > API and libbpf helper mirrors the semantics of the netlink based TC-BPF API
> > > merged recently. This means that we start by always setting direct action mode,
> > > protocol to ETH_P_ALL, chain_index as 0, etc. If there is a need for more
> > > options in the future, they can be easily exposed through the bpf_link API in
> > > the future.
> > >
> > > Patch 1 refactors cls_bpf change function to extract two helpers that will be
> > > reused in bpf_link creation.
> > >
> > > Patch 2 exports some bpf_link management functions to modules. This is needed
> > > because our bpf_link object is tied to the cls_bpf_prog object. Tying it to
> > > tcf_proto would be weird, because the update path has to replace offloaded bpf
> > > prog, which happens using internal cls_bpf helpers, and would in general be more
> > > code to abstract over an operation that is unlikely to be implemented for other
> > > filter types.
> > >
> > > Patch 3 adds the main bpf_link API.
> > > A function in cls_api takes care of
> > > obtaining block reference, creating the filter object, and then calls the
> > > bpf_link_change tcf_proto op (only supported by cls_bpf) that returns a fd after
> > > setting up the internal structures. An optimization is made to not keep around
> > > resources for extended actions, which is explained in a code comment as it wasn't
> > > immediately obvious.
> > >
> > > Patch 4 adds an update path for bpf_link. Since bpf_link_update only supports
> > > replacing the bpf_prog, we can skip tc filter's change path by reusing the
> > > filter object but swapping its bpf_prog. This takes care of replacing the
> > > offloaded prog as well (if that fails, update is aborted). So far however,
> > > tcf_classify could do normal load (possibly torn) as the cls_bpf_prog->filter
> > > would never be modified concurrently. This is no longer true, and to not
> > > penalize the classify hot path, we also cannot impose serialization around
> > > its load. Hence the load is changed to READ_ONCE, so that the pointer value is
> > > always consistent. Due to invocation in a RCU critical section, the lifetime of
> > > the prog is guaranteed for the duration of the call.
> > >
> > > Patch 5, 6 take care of updating the userspace bits and add a bpf_link returning
> > > function to libbpf.
> > >
> > > Patch 7 adds a selftest that exercises all possible problematic interactions
> > > that I could think of.
> > >
> > > Design:
> > >
> > > This is where in the object hierarchy our bpf_link object is attached.
> > >
> > >                                                                             ┌─────┐
> > >                                                                             │     │
> > >                                                                             │ BPF │
> > >                                                                             program
> > >                                                                             │     │
> > >                                                                             └──▲──┘
> > >                                                       ┌───────┐                │
> > >                                                       │       │         ┌──────┴───────┐
> > >                                                       │  mod  ├─────────► cls_bpf_prog │
> > > ┌────────────────┐                                    │cls_bpf│         └────┬───▲─────┘
> > > │    tcf_block   │                                    │       │              │   │
> > > └────────┬───────┘                                    └───▲───┘              │   │
> > >          │          ┌─────────────┐                       │                ┌─▼───┴──┐
> > >          └──────────►  tcf_chain  │                       │                │bpf_link│
> > >                     └───────┬─────┘                       │                └────────┘
> > >                             │          ┌─────────────┐    │
> > >                             └──────────►  tcf_proto  ├────┘
> > >                                        └─────────────┘
> > >
> > > The bpf_link is detached on destruction of the cls_bpf_prog. Doing it this way
> > > allows us to implement update in a lightweight manner without having to recreate
> > > a new filter, where we can just replace the BPF prog attached to cls_bpf_prog.
> > >
> > > The other way to do it would be to link the bpf_link to tcf_proto, there are
> > > numerous downsides to this:
> > >
> > > 1. All filters have to embed the pointer even though they won't be using it when
> > > cls_bpf is compiled in.
> > > 2. This probably won't make sense to be extended to other filter types anyway.
> > > 3. We aren't able to optimize the update case without adding another bpf_link
> > > specific update operation to tcf_proto ops.
> > >
> > > The downside with tying this to the module is having to export bpf_link
> > > management functions and introducing a tcf_proto op. Hopefully the cost of
> > > another operation func pointer is not big enough (as there is only one ops
> > > struct per module).
> > >
> > Hi Kumar,
> >
> > Do you have any plans / bandwidth to land this feature upstream? If
> > so, do you have a tentative estimation for when you'll be able to work
> > on this? And if not, are you okay with someone else working on this to
> > get it merged in?
> >
>
> I can have a look at resurrecting it later this month, if you're ok with waiting
> until then, otherwise if someone else wants to pick this up before that it's
> fine by me, just let me know so we avoid duplicated effort.
> Note that the approach in v2 is dead/unlikely to get accepted by the TC
> maintainers, so we'd have to implement the way Daniel mentioned in [0].

Sounds great! We'll wait and check back in with you later this month.

>
> [0]: https://lore.kernel.org/bpf/15cd0a9c-95a1-9766-fca1-4bf9d09e4100@xxxxxxxxxxxxx
>
> > The reason I'm asking is because there are a few networking teams
> > within Meta that have been requesting this feature :)
> >
> > Thanks,
> > Joanne
> >
> > > Changelog:
> > > ----------
> > > v1 (RFC) -> v2
> > > v1: https://lore.kernel.org/bpf/20210528195946.2375109-1-memxor@xxxxxxxxx
> > >
> > >  * Avoid overwriting other members of union in bpf_attr (Andrii)
> > >  * Set link to NULL after bpf_link_cleanup to avoid double free (Andrii)
> > >  * Use __be16 to store the result of htons (Kernel Test Robot)
> > >  * Make assignment of tcf_exts::net conditional on CONFIG_NET_CLS_ACT
> > >    (Kernel Test Robot)
> > >
> > > Kumar Kartikeya Dwivedi (7):
> > >   net: sched: refactor cls_bpf creation code
> > >   bpf: export bpf_link functions for modules
> > >   net: sched: add bpf_link API for bpf classifier
> > >   net: sched: add lightweight update path for cls_bpf
> > >   tools: bpf.h: sync with kernel sources
> > >   libbpf: add bpf_link based TC-BPF management API
> > >   libbpf: add selftest for bpf_link based TC-BPF management API
> > >
> > >  include/linux/bpf_types.h                  |   3 +
> > >  include/net/pkt_cls.h                      |  13 +
> > >  include/net/sch_generic.h                  |   6 +-
> > >  include/uapi/linux/bpf.h                   |  15 +
> > >  kernel/bpf/syscall.c                       |  14 +-
> > >  net/sched/cls_api.c                        | 139 ++++++-
> > >  net/sched/cls_bpf.c                        | 389 ++++++++++++++++--
> > >  tools/include/uapi/linux/bpf.h             |  15 +
> > >  tools/lib/bpf/bpf.c                        |   8 +-
> > >  tools/lib/bpf/bpf.h                        |   8 +-
> > >  tools/lib/bpf/libbpf.c                     |  59 ++-
> > >  tools/lib/bpf/libbpf.h                     |  17 +
> > >  tools/lib/bpf/libbpf.map                   |   1 +
> > >  tools/lib/bpf/netlink.c                    |   5 +-
> > >  tools/lib/bpf/netlink.h                    |   8 +
> > >  .../selftests/bpf/prog_tests/tc_bpf_link.c | 285 +++++++++++++
> > >  16 files changed, 940 insertions(+), 45 deletions(-)
> > >  create mode 100644 tools/lib/bpf/netlink.h
> > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_bpf_link.c
> > >
> > > --
> > > 2.31.1
> > >
>
> --
> Kartikeya