On Thu, Jun 03, 2021 at 02:39:15AM IST, Andrii Nakryiko wrote: > On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi > <memxor@xxxxxxxxx> wrote: > > > > This is the first RFC version. > > > > This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and > > introduces fd based ownership for such TC filters. Netlink cannot delete or > > replace such filters, but the bpf_link is severed on indirect destruction of the > > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure > > that filters remain attached beyond process lifetime, the usual bpf_link fd > > pinning approach can be used. > > > > The individual patches contain more details and comments, but the overall kernel > > API and libbpf helper mirrors the semantics of the netlink based TC-BPF API > > merged recently. This means that we start by always setting direct action mode, > > protocol to ETH_P_ALL, chain_index as 0, etc. If there is a need for more > > options in the future, they can be easily exposed through the bpf_link API in > > the future. > > > > Patch 1 refactors cls_bpf change function to extract two helpers that will be > > reused in bpf_link creation. > > > > Patch 2 exports some bpf_link management functions to modules. This is needed > > because our bpf_link object is tied to the cls_bpf_prog object. Tying it to > > tcf_proto would be weird, because the update path has to replace offloaded bpf > > prog, which happens using internal cls_bpf helpers, and would in general be more > > code to abstract over an operation that is unlikely to be implemented for other > > filter types. > > > > Patch 3 adds the main bpf_link API. A function in cls_api takes care of > > obtaining block reference, creating the filter object, and then calls the > > bpf_link_change tcf_proto op (only supported by cls_bpf) that returns a fd after > > setting up the internal structures. An optimization is made to not keep around > > resources for extended actions, which is explained in a code comment as it wasn't > > immediately obvious. > > > > Patch 4 adds an update path for bpf_link. Since bpf_link_update only supports > > replacing the bpf_prog, we can skip tc filter's change path by reusing the > > filter object but swapping its bpf_prog. This takes care of replacing the > > offloaded prog as well (if that fails, update is aborted). So far however, > > tcf_classify could do normal load (possibly torn) as the cls_bpf_prog->filter > > would never be modified concurrently. This is no longer true, and to not > > penalize the classify hot path, we also cannot impose serialization around > > its load. Hence the load is changed to READ_ONCE, so that the pointer value is > > always consistent. Due to invocation in a RCU critical section, the lifetime of > > the prog is guaranteed for the duration of the call. > > > > Patch 5, 6 take care of updating the userspace bits and add a bpf_link returning > > function to libbpf. > > > > Patch 7 adds a selftest that exercises all possible problematic interactions > > that I could think of. > > > > Design: > > > > This is where in the object hierarchy our bpf_link object is attached. > > > > ┌─────┐ > > │ │ > > │ BPF │ > > program > > │ │ > > └──▲──┘ > > ┌───────┐ │ > > │ │ ┌──────┴───────┐ > > │ mod ├─────────► cls_bpf_prog │ > > ┌────────────────┐ │cls_bpf│ └────┬───▲─────┘ > > │ tcf_block │ │ │ │ │ > > └────────┬───────┘ └───▲───┘ │ │ > > │ ┌─────────────┐ │ ┌─▼───┴──┐ > > └──────────► tcf_chain │ │ │bpf_link│ > > └───────┬─────┘ │ └────────┘ > > │ ┌─────────────┐ │ > > └──────────► tcf_proto ├────┘ > > └─────────────┘ > > > > The bpf_link is detached on destruction of the cls_bpf_prog. Doing it this way > > allows us to implement update in a lightweight manner without having to recreate > > a new filter, where we can just replace the BPF prog attached to cls_bpf_prog. > > > > The other way to do it would be to link the bpf_link to tcf_proto, there are > > numerous downsides to this: > > > > 1. All filters have to embed the pointer even though they won't be using it when > > cls_bpf is compiled in. > > 2. This probably won't make sense to be extended to other filter types anyway. > > 3. We aren't able to optimize the update case without adding another bpf_link > > specific update operation to tcf_proto ops. > > > > The downside with tying this to the module is having to export bpf_link > > management functions and introducing a tcf_proto op. Hopefully the cost of > > another operation func pointer is not big enough (as there is only one ops > > struct per module). > > > > This first version is to collect feedback on the approach and get ideas if there > > is a better way to do this. > > Bpf_link-based TC API is a long time coming, so it's great to see > someone finally working on this. Thanks! > > I briefly skimmed through the patch set, noticed a few generic > bpf_link problems. But I think main feedback will come from Cilium Thanks for the review. I'll fix both of these in the resend (also have a couple of private reports from the kernel test robot). > folks and others that heavily rely on TC APIs. I wonder if there is an > opportunity to simplify the API further given we have a new > opportunity here. I don't think we are constrained to follow legacy TC > API exactly. > I tried to keep it simple by going for the defaults we agreed upon for the new netlink based libbpf API, and always setting direct action mode, and it's still in a position to be extended in the future to allow full TC filter setup like netlink does, if someone ever happens to need that. As for the implementation, I did notice that there has been discussion around this (though I could only find [0]) but I think doing it the way this patch does is more flexible as you can attach the bpf filter to an aribitrary parent/class, not just ingress and egress, and it can coexist with a conventional TC setup. [0]: https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html ("Note that there is ongoing work ...") > The problem is that your patch set was marked as spam by Google, so I > suspect a bunch of folks haven't gotten it. I suggest re-sending it > again but trimming down the CC list, leaving only bpf@vger, > netdev@vger, and BPF maintainers CC'ed directly. > Thanks for the heads up, I'll resend tomorrow. -- Kartikeya