Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi
<memxor@xxxxxxxxx> wrote:
>
> This is the first RFC version.
>
> This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and
> introduces fd based ownership for such TC filters. Netlink cannot delete or
> replace such filters, but the bpf_link is severed on indirect destruction of the
> filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure
> that filters remain attached beyond process lifetime, the usual bpf_link fd
> pinning approach can be used.
>
> The individual patches contain more details and comments, but the overall kernel
> API and libbpf helper mirrors the semantics of the netlink based TC-BPF API
> merged recently. This means that we start by always setting direct action mode,
> protocol to ETH_P_ALL, chain_index as 0, etc. If there is a need for more
> options in the future, they can be easily exposed through the bpf_link API in
> the future.
>
> Patch 1 refactors cls_bpf change function to extract two helpers that will be
> reused in bpf_link creation.
>
> Patch 2 exports some bpf_link management functions to modules. This is needed
> because our bpf_link object is tied to the cls_bpf_prog object. Tying it to
> tcf_proto would be weird, because the update path has to replace offloaded bpf
> prog, which happens using internal cls_bpf helpers, and would in general be more
> code to abstract over an operation that is unlikely to be implemented for other
> filter types.
>
> Patch 3 adds the main bpf_link API. A function in cls_api takes care of
> obtaining block reference, creating the filter object, and then calls the
> bpf_link_change tcf_proto op (only supported by cls_bpf) that returns a fd after
> setting up the internal structures. An optimization is made to not keep around
> resources for extended actions, which is explained in a code comment as it wasn't
> immediately obvious.
>
> Patch 4 adds an update path for bpf_link. Since bpf_link_update only supports
> replacing the bpf_prog, we can skip tc filter's change path by reusing the
> filter object but swapping its bpf_prog. This takes care of replacing the
> offloaded prog as well (if that fails, update is aborted). So far however,
> tcf_classify could do normal load (possibly torn) as the cls_bpf_prog->filter
> would never be modified concurrently. This is no longer true, and to not
> penalize the classify hot path, we also cannot impose serialization around
> its load. Hence the load is changed to READ_ONCE, so that the pointer value is
> always consistent. Due to invocation in a RCU critical section, the lifetime of
> the prog is guaranteed for the duration of the call.
>
> Patch 5, 6 take care of updating the userspace bits and add a bpf_link returning
> function to libbpf.
>
> Patch 7 adds a selftest that exercises all possible problematic interactions
> that I could think of.
>
> Design:
>
> This is where in the object hierarchy our bpf_link object is attached.
>
>                                                                             ┌─────┐
>                                                                             │     │
>                                                                             │ BPF │
>                                                                             program
>                                                                             │     │
>                                                                             └──▲──┘
>                                                       ┌───────┐                │
>                                                       │       │         ┌──────┴───────┐
>                                                       │  mod  ├─────────► cls_bpf_prog │
> ┌────────────────┐                                    │cls_bpf│         └────┬───▲─────┘
> │    tcf_block   │                                    │       │              │   │
> └────────┬───────┘                                    └───▲───┘              │   │
>          │          ┌─────────────┐                       │                ┌─▼───┴──┐
>          └──────────►  tcf_chain  │                       │                │bpf_link│
>                     └───────┬─────┘                       │                └────────┘
>                             │          ┌─────────────┐    │
>                             └──────────►  tcf_proto  ├────┘
>                                        └─────────────┘
>
> The bpf_link is detached on destruction of the cls_bpf_prog.  Doing it this way
> allows us to implement update in a lightweight manner without having to recreate
> a new filter, where we can just replace the BPF prog attached to cls_bpf_prog.
>
> The other way to do it would be to link the bpf_link to tcf_proto, there are
> numerous downsides to this:
>
> 1. All filters have to embed the pointer even though they won't be using it when
> cls_bpf is compiled in.
> 2. This probably won't make sense to be extended to other filter types anyway.
> 3. We aren't able to optimize the update case without adding another bpf_link
> specific update operation to tcf_proto ops.
>
> The downside with tying this to the module is having to export bpf_link
> management functions and introducing a tcf_proto op. Hopefully the cost of
> another operation func pointer is not big enough (as there is only one ops
> struct per module).
>
> This first version is to collect feedback on the approach and get ideas if there
> is a better way to do this.

Bpf_link-based TC API is a long time coming, so it's great to see
someone finally working on this. Thanks!

I briefly skimmed through the patch set, noticed a few generic
bpf_link problems. But I think main feedback will come from Cilium
folks and others that heavily rely on TC APIs. I wonder if there is an
opportunity to simplify the API further given we have a new
opportunity here. I don't think we are constrained to follow legacy TC
API exactly.

The problem is that your patch set was marked as spam by Google, so I
suspect a bunch of folks haven't gotten it. I suggest re-sending it
again but trimming down the CC list, leaving only bpf@vger,
netdev@vger, and BPF maintainers CC'ed directly.

>
> Kumar Kartikeya Dwivedi (7):
>   net: sched: refactor cls_bpf creation code
>   bpf: export bpf_link functions for modules
>   net: sched: add bpf_link API for bpf classifier
>   net: sched: add lightweight update path for cls_bpf
>   tools: bpf.h: sync with kernel sources
>   libbpf: add bpf_link based TC-BPF management API
>   libbpf: add selftest for bpf_link based TC-BPF management API
>
>  include/linux/bpf_types.h                     |   3 +
>  include/net/pkt_cls.h                         |  13 +
>  include/net/sch_generic.h                     |   6 +-
>  include/uapi/linux/bpf.h                      |  15 +
>  kernel/bpf/syscall.c                          |  14 +-
>  net/sched/cls_api.c                           | 138 ++++++-
>  net/sched/cls_bpf.c                           | 386 ++++++++++++++++--
>  tools/include/uapi/linux/bpf.h                |  15 +
>  tools/lib/bpf/bpf.c                           |   5 +
>  tools/lib/bpf/bpf.h                           |   8 +-
>  tools/lib/bpf/libbpf.c                        |  59 ++-
>  tools/lib/bpf/libbpf.h                        |  17 +
>  tools/lib/bpf/libbpf.map                      |   1 +
>  tools/lib/bpf/netlink.c                       |   5 +-
>  tools/lib/bpf/netlink.h                       |   8 +
>  .../selftests/bpf/prog_tests/tc_bpf_link.c    | 285 +++++++++++++
>  16 files changed, 934 insertions(+), 44 deletions(-)
>  create mode 100644 tools/lib/bpf/netlink.h
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_bpf_link.c
>
> --
> 2.31.1
>




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux