Re: [PATCH bpf-next 1/2] bpf: Allow top down cgroup prog ordering

On 2/11/25 2:57 PM, Andrii Nakryiko wrote:
> On Thu, Feb 6, 2025 at 3:00 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
>> Currently, for bpf progs in a cgroup hierarchy, the effective prog array
>> is computed from the bottom cgroup up to the upper cgroups. For example,
>> in the following cgroup hierarchy
>>      root cgroup: p1, p2
>>          subcgroup: p3, p4
>> with BPF_F_ALLOW_MULTI set at both cgroup levels, the effective prog
>> array ordering looks like
>>      p3 p4 p1 p2
>> and at run time the progs will execute in that order.
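
For reference, here is a minimal userspace sketch of the setup above using
libbpf's bpf_prog_attach_opts(); the prog fds passed in, the cgroup paths,
and the attach type are hypothetical, and error handling is omitted:

  #include <fcntl.h>
  #include <bpf/bpf.h>

  void attach_example(int p1_fd, int p2_fd, int p3_fd, int p4_fd)
  {
          LIBBPF_OPTS(bpf_prog_attach_opts, opts, .flags = BPF_F_ALLOW_MULTI);
          int root_fd = open("/sys/fs/cgroup", O_RDONLY);
          int sub_fd = open("/sys/fs/cgroup/sub", O_RDONLY);

          /* root cgroup: p1, p2 (attach order is kept within a level) */
          bpf_prog_attach_opts(p1_fd, root_fd, BPF_CGROUP_INET_EGRESS, &opts);
          bpf_prog_attach_opts(p2_fd, root_fd, BPF_CGROUP_INET_EGRESS, &opts);
          /* subcgroup: p3, p4 */
          bpf_prog_attach_opts(p3_fd, sub_fd, BPF_CGROUP_INET_EGRESS, &opts);
          bpf_prog_attach_opts(p4_fd, sub_fd, BPF_CGROUP_INET_EGRESS, &opts);

          /* for sockets in the subcgroup the effective run order is
           * p3 p4 p1 p2: subcgroup progs execute before root progs
           */
  }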

>> But in some cases, it is desirable to have a root prog execute earlier
>> than children's progs. For example,
>>    - prog p1 intends to collect original pkt dest addresses.
>>    - prog p3 will modify original pkt dest addresses to a proxy address
>>      for security reasons.
>> The end result is that prog p1 gets the proxy address, which is not what
>> it wants. Putting p1 in every child cgroup is not desirable either, as it
>> would be duplicated across many child cgroups. This is exactly a use case
>> we are encountering at Meta.

>> To fix this issue, let us introduce a flag BPF_F_PRIO_TOPDOWN. If the
>> flag is specified at attachment time, the prog has higher priority and
>> the ordering among progs with that flag will be from top to bottom. For
>> the above example,
>>      root cgroup: p1, p2
>>          subcgroup: p3, p4
>> let us say p1, p2 and p4 are marked with BPF_F_PRIO_TOPDOWN. The final
> I'm not a big fan of the PRIO_TOPDOWN naming, and this example just
> provides a further argument for why. Between the p3 and p4 programs in
> the subcgroup, there is no notion of TOPDOWN; they are at the same level
> of the hierarchy.
>
> In graphs, for DFS, the PRIO_TOPDOWN semantics corresponds to pre-order
> vs the (current and default) post-order. So why not something like
> BPF_F_PREORDER or some variation on that?

BPF_F_PREORDER sounds good to me.
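
To make the pre-order/post-order distinction concrete, here is a tiny
standalone sketch; the struct cg type, the emit() helper, and the prog
strings are invented for illustration and are not kernel code:

  #include <stdio.h>

  struct cg {
          struct cg *parent;
          const char *progs;      /* progs attached at this level */
  };

  static void emit(const struct cg *c, int preorder)
  {
          if (!c)
                  return;
          if (preorder) {
                  emit(c->parent, 1);             /* ancestors first */
                  printf("%s ", c->progs);
          } else {
                  printf("%s ", c->progs);        /* this level first */
                  emit(c->parent, 0);
          }
  }

  int main(void)
  {
          struct cg root = { NULL, "p1 p2" };
          struct cg sub = { &root, "p3 p4" };

          emit(&sub, 0);  /* post-order (current default): p3 p4 p1 p2 */
          printf("\n");
          emit(&sub, 1);  /* pre-order: p1 p2 p3 p4 */
          printf("\n");
          return 0;
  }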


> Also, for your example it would be nicer if p1 and p3 were the default
> post-order attachments, while p2 and p4 were pre-order. Then you'd have
> p2, p4, p3, p1, where everything is swapped relative to the original
> ordering ;)

Yes, will adjust the test for such a scenario!


>> effective array ordering will be
>>      p1 p2 p4 p3
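
One way to read the intended semantics is as a two-pass merge: flagged
progs are emitted top-down first, then the remaining progs keep the
existing bottom-up order. The sketch below is illustrative pseudologic
with invented types (struct prog, emit_preorder(), emit_postorder()),
not the kernel's compute_effective_progs(); it reproduces both orderings
discussed in this thread (p1 p2 p4 p3 here, and p2 p4 p3 p1 for the
swapped variant above):

  #include <stdio.h>

  struct prog { const char *name; int preorder; };

  struct cg {
          struct cg *parent;
          const struct prog *progs;
          int cnt;
  };

  /* pass 1: root -> leaf, flagged progs only */
  static void emit_preorder(const struct cg *c)
  {
          if (!c)
                  return;
          emit_preorder(c->parent);
          for (int i = 0; i < c->cnt; i++)
                  if (c->progs[i].preorder)
                          printf("%s ", c->progs[i].name);
  }

  /* pass 2: leaf -> root, unflagged progs */
  static void emit_postorder(const struct cg *c)
  {
          for (; c; c = c->parent)
                  for (int i = 0; i < c->cnt; i++)
                          if (!c->progs[i].preorder)
                                  printf("%s ", c->progs[i].name);
  }

  int main(void)
  {
          const struct prog root_progs[] = { { "p1", 1 }, { "p2", 1 } };
          const struct prog sub_progs[] = { { "p3", 0 }, { "p4", 1 } };
          struct cg root = { NULL, root_progs, 2 };
          struct cg sub = { &root, sub_progs, 2 };

          emit_preorder(&sub);    /* prints: p1 p2 p4 */
          emit_postorder(&sub);   /* prints: p3 */
          printf("\n");           /* full order: p1 p2 p4 p3 */
          return 0;
  }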

>> Suggested-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
>> Signed-off-by: Yonghong Song <yonghong.song@xxxxxxxxx>
>> ---
>>   include/linux/bpf-cgroup.h     |  1 +
>>   include/uapi/linux/bpf.h       |  1 +
>>   kernel/bpf/cgroup.c            | 37 +++++++++++++++++++++++++++++++---
>>   kernel/bpf/syscall.c           |  3 ++-
>>   tools/include/uapi/linux/bpf.h |  1 +
>>   5 files changed, 39 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
>> index 7fc69083e745..3d4f221df9ef 100644
>> --- a/include/linux/bpf-cgroup.h
>> +++ b/include/linux/bpf-cgroup.h
>> @@ -111,6 +111,7 @@ struct bpf_prog_list {
>>          struct bpf_prog *prog;
>>          struct bpf_cgroup_link *link;
>>          struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE];
>> +       bool is_prio_topdown;
> Let's go with `int flags`; we increase the size of struct
> bpf_prog_list by 8 bytes anyway, so let's make this a bit more
> generic?

Ack.
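
With that change the struct would become something like the following
(a sketch, assuming the `int flags` rename and the BPF_F_PREORDER
naming agreed above both land):

  struct bpf_prog_list {
          struct hlist_node node;
          struct bpf_prog *prog;
          struct bpf_cgroup_link *link;
          struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE];
          int flags;      /* attach flags of interest, e.g. BPF_F_PREORDER */
  };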


>>   };
>>
>>   int cgroup_bpf_inherit(struct cgroup *cgrp);
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index fff6cdb8d11a..7ae8e8751e78 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -1207,6 +1207,7 @@ enum bpf_perf_event_type {
>>   #define BPF_F_BEFORE           (1U << 3)
>>   #define BPF_F_AFTER            (1U << 4)
>>   #define BPF_F_ID               (1U << 5)
>> +#define BPF_F_PRIO_TOPDOWN     (1U << 6)
>>   #define BPF_F_LINK             BPF_F_LINK /* 1 << 13 */
>>
>>   /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>> index 46e5db65dbc8..f31250c6025b 100644
>> --- a/kernel/bpf/cgroup.c
>> +++ b/kernel/bpf/cgroup.c
>> @@ -382,6 +382,21 @@ static u32 prog_list_length(struct hlist_head *head)
>>          return cnt;
>>   }
>>
>> +static u32 prog_list_length_with_topdown_cnt(struct hlist_head *head, int *topdown_cnt)
> Instead of duplicating prog_list_length(), let's add this `int
> *topdown_cnt` counter as an optional argument, which prog_list_length()
> will fill out only if it's provided, i.e., you'll just have:
>
>     if (topdown_cnt && pl->is_prio_topdown)
>         (*topdown_cnt) += 1;
>
> as one extra condition inside the loop?

I thought about this as well. I created a new function since
prog_list_length() is used in several different places. But this
is not a critical path, so yes, I can just add one additional parameter
to prog_list_length() as you suggested.


>> +{
>> +       struct bpf_prog_list *pl;
>> +       u32 cnt = 0;
>> +
>> +       hlist_for_each_entry(pl, head, node) {
>> +               if (!prog_list_prog(pl))
>> +                       continue;
>> +               cnt++;
>> +               if (pl->is_prio_topdown)
>> +                       (*topdown_cnt) += 1;
>> +       }
>> +       return cnt;
>> +}
>> +
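
For concreteness, the merged helper might then look like this; a sketch
of the suggestion above, incorporating the `flags` field and the
BPF_F_PREORDER name agreed in this thread, not the committed code:

  static u32 prog_list_length(struct hlist_head *head, int *preorder_cnt)
  {
          struct bpf_prog_list *pl;
          u32 cnt = 0;

          hlist_for_each_entry(pl, head, node) {
                  if (!prog_list_prog(pl))
                          continue;
                  /* optionally count pre-order progs for the caller */
                  if (preorder_cnt && (pl->flags & BPF_F_PREORDER))
                          (*preorder_cnt)++;
                  cnt++;
          }
          return cnt;
  }

Existing callers that don't care about the count would simply pass NULL
for the new argument.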




