On Wed, Jul 27, 2022 at 10:49 PM Yonghong Song <yhs@xxxxxx> wrote:
>
>
> On 7/22/22 10:48 AM, Yosry Ahmed wrote:
> > From: Hao Luo <haoluo@xxxxxxxxxx>
> >
> > Cgroup_iter is a type of bpf_iter. It walks over cgroups in three modes:
> >
> >  - walking a cgroup's descendants in pre-order.
> >  - walking a cgroup's descendants in post-order.
> >  - walking a cgroup's ancestors.
> >
> > When attaching cgroup_iter, one can set a cgroup to the iter_link
> > created from attaching. This cgroup is passed as a file descriptor and
> > serves as the starting point of the walk. If no cgroup is specified,
> > the starting point will be the root cgroup.
> >
> > For walking descendants, one can specify the order: either pre-order or
> > post-order. For walking ancestors, the walk starts at the specified
> > cgroup and ends at the root.
> >
> > One can also terminate the walk early by returning 1 from the iter
> > program.
> >
> > Note that because walking the cgroup hierarchy holds cgroup_mutex, the
> > iter program is called with cgroup_mutex held.
> >
> > Currently only one session is supported, which means, depending on the
> > volume of data the bpf program intends to send to user space, the number
> > of cgroups that can be walked is limited. For example, given the current
> > buffer size is 8 * PAGE_SIZE, if the program sends 64B data for each
> > cgroup, the total number of cgroups that can be walked is 512. This is
>
> PAGE_SIZE needs to be 4KB in order to conclude that the total number of
> walked cgroups is 512.
>

Sure. Will change that.

> > a limitation of cgroup_iter. If the output data is larger than the
> > buffer size, the second read() will signal EOPNOTSUPP. To work around
> > this, the user may have to update their program to reduce the volume
> > of data sent to output, e.g. skip some uninteresting cgroups. In the
> > future, we may extend bpf_iter flags to allow customizing the buffer
> > size.
> >
> > Signed-off-by: Hao Luo <haoluo@xxxxxxxxxx>
> > Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > Acked-by: Yonghong Song <yhs@xxxxxx>
> > ---
> >  include/linux/bpf.h                           |   8 +
> >  include/uapi/linux/bpf.h                      |  30 +++
> >  kernel/bpf/Makefile                           |   3 +
> >  kernel/bpf/cgroup_iter.c                      | 252 ++++++++++++++++++
> >  tools/include/uapi/linux/bpf.h                |  30 +++
> >  .../selftests/bpf/prog_tests/btf_dump.c       |   4 +-
> >  6 files changed, 325 insertions(+), 2 deletions(-)
> >  create mode 100644 kernel/bpf/cgroup_iter.c
>
> This patch does not apply cleanly to bpf-next, so please rebase
> and post again.
>

Sorry about that. Will do.
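As an aside, for anyone following along: under the UAPI proposed in this
series, a minimal iter program could look roughly like the sketch below.
The "iter/cgroup" section name and the bpf_iter__cgroup context come from
this patch; the program body, the record format, and the cgrp->kn->id
access are illustrative only, not code from the series.

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>    /* BPF_SEQ_PRINTF */

SEC("iter/cgroup")
int dump_cgroup_ids(struct bpf_iter__cgroup *ctx)
{
        struct seq_file *seq = ctx->meta->seq;
        struct cgroup *cgrp = ctx->cgroup;

        /* A NULL cgroup means the walk has completed; emit an epilogue. */
        if (!cgrp) {
                BPF_SEQ_PRINTF(seq, "epilogue\n");
                return 0;
        }

        /* seq_num == 0 identifies the first element of the walk. */
        if (ctx->meta->seq_num == 0)
                BPF_SEQ_PRINTF(seq, "prologue\n");

        BPF_SEQ_PRINTF(seq, "cgroup id: %llu\n", cgrp->kn->id);

        /* Returning 1 here instead would terminate the walk early. */
        return 0;
}

char LICENSE[] SEC("license") = "GPL";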
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index a97751d845c9..9061618fe929 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -47,6 +47,7 @@ struct kobject;
> >  struct mem_cgroup;
> >  struct module;
> >  struct bpf_func_state;
> > +struct cgroup;
> >
> >  extern struct idr btf_idr;
> >  extern spinlock_t btf_idr_lock;
> > @@ -1717,7 +1718,14 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
> >      int __init bpf_iter_ ## target(args) { return 0; }
> >
> >  struct bpf_iter_aux_info {
> > +    /* for map_elem iter */
> >      struct bpf_map *map;
> > +
> > +    /* for cgroup iter */
> > +    struct {
> > +        struct cgroup *start; /* starting cgroup */
> > +        int order;
> > +    } cgroup;
> >  };
> >
> >  typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index ffcbf79a556b..fe50c2489350 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -87,10 +87,30 @@ struct bpf_cgroup_storage_key {
> >      __u32 attach_type; /* program attach type (enum bpf_attach_type) */
> >  };
> >
> > +enum bpf_iter_cgroup_traversal_order {
> > +    BPF_ITER_CGROUP_PRE = 0,   /* pre-order traversal */
> > +    BPF_ITER_CGROUP_POST,      /* post-order traversal */
> > +    BPF_ITER_CGROUP_PARENT_UP, /* traversal of ancestors up to the root */
> > +};
> > +
> >  union bpf_iter_link_info {
> >      struct {
> >          __u32 map_fd;
> >      } map;
> > +
> > +    /* cgroup_iter walks either the live descendants of a cgroup subtree, or the
> > +     * ancestors of a given cgroup.
> > +     */
> > +    struct {
> > +        /* Cgroup file descriptor. This is root of the subtree if walking
> > +         * descendants; it's the starting cgroup if walking the ancestors.
> > +         * If it is left 0, the traversal starts from the default cgroup v2
> > +         * root. For walking v1 hierarchy, one should always explicitly
> > +         * specify the cgroup_fd.
> > +         */
>
> I did see how the above cgroup v1/v2 scenarios are enforced.
>

Do you mean _not_ see?

Yosry and I experimented a bit. We found that even on systems where v2
is not enabled, the cgroup v2 root always exists, can be attached to,
and can be iterated on (only trivially). We didn't find a way to tell
v1 and v2 apart, so we deemed a comment instructing v1 users sufficient.

> > +        __u32 cgroup_fd;
> > +        __u32 traversal_order;
> > +    } cgroup;
> >  };
> >
> >  /* BPF syscall commands, see bpf(2) man-page for more details. */
> > @@ -6136,6 +6156,16 @@ struct bpf_link_info {
> >                  __u32 map_id;
> >              } map;
> >          };
> > +        union {
> > +            struct {
> > +                __u64 cgroup_id;
> > +                __u32 traversal_order;
> > +            } cgroup;
> > +        };
> > +        /* For new iters, if the first field is larger than __u32,
> > +         * the struct should be added in the second union. Otherwise,
> > +         * it will create holes before map_id, breaking uapi.
> > +         */
>
> Please put the comment above the union. Let us just say: if the
> iter-specific field is __u32, it can be put in the first or second
> union; otherwise, it is put in the second union.
>

Ok.
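To make the intended usage concrete: assuming the link_info layout above
lands as-is, attaching from userspace with libbpf might look like this
sketch. The helper name attach_cgroup_iter and the error handling are
mine; bpf_program__attach_iter() and bpf_iter_attach_opts are existing
libbpf API, while BPF_ITER_CGROUP_PRE is the enum proposed by this patch.

#include <fcntl.h>
#include <unistd.h>
#include <bpf/libbpf.h>

/* Attach prog as a cgroup_iter rooted at cgrp_path, walking the
 * subtree in pre-order. Returns the iter link, or NULL on failure.
 */
static struct bpf_link *attach_cgroup_iter(struct bpf_program *prog,
                                           const char *cgrp_path)
{
        DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
        union bpf_iter_link_info linfo = {};
        struct bpf_link *link;
        int cgrp_fd;

        cgrp_fd = open(cgrp_path, O_RDONLY);
        if (cgrp_fd < 0)
                return NULL;

        /* Leaving cgroup_fd as 0 would start from the cgroup v2 root. */
        linfo.cgroup.cgroup_fd = cgrp_fd;
        linfo.cgroup.traversal_order = BPF_ITER_CGROUP_PRE;
        opts.link_info = &linfo;
        opts.link_info_len = sizeof(linfo);

        link = bpf_program__attach_iter(prog, &opts);
        close(cgrp_fd);
        return link;
}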
> >      } iter;
> >      struct {
> >          __u32 netns_ino;
> > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> > index 057ba8e01e70..00e05b69a4df 100644
> > --- a/kernel/bpf/Makefile
> > +++ b/kernel/bpf/Makefile
> > @@ -24,6 +24,9 @@ endif
> >  ifeq ($(CONFIG_PERF_EVENTS),y)
> >  obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
> >  endif
> > +ifeq ($(CONFIG_CGROUPS),y)
> > +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o
> > +endif
> >  obj-$(CONFIG_CGROUP_BPF) += cgroup.o
> >  ifeq ($(CONFIG_INET),y)
> >  obj-$(CONFIG_BPF_SYSCALL) += reuseport_array.o
> > diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
> > new file mode 100644
> > index 000000000000..1027faed0b8b
> > --- /dev/null
> > +++ b/kernel/bpf/cgroup_iter.c
> > @@ -0,0 +1,252 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/* Copyright (c) 2022 Google */
> > +#include <linux/bpf.h>
> > +#include <linux/btf_ids.h>
> > +#include <linux/cgroup.h>
> > +#include <linux/kernel.h>
> > +#include <linux/seq_file.h>
> > +
> > +#include "../cgroup/cgroup-internal.h" /* cgroup_mutex and cgroup_is_dead */
> > +
> > +/* cgroup_iter provides three modes of traversal to the cgroup hierarchy.
> > + *
> > + *  1. Walk the descendants of a cgroup in pre-order.
> > + *  2. Walk the descendants of a cgroup in post-order.
> > + *  3. Walk the ancestors of a cgroup.
> > + *
> > + * For walking descendants, cgroup_iter can walk in either pre-order or
> > + * post-order. For walking ancestors, the iter walks up from a cgroup to
> > + * the root.
> > + *
> > + * The iter program can terminate the walk early by returning 1. The
> > + * walk continues if the prog returns 0.
> > + *
> > + * The prog can check (seq->num == 0) to determine whether this is
> > + * the first element. The prog may also be passed a NULL cgroup,
> > + * which means the walk has completed and the prog has a chance to
> > + * do post-processing, such as outputting an epilogue.
> > + *
> > + * Note: the iter_prog is called with cgroup_mutex held.
> > + *
> > + * Currently only one session is supported, which means, depending on the
> > + * volume of data the bpf program intends to send to user space, the number
> > + * of cgroups that can be walked is limited. For example, given the current
> > + * buffer size is 8 * PAGE_SIZE, if the program sends 64B data for each
> > + * cgroup, the total number of cgroups that can be walked is 512. This is
>
> Again, let us specify PAGE_SIZE = 4KB here.
>

Ok.
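For the record, the arithmetic with PAGE_SIZE = 4KB spelled out:

    buffer size = 8 * PAGE_SIZE = 8 * 4096B = 32768B
    max cgroups = 32768B / 64B per cgroup   = 512

i.e. a single read session can carry at most 512 such 64B records.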
> > + * a limitation of cgroup_iter. If the output data is larger than the
> > + * buffer size, the second read() will signal EOPNOTSUPP. To work around
> > + * this, the user may have to update their program to reduce the volume
> > + * of data sent to output, e.g. skip some uninteresting cgroups.
> > + */
> > +
> > +struct bpf_iter__cgroup {
> > +    __bpf_md_ptr(struct bpf_iter_meta *, meta);
> > +    __bpf_md_ptr(struct cgroup *, cgroup);
> > +};
> > +
> > +struct cgroup_iter_priv {
> > +    struct cgroup_subsys_state *start_css;
> > +    bool terminate;
> > +    int order;
> > +};
> > +
> > +static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
> > +{
> > +    struct cgroup_iter_priv *p = seq->private;
> > +
> > +    mutex_lock(&cgroup_mutex);
> > +
> > +    /* cgroup_iter doesn't support read across multiple sessions. */
> > +    if (*pos > 0)
> > +        return ERR_PTR(-EOPNOTSUPP);
>
> This is not quite right. Let us say the number of cgroups is 1. After
> the bpf program runs, pos = 1 and control returns to user space. Now
> the second read() will return -EOPNOTSUPP, which is not right.
> -EOPNOTSUPP should be returned ONLY if the previous iteration did not
> traverse all cgroups. So you might need to record additional state in
> cgroup_iter_priv to track this.
>

Oh, I missed this scenario. Then we need a flag in cgroup_iter_priv
indicating whether an iteration is still in progress; only when the
iteration hasn't ended do we return -EOPNOTSUPP. (See the sketch at the
end of this mail.)

> > +
> > +    ++*pos;
> > +    p->terminate = false;
> > +    if (p->order == BPF_ITER_CGROUP_PRE)
> > +        return css_next_descendant_pre(NULL, p->start_css);
> > +    else if (p->order == BPF_ITER_CGROUP_POST)
> > +        return css_next_descendant_post(NULL, p->start_css);
> > +    else /* BPF_ITER_CGROUP_PARENT_UP */
> > +        return p->start_css;
> > +}
> > +
> > +static int __cgroup_iter_seq_show(struct seq_file *seq,
> > +                                  struct cgroup_subsys_state *css, int in_stop);
> > +
> > +static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
> > +{
> > +    /* pass NULL to the prog for post-processing */
> > +    if (!v)
> > +        __cgroup_iter_seq_show(seq, NULL, true);
> > +    mutex_unlock(&cgroup_mutex);
> > +}
> > +
> > +static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> > +{
> > +    struct cgroup_subsys_state *curr = (struct cgroup_subsys_state *)v;
> > +    struct cgroup_iter_priv *p = seq->private;
> > +
> > +    ++*pos;
> > +    if (p->terminate)
> > +        return NULL;
> > +
> > +    if (p->order == BPF_ITER_CGROUP_PRE)
> > +        return css_next_descendant_pre(curr, p->start_css);
> > +    else if (p->order == BPF_ITER_CGROUP_POST)
> > +        return css_next_descendant_post(curr, p->start_css);
> > +    else
> > +        return curr->parent;
> > +}
> > +
> [...]
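Concretely, the direction suggested above might look like the following
sketch. The visited_all field name is hypothetical and this is not code
from the patch; it only illustrates one way to return -EOPNOTSUPP solely
when the previous session stopped early.

struct cgroup_iter_priv {
        struct cgroup_subsys_state *start_css;
        bool visited_all; /* did the previous session walk to the end? */
        bool terminate;
        int order;
};

static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
{
        struct cgroup_iter_priv *p = seq->private;

        mutex_lock(&cgroup_mutex);

        /* A second read() is an error only if the previous session
         * stopped before traversing all cgroups (e.g. buffer full).
         */
        if (*pos > 0) {
                if (p->visited_all)
                        return NULL; /* clean EOF */
                return ERR_PTR(-EOPNOTSUPP);
        }

        ++*pos;
        p->terminate = false;
        p->visited_all = false;
        if (p->order == BPF_ITER_CGROUP_PRE)
                return css_next_descendant_pre(NULL, p->start_css);
        else if (p->order == BPF_ITER_CGROUP_POST)
                return css_next_descendant_post(NULL, p->start_css);
        else /* BPF_ITER_CGROUP_PARENT_UP */
                return p->start_css;
}

static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
{
        struct cgroup_iter_priv *p = seq->private;

        /* NULL v means the traversal finished (or the prog terminated
         * it), so mark the session complete; the guard also avoids
         * calling the epilogue twice on a repeated read().
         */
        if (!v && !p->visited_all) {
                __cgroup_iter_seq_show(seq, NULL, true);
                p->visited_all = true;
        }
        mutex_unlock(&cgroup_mutex);
}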