Re: [PATCH bpf-next v5 4/8] bpf: Introduce cgroup iter

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Tue, 2 Aug 2022 15:49:54 -0700

On Tue, Aug 2, 2022 at 3:27 PM Hao Luo <haoluo@xxxxxxxxxx> wrote:
>
> Hi Andrii,
>
> On Mon, Aug 1, 2022 at 8:43 PM Andrii Nakryiko
> <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > On Fri, Jul 22, 2022 at 10:48 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > >
> > > From: Hao Luo <haoluo@xxxxxxxxxx>
> > >
> > > Cgroup_iter is a type of bpf_iter. It walks over cgroups in three modes:
> > >
> > >  - walking a cgroup's descendants in pre-order.
> > >  - walking a cgroup's descendants in post-order.
> > >  - walking a cgroup's ancestors.
> > >
> > > When attaching cgroup_iter, one can set a cgroup to the iter_link
> > > created from attaching. This cgroup is passed as a file descriptor and
> > > serves as the starting point of the walk. If no cgroup is specified,
> > > the starting point will be the root cgroup.
> > >
> > > For walking descendants, one can specify the order: either pre-order or
> > > post-order. For walking ancestors, the walk starts at the specified
> > > cgroup and ends at the root.
> > >
> > > One can also terminate the walk early by returning 1 from the iter
> > > program.
> > >
> > > Note that because walking cgroup hierarchy holds cgroup_mutex, the iter
> > > program is called with cgroup_mutex held.
> > >
> > > Currently only one session is supported, which means, depending on the
> > > volume of data bpf program intends to send to user space, the number
> > > of cgroups that can be walked is limited. For example, given the current
> > > buffer size is 8 * PAGE_SIZE, if the program sends 64B data for each
> > > cgroup, the total number of cgroups that can be walked is 512. This is
> > > a limitation of cgroup_iter. If the output data is larger than the
> > > buffer size, the second read() will signal EOPNOTSUPP. In order to work
> > > around, the user may have to update their program to reduce the volume
> > > of data sent to output. For example, skip some uninteresting cgroups.
> > > In future, we may extend bpf_iter flags to allow customizing buffer
> > > size.
> > >
> > > Signed-off-by: Hao Luo <haoluo@xxxxxxxxxx>
> > > Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > > Acked-by: Yonghong Song <yhs@xxxxxx>
> > > ---
> > >  include/linux/bpf.h                           |   8 +
> > >  include/uapi/linux/bpf.h                      |  30 +++
> > >  kernel/bpf/Makefile                           |   3 +
> > >  kernel/bpf/cgroup_iter.c                      | 252 ++++++++++++++++++
> > >  tools/include/uapi/linux/bpf.h                |  30 +++
> > >  .../selftests/bpf/prog_tests/btf_dump.c       |   4 +-
> > >  6 files changed, 325 insertions(+), 2 deletions(-)
> > >  create mode 100644 kernel/bpf/cgroup_iter.c
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index a97751d845c9..9061618fe929 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -47,6 +47,7 @@ struct kobject;
> > >  struct mem_cgroup;
> > >  struct module;
> > >  struct bpf_func_state;
> > > +struct cgroup;
> > >
> > >  extern struct idr btf_idr;
> > >  extern spinlock_t btf_idr_lock;
> > > @@ -1717,7 +1718,14 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
> > >         int __init bpf_iter_ ## target(args) { return 0; }
> > >
> > >  struct bpf_iter_aux_info {
> > > +       /* for map_elem iter */
> > >         struct bpf_map *map;
> > > +
> > > +       /* for cgroup iter */
> > > +       struct {
> > > +               struct cgroup *start; /* starting cgroup */
> > > +               int order;
> > > +       } cgroup;
> > >  };
> > >
> > >  typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index ffcbf79a556b..fe50c2489350 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -87,10 +87,30 @@ struct bpf_cgroup_storage_key {
> > >         __u32   attach_type;            /* program attach type (enum bpf_attach_type) */
> > >  };
> > >
> > > +enum bpf_iter_cgroup_traversal_order {
> > > +       BPF_ITER_CGROUP_PRE = 0,        /* pre-order traversal */
> > > +       BPF_ITER_CGROUP_POST,           /* post-order traversal */
> > > +       BPF_ITER_CGROUP_PARENT_UP,      /* traversal of ancestors up to the root */
> >
> > I've just put up my arguments why it's a good idea to also support a
> > "trivial" mode of only traversing specified cgroup and no descendants
> > or parents. Please see [0].
>
> cc Kui-Feng in this thread.
>
> Yeah, I think it's a good idea. It's useful when we only want to show
> a single object, which can be common. Going further, I think we may
> want to restructure bpf_iter to optimize for this case.
>
> > I think the same applies here, especially
> > considering that it seems like a good idea to support
> > task/task_vma/task_files iteration within a cgroup.
>
> I have reservations on these use cases. I don't see immediate use of
> iterating vma or files within a cgroup. Tasks within a cgroup? Maybe.
> :)
>

iter/task was what I had in mind in the first place. But I can also
imagine tools utilizing iter/task_files for each process within a
cgroup, so given iter/{task, task_file, task_vma} share the same UAPI
and internals, I don't see why we'd restrict this to only iter/task.

> > So depending on
> > how successful I am in arguing for supporting task iterator with
> > target cgroup, I think we should reuse *exactly* this
> > bpf_iter_cgroup_traversal_order and how we specify cgroup (FD or ID,
> > see some more below) *as is* in task iterators as well. In the latter
> > case, having an ability to say "iterate task for only given cgroup" is
> > very useful, and for such mode all the PRE/POST/PARENT_UP is just an
> > unnecessary nuisance.
> >
> > So please consider also adding and supporting BPF_ITER_CGROUP_SELF (or
> > whatever naming makes most sense).
> >
>
> PRE/POST/UP can be reused for iter of tree-structured containers, like
> rbtree [1]. SELF can be reused for any iters like iter/task,
> iter/cgroup, etc. Promoting all of them out of cgroup-specific struct
> seems valuable.

you mean just define them as generic tree traversal orders? Sure, I
guess makes sense. No strong feelings.

>
> [1] https://lwn.net/Articles/902405/
>
> >
> > Some more naming nits. I find BPF_ITER_CGROUP_PRE and
> > BPF_ITER_CGROUP_POST a bit confusing. Even internally in kernel we
> > have css_next_descendant_pre/css_next_descendant_post, so why not
> > reflect the fact that we are going to iterate descendants:
> > BPF_ITER_CGROUP_DESCENDANTS_{PRE,POST}. And now that we use
> > "descendants" terminology, PARENT_UP should be ANCESTORS. ANCESTORS_UP
> > probably is fine, but seems a bit redundant (unless we consider a
> > somewhat weird ANCESTORS_DOWN, where we find the furthest parent and
> > then descend through preceding parents until we reach specified
> > cgroup; seems a bit exotic).
> >
>
> BPF_ITER_CGROUP_DESCENDANTS_PRE is too verbose. If there is a
> possibility of merging rbtree and supporting walk order of rbtree
> iter, maybe the name here could be general, like
> BPF_ITER_DESCENDANTS_PRE, which seems better.

it's not like you'll be typing this hundreds of type, so verboseness
doesn't seem to be too problematic, but sure, BPF_ITER_DESCENDANTS_PRE
is fine with me

>
> >   [0] https://lore.kernel.org/bpf/f92e20e9961963e20766e290ee6668edd4bacf06.camel@xxxxxx/T/#m5ce50632aa550dd87a99241efb168cbcde1ee98f
> >
> > > +};
> > > +
> > >  union bpf_iter_link_info {
> > >         struct {
> > >                 __u32   map_fd;
> > >         } map;
> > > +
> > > +       /* cgroup_iter walks either the live descendants of a cgroup subtree, or the
> > > +        * ancestors of a given cgroup.
> > > +        */
> > > +       struct {
> > > +               /* Cgroup file descriptor. This is root of the subtree if walking
> > > +                * descendants; it's the starting cgroup if walking the ancestors.
> > > +                * If it is left 0, the traversal starts from the default cgroup v2
> > > +                * root. For walking v1 hierarchy, one should always explicitly
> > > +                * specify the cgroup_fd.
> > > +                */
> > > +               __u32   cgroup_fd;
> >
> > Now, similar to what I argued in regard of pidfd vs pid, I think the
> > same applied to cgroup_fd vs cgroup_id. Why can't we support both?
> > cgroup_fd has some benefits, but cgroup_id is nice due to simplicity
> > and not having to open/close/keep extra FDs (which can add up if we
> > want to periodically query something about a large set of cgroups).
> > Please see my arguments from [0] above.
> >
> > Thoughts?
> >
>
> We can support both, it's a good idea IMO. But what exactly is the
> interface going to look like? Can you be more specific about that?
> Below is something I tried based on your description.
>
> @@ -91,6 +91,18 @@ union bpf_iter_link_info {
>         struct {
>                 __u32   map_fd;
>         } map;
> +       struct {
> +               /* PRE/POST/UP/SELF */
> +               __u32 order;
> +               struct {
> +                       __u32 cgroup_fd;
> +                       __u64 cgroup_id;
> +               } cgroup;
> +               struct {
> +                       __u32 pid_fd;
> +                       __u64 pid;
> +               } task;
> +       };
>  };
>

So I wouldn't combine task and cgroup definition together, let's keep
them independent.

then for cgroup we can do something like:

struct {
    __u32 order;
    __u32 cgroup_fd; /* cgroup_fd ^ cgroup_id, exactly one can be non-zero */
    __u32 cgroup_id;
} cgroup

Similar idea with task, but it's a bit more complicated because there
we have target that can be pid, pidfd, or cgroup (cgroup_fd and
cgroup_id). I haven't put much thought into the best representation,
though.

> > > +               __u32   traversal_order;
> > > +       } cgroup;
> > >  };
> > >
> > >  /* BPF syscall commands, see bpf(2) man-page for more details. */
> >
> > [...]