Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter

Yosry Ahmed <yosryahmed@xxxxxxxxxx> · Fri, 20 May 2022 14:18:42 -0700

On Fri, May 20, 2022 at 12:43 PM Hao Luo <haoluo@xxxxxxxxxx> wrote:
>
> Hi Tejun and Yonghong,
>
> On Fri, May 20, 2022 at 9:45 AM Tejun Heo <tj@xxxxxxxxxx> wrote:
> > On Fri, May 20, 2022 at 09:29:43AM -0700, Yonghong Song wrote:
> > > Maybe you can have a bpf program signature like below:
> > >
> > > int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp,
> > > struct cgroup *parent_cgrp)
> > >
> > > parent_cgrp is NULL when cgrp is the root cgroup.
> > >
> > > I would like the bpf program should send the following information to
> > > user space:
> > >    <parent cgroup dir name> <current cgroup dir name>
> >
> > I don't think parent cgroup dir name would be sufficient to reconstruct the
> > path given that multiple cgroups in different subtrees can have the same
> > name. For live cgroups, userspace can find the path from id (or ino) without
> > traversing anything by constructing the fhandle, open it open_by_handle_at()
> > and then reading /proc/self/fd/$FD symlink -
> > https://lkml.org/lkml/2020/12/2/1126. This isn't available for dead cgroups
> > but I'm not sure how much that'd matter given that they aren't visible from
> > userspace anyway.
> >
>
> Sending cgroup id is better than cgroup dir name, also because IIUC
> the path obtained from cgroup id depends on the namespace of the
> userspace process. So if the dump file may be potentially read by
> processes within a container, it's better to have the output
> namespaced IMO.
>
> > >    <various stats interested by the user>
> > >
> > > This way, user space can easily construct the cgroup hierarchy stat like
> > >                            cpu   mem   cpu pressure   mem pressure ...
> > >    cgroup1                 ...
> > >       child1               ...
> > >         grandchild1        ...
> > >       child2               ...
> > >    cgroup 2                ...
> > >       child 3              ...
> > >         ...                ...
> > >
> > > the bpf iterator can have additional parameter like
> > > cgroup_id = ... to only call bpf program once with that
> > > cgroup_id if specified.
>
> Yep, this should work. We just need to make the cgroup_id parameter
> optional. If it is specified when creating bpf_iter_link, we print for
> that cgroup only. If it is not specified, we iterate over all cgroups.
> If I understand correctly, sounds doable.
>
> > > The kernel part of cgroup_iter can call cgroup_rstat_flush()
> > > before calling cgroup_iter bpf program.
>
> Sounds good to me as well. But my knowledge on rstat_flush is limited.
> Yosry can give this a try.
>
> >
> > Would it work to just pass in @cgrp and provide a group of helpers so that
> > the program can do whatever it wanna do including looking up the full path
> > and passing that to userspace?
> >
>
> My understanding is, yes, doable. If we need the full path information
> of a cgroup, helpers or kfuncs are needed.
>
> The userspace needs to specify the identity of the cgroup, when
> creating bpf_iter. This identity could be cgroup id or fd. This
> identity needs to be converted to cgroup object somewhere before
> passing into bpf program to use.

Let's sum up the discussion here, I feel like we are losing track of
the main problem. IIUC the main concern is that cgroup_iter is not
effectively an iterator, it rather dumps information for one cgroup. I
like the suggestion to make it iterate cgroups by default, and an
optional cgroup_id parameter to make it only "iterate" this one
cgroup. IIUC, this cgroup_id parameter would be a link parameter,
similar to the current approach. Basically, we extend the current
patch so that if cgroup_id is not specified the iterator gets called
for all cgroups instead of one. This fixes the problem for our use
case and also keeps cgroup_iter generic enough. Is my understanding
correct? If yes, I don't see a need to flush rstat in the kernel on
behalf of cgroup_iter progs.