Re: cgroup specific sticky resources (was: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.)

Yosry Ahmed <yosryahmed@xxxxxxxxxx> · Tue, 19 Jul 2022 11:00:53 -0700

On Tue, Jul 19, 2022 at 4:30 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Mon 18-07-22 10:55:59, Yosry Ahmed wrote:
> > On Tue, Jul 12, 2022 at 7:39 PM Roman Gushchin <roman.gushchin@xxxxxxxxx> wrote:
> > >
> > > On Tue, Jul 12, 2022 at 11:52:11AM +0200, Michal Hocko wrote:
> > > > On Tue 12-07-22 16:39:48, Yafang Shao wrote:
> > > > > On Tue, Jul 12, 2022 at 3:40 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > > > [...]
> > > > > > > Roman already sent reparenting fix:
> > > > > > > https://patchwork.kernel.org/project/netdevbpf/patch/20220711162827.184743-1-roman.gushchin@xxxxxxxxx/
> > > > > >
> > > > > > Reparenting is nice but not a silver bullet. Consider a shallow
> > > > > > hierarchy where the charging happens in the first level under the root
> > > > > > memcg. Reparenting to the root is just pushing everything under the
> > > > > > system resources category.
> > > > > >
> > > > >
> > > > > Agreed. That's why I don't like reparenting.
> > > > > Reparenting just reparent the charged pages and then redirect the new
> > > > > charge, but can't reparents the 'limit' of the original memcg.
> > > > > So it is a risk if the original memcg is still being charged. We have
> > > > > to forbid the destruction of the original memcg.
> > >
> > > I agree, I also don't like reparenting for !kmem case. For kmem (and *maybe*
> > > bpf maps is an exception), I don't think there is a better choice.
> > >
> > > > yes, I was toying with an idea like that. I guess we really want a
> > > > measure to keep cgroups around if they are bound to a resource which is
> > > > sticky itself. I am not sure how many other resources like BPF (aka
> > > > module like) we already do charge for memcg but considering the
> > > > potential memory consumption just reparenting will not help in general
> > > > case I am afraid.
> > >
> > > Well, then we have to make these objects a first-class citizens in cgroup API,
> > > like processes. E.g. introduce cgroup.bpf.maps, cgroup.mounts.tmpfs etc.
> > > I easily can see some value here, but it's a big API change.
> > >
> > > With the current approach when a bpf map pins a memory cgroup of the creator
> > > process (which I think is completely transparent for most bpf users), I don't
> > > think preventing the deletion of a such cgroup is possible. It will break too
> > > many things.
> > >
> > > But honestly I don't see why userspace can't handle it. If there is a cgroup which
> > > contains shared bpf maps, why would it delete it? It's a weird use case, I don't
> > > think we have to optimize for it. Also, we do a ton of optimizations for live
> > > cgroups (e.g. css refcounting being percpu) which are not working for a deleted
> > > cgroup. So noone really should expect any properties from dying cgroups.
> > >
> >
> > Just a random thought here, and I can easily be wrong (and this can
> > easily be the wrong thread for this), but if we introduce a more
> > generic concept to generally tie a resource explicitly to a cgroup
> > (tmpfs, bpf maps, etc) using cgroupfs interfaces, and then prevent the
> > cgroup from being deleted unless the resource is freed or moved to a
> > different cgroup?
>
> My understanding is that Tejun would prefer a user space defined policy
> by a proper layering. And I would tend to agree that this is less prone
> to corner cases.
>
> Anyway, how would you envision such an interface?

I imagine something like cgroup.sticky.[bpf/tmpfs/..] (I suck at naming things).

The file can contain a list of bpf map ids or tmpfs inode ids or mount
paths. You can write a new id/path to the file to add one, or read the
file to see a list of them. Basically very similar semantics to
cgroup.procs. Instead of processes, we have different types of sticky
resources here that usually outlive processes. The charging of such
resources would be deterministically attributed to this cgroup
(instead of the cgroup of the process that created the map or touched
the tmpfs file first).

Since the system admin has explicitly attributed these resources to
this cgroup, it makes sense that the cgroup cannot be deleted as long
as these resources exist and are charged to it, similar to
cgroup.procs. This also addresses some of the zombie cgroups problems.

The obvious question is: to maintain the current behavior, bpf maps
and tmpfs mounts will be by default initially charged the same way as
today. bpf maps for the cgroup of the process that created them, and
tmpfs pages on first touch basis. How do we move their charging after
their creation in the current way to a cgroup.sticky file?

For things like bpf maps, it should be relatively simple to move
charges the same way we do when we move processes, because the bpf map
is by default charged to one memcg. For tmpfs, it would be challenging
to move the charges because pages could be individually attributed to
different cgroups already. For this, we can impose a (imo reasonable)
restriction that for tmpfs (and similar resources that are currently
not attributed to a single cgroup), that you can only add a tmpfs
mount to cgroup.sticky.tmpfs for the first time if no pages have been
charged yet (no files created). Once the tmpfs mount lives in any
cgroup.sticky.tmpfs file, we know that it is charged to one cgroup,
and we can move the charges more easily.

The system admin would basically mount the tmpfs directory and then
directly write it to a cgroup.sticky.tmpfs file. For that point
onwards, all tmpfs pages will be charged to that cgroup, and they need
to be freed or moved before the cgroup can be removed.

I could be missing something here, but I imagine such an interface(s)
would address both the bpf progs/maps charging concerns, and our
previous memcg= mount attempts. Please correct me if I am wrong.

>
> > This would be optional, so the current status quo is maintainable, but
> > also gives flexibility to admins to assign resources to cgroups to
> > make sure nothing is ( unaccounted / accounted to a zombie memcg /
> > reparented to an unrelated parent ). This might be too fine-grained to
> > be practical but I just thought it might be useful. We will also need
> > to define an OOM behavior for such resources. Things like bpf maps
> > will be unreclaimable, but tmpfs memory can be swapped out.
>
> Keep in mind that the swap is a shared resource in itself. So tmpfs is
> essentially a sticky resource as well. A tmpfs file is not bound to any
> proces life time the same way BPF program is. You might need less
> priviledges to remove a file but in principle they are consuming
> resources without any explicit owner.

I fully agree, which is exactly why I am suggesting having a defined
way to handle such "sticky" resources that outlive processes.
Currently cgroups operate in terms of processes and this can be a
limitation for such resources.

> --
> Michal Hocko
> SUSE Labs