On Tue, Jul 12, 2022 at 7:39 PM Roman Gushchin <roman.gushchin@xxxxxxxxx> wrote: > > On Tue, Jul 12, 2022 at 11:52:11AM +0200, Michal Hocko wrote: > > On Tue 12-07-22 16:39:48, Yafang Shao wrote: > > > On Tue, Jul 12, 2022 at 3:40 PM Michal Hocko <mhocko@xxxxxxxx> wrote: > > [...] > > > > > Roman already sent reparenting fix: > > > > > https://patchwork.kernel.org/project/netdevbpf/patch/20220711162827.184743-1-roman.gushchin@xxxxxxxxx/ > > > > > > > > Reparenting is nice but not a silver bullet. Consider a shallow > > > > hierarchy where the charging happens in the first level under the root > > > > memcg. Reparenting to the root is just pushing everything under the > > > > system resources category. > > > > > > > > > > Agreed. That's why I don't like reparenting. > > > Reparenting just reparent the charged pages and then redirect the new > > > charge, but can't reparents the 'limit' of the original memcg. > > > So it is a risk if the original memcg is still being charged. We have > > > to forbid the destruction of the original memcg. > > I agree, I also don't like reparenting for !kmem case. For kmem (and *maybe* > bpf maps is an exception), I don't think there is a better choice. > > > yes, I was toying with an idea like that. I guess we really want a > > measure to keep cgroups around if they are bound to a resource which is > > sticky itself. I am not sure how many other resources like BPF (aka > > module like) we already do charge for memcg but considering the > > potential memory consumption just reparenting will not help in general > > case I am afraid. > > Well, then we have to make these objects a first-class citizens in cgroup API, > like processes. E.g. introduce cgroup.bpf.maps, cgroup.mounts.tmpfs etc. > I easily can see some value here, but it's a big API change. > > With the current approach when a bpf map pins a memory cgroup of the creator > process (which I think is completely transparent for most bpf users), I don't > think preventing the deletion of a such cgroup is possible. It will break too > many things. > > But honestly I don't see why userspace can't handle it. If there is a cgroup which > contains shared bpf maps, why would it delete it? It's a weird use case, I don't > think we have to optimize for it. Also, we do a ton of optimizations for live > cgroups (e.g. css refcounting being percpu) which are not working for a deleted > cgroup. So noone really should expect any properties from dying cgroups. > Just a random thought here, and I can easily be wrong (and this can easily be the wrong thread for this), but if we introduce a more generic concept to generally tie a resource explicitly to a cgroup (tmpfs, bpf maps, etc) using cgroupfs interfaces, and then prevent the cgroup from being deleted unless the resource is freed or moved to a different cgroup? This would be optional, so the current status quo is maintainable, but also gives flexibility to admins to assign resources to cgroups to make sure nothing is ( unaccounted / accounted to a zombie memcg / reparented to an unrelated parent ). This might be too fine-grained to be practical but I just thought it might be useful. We will also need to define an OOM behavior for such resources. Things like bpf maps will be unreclaimable, but tmpfs memory can be swapped out. I think this also partially addresses Johannes's concerns that the memcg= mount option uses file system mounts to create shareable resource domains outside of the cgroup hierarchy. > Thanks! >