On Wed, Jul 13, 2022 at 10:39 AM Roman Gushchin <roman.gushchin@xxxxxxxxx> wrote:
>
> On Tue, Jul 12, 2022 at 11:52:11AM +0200, Michal Hocko wrote:
> > On Tue 12-07-22 16:39:48, Yafang Shao wrote:
> > > On Tue, Jul 12, 2022 at 3:40 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > [...]
> > > > > Roman already sent a reparenting fix:
> > > > > https://patchwork.kernel.org/project/netdevbpf/patch/20220711162827.184743-1-roman.gushchin@xxxxxxxxx/
> > > >
> > > > Reparenting is nice but not a silver bullet. Consider a shallow
> > > > hierarchy where the charging happens in the first level under the root
> > > > memcg. Reparenting to the root is just pushing everything under the
> > > > system resources category.
> > > >
> > >
> > > Agreed. That's why I don't like reparenting.
> > > Reparenting just reparents the already-charged pages and redirects new
> > > charges, but it can't reparent the 'limit' of the original memcg.
> > > So it is a risk if the original memcg is still being charged. We have
> > > to forbid the destruction of the original memcg.
>
> I agree, I also don't like reparenting for the !kmem case. For kmem (and *maybe*
> bpf maps are an exception), I don't think there is a better choice.
>
> > yes, I was toying with an idea like that. I guess we really want a
> > measure to keep cgroups around if they are bound to a resource which is
> > sticky itself. I am not sure how many other resources like BPF (aka
> > module like) we already do charge for memcg, but considering the
> > potential memory consumption, just reparenting will not help in the
> > general case I am afraid.
>
> Well, then we have to make these objects first-class citizens in the cgroup API,
> like processes. E.g. introduce cgroup.bpf.maps, cgroup.mounts.tmpfs, etc.
> I can easily see some value here, but it's a big API change.
>
> With the current approach, where a bpf map pins the memory cgroup of the creator
> process (which I think is completely transparent for most bpf users), I don't
> think preventing the deletion of such a cgroup is possible. It would break too
> many things.
>
> But honestly I don't see why userspace can't handle it. If there is a cgroup which
> contains shared bpf maps, why would it delete it? It's a weird use case, I don't
> think we have to optimize for it. Also, we do a ton of optimizations for live
> cgroups (e.g. css refcounting being percpu) which do not work for a deleted
> cgroup. So no one really should expect any properties from dying cgroups.
>

I think we have already discussed why the user can't handle it easily.
Actually, it's NOT a weird use case if you are a k8s user. (Of course it may
seem weird to a systemd user, but unfortunately systemd doesn't rule the
whole world.)

I have told you that it is not reasonable to refuse to let a containerized
process pin bpf programs, but if you are not familiar with k8s, it is not
easy to explain clearly why this is a problem for deployment. Let me try to
explain it from a *systemd user's* perspective.

                     bpf-memcg  (must be persistent)
                    /          \
        bpf-foo-memcg          bpf-bar-memcg  (must be persistent, and limit here)
  -------------------------------------------------------
             /                       \
       bpf-foo pod              bpf-bar pod  (being created and destroyed, but not limited)

I assume the above hierarchy is what you expect. But in a k8s environment
everything is pod-based, which means that if we used the above hierarchy,
k8s's limiting, monitoring and debugging would all have to change
accordingly. That could be a full-stack change in k8s, a great refactor.
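Just to make the above concrete, a rough sketch (only an illustration, not a
proposal: the paths under /sys/fs/cgroup and the 256M limit are made up, and
it assumes cgroup v2 with the memory controller already enabled at the root)
of creating the persistent part of that hierarchy:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

/* write a single value into a cgroup control file */
static void write_file(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror(path);
		return;
	}
	if (write(fd, val, strlen(val)) < 0)
		perror(path);
	close(fd);
}

int main(void)
{
	/* example paths and limit only */

	/* persistent parent for all bpf memory */
	mkdir("/sys/fs/cgroup/bpf-memcg", 0755);
	/* let its children get the memory controller */
	write_file("/sys/fs/cgroup/bpf-memcg/cgroup.subtree_control", "+memory");

	/* per-service memcgs: persistent, and this is where the limit lives */
	mkdir("/sys/fs/cgroup/bpf-memcg/bpf-foo-memcg", 0755);
	mkdir("/sys/fs/cgroup/bpf-memcg/bpf-bar-memcg", 0755);
	write_file("/sys/fs/cgroup/bpf-memcg/bpf-foo-memcg/memory.max", "268435456");
	write_file("/sys/fs/cgroup/bpf-memcg/bpf-bar-memcg/memory.max", "268435456");

	return 0;
}

The bpf-foo / bpf-bar pods would then have to be created (and destroyed)
below these two memcgs, which is exactly the layout change that k8s would
have to absorb.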
So the hierarchy below is a reasonable solution:

                          bpf-memcg
                              |
      bpf-foo pod        bpf-foo-memcg (limited)
      /        \             /
 (charge)  (not-charged)  (charged)
 proc-foo        bpf-foo

And then keep the bpf-memcgs persistent.
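For what it's worth, with the behaviour Roman described above (the map pins
the memcg of its creator at BPF_MAP_CREATE time, so later charges go there
too), a loader inside the pod could get the charging shown in the diagram by
briefly joining the persistent memcg around map creation. A rough userspace
sketch only, with hypothetical paths and no delegation/permission handling:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* move the calling process into the given cgroup (cgroup v2) */
static void join_cgroup(const char *procs_path)
{
	char buf[16];
	int fd = open(procs_path, O_WRONLY);

	if (fd < 0) {
		perror(procs_path);
		return;
	}
	snprintf(buf, sizeof(buf), "%d", getpid());
	if (write(fd, buf, strlen(buf)) < 0)
		perror(procs_path);
	close(fd);
}

int main(void)
{
	union bpf_attr attr = {
		.map_type    = BPF_MAP_TYPE_HASH,
		.key_size    = sizeof(int),
		.value_size  = sizeof(long),
		.max_entries = 1024,
	};
	int map_fd;

	/* create the map while sitting in the persistent memcg, so the map
	 * (and its future element allocations) get charged there */
	join_cgroup("/sys/fs/cgroup/bpf-memcg/bpf-foo-memcg/cgroup.procs");
	map_fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));

	/* go back to the pod's memcg (hypothetical path) so proc-foo itself
	 * stays limited by the pod */
	join_cgroup("/sys/fs/cgroup/kubepods/bpf-foo-pod/cgroup.procs");

	if (map_fd < 0) {
		perror("BPF_MAP_CREATE");
		return 1;
	}
	return 0;
}

Of course this relies on the loader having write access to both cgroup.procs
files; it is only meant to illustrate the charging in the diagram.

--
Regards
Yafang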