On Tue, Aug 11, 2020 at 08:32:25PM +0200, Michal Koutny wrote:
> On Tue, Aug 11, 2020 at 09:55:27AM -0700, Roman Gushchin <guro@xxxxxx> wrote:
> > As I said, there are 2 problems with charging systemd (or a similar daemon):
> > 1) It often belongs to the root cgroup.
> This doesn't hold for systemd (if we agree that systemd is the most
> common case).

Ok, that's better.

> > 2) OOMing or failing some random memory allocations is a bad way
> > to "communicate" a memory shortage to the daemon.
> > What we really want is to prevent creating a huge number of cgroups
> There's cgroup.max.descendants for that...

cgroup.max.descendants limits the number of live cgroups; it can't limit
the number of dying cgroups.

> > (including dying cgroups) in some specific sub-tree(s).
> ...oh, so is this limiting the number of cgroups or limiting resources
> used?

My scenario is simple: there is a large machine which has no memory
pressure for some time (e.g. it is idle or running a workload with a
small working set). Periodically running services create a lot of
cgroups, usually in system.slice. After some time a significant part of
the machine's memory is consumed by dying cgroups and their percpu data.
Getting rid of them and reclaiming that memory is not always possible
(percpu memory gets fragmented relatively easily) and is time consuming.

If we set memory.high on system.slice, it creates artificial memory
pressure once we get close to the limit. That triggers reclaim of user
pages and slab objects, so eventually we'll be able to release the dying
cgroups as well (a rough sketch of this setup follows below).

You might say that it would work even without charging memcg internal
structures. The problem is that a small slab object can indirectly pin
a lot of (percpu) memory. If we don't take the indirectly pinned memory
into account, we likely won't apply enough memory pressure.

If we limit init.slice (where systemd seems to reside), as you suggest,
we'll eventually create thrashing in init.slice, followed by OOM. I
struggle to see how that makes a user's life better.

> > OOMing the daemon or returning -ENOMEM to some random syscalls
> > will not help us to reach the goal and likely will bring a bad
> > experience to a user.
> If we reach the situation when memory for cgroup operations is tight,
> it'll disappoint the user either way.
> My premise is that a running workload is more valuable than the
> accompanying manager.

The problem is that OOM-killing the accompanying manager won't release
resources or help to get rid of the accumulated cgroups. So in the very
best case it will prevent new cgroups from being created (as well as
some other random operations from being performed). Most likely the only
way to "fix" this for a user will be to reboot the machine.

> > In a generic case I don't see how we can charge the cgroup which
> > creates cgroups without solving these problems first.
> In my understanding, "onbehalveness" is a concept useful for various
> kernel threads doing deferred work. Here it's promoted to user processes
> managing cgroups.
> 
> > And if there is a very special case where we have to limit it,
> > we can just add an additional layer:
> > 
> >   ` root or delegated root
> >        ` manager-parent-cgroup-with-a-limit
> >             ` manager-cgroup (systemd, docker, ...)
> >        ` [aggregation group(s)]
> >             ` job-group-1
> >             ` ...
> >             ` job-group-n
> If the charge goes to the parent of created cgroup (job-cgroup-i here),
> then the layer adds nothing. Am I missing something?

Sorry, I was wrong here, please ignore this part.
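
To illustrate the memory.high approach above, here is a minimal sketch,
assuming cgroup v2 is mounted at /sys/fs/cgroup, root privileges, and an
arbitrary 4 GiB limit (none of these specifics are from the discussion
itself). It caps system.slice and reports how many dying descendant
cgroups the slice still pins, using the standard memory.high,
memory.current and cgroup.stat interface files:

#!/usr/bin/env python3
# Minimal sketch: cap system.slice with memory.high and report how many
# dying descendant cgroups it still pins. Paths assume cgroup v2 mounted
# at /sys/fs/cgroup; the 4 GiB value is an arbitrary example, not a
# recommendation. Needs root.

from pathlib import Path

SLICE = Path("/sys/fs/cgroup/system.slice")
HIGH_BYTES = 4 * 1024 ** 3  # arbitrary example limit


def read_flat_keyed(path: Path) -> dict:
    """Parse a flat keyed cgroup file (e.g. cgroup.stat) into a dict."""
    stats = {}
    for line in path.read_text().splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats


def main() -> None:
    # Above this threshold the slice gets throttled and reclaimed; with
    # cgroup internal structures charged to the slice, the percpu data of
    # dying descendants counts against it too.
    (SLICE / "memory.high").write_text(str(HIGH_BYTES))

    stat = read_flat_keyed(SLICE / "cgroup.stat")
    current = int((SLICE / "memory.current").read_text())
    print(f"memory.current:       {current}")
    print(f"nr_descendants:       {stat['nr_descendants']}")
    print(f"nr_dying_descendants: {stat['nr_dying_descendants']}")


if __name__ == "__main__":
    main()
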
> > > I'd definitely charge the parent cgroup in all similar cases.
> (This would mandate the controllers on the unified hierarchy, which is
> fine IMO.) Then the order of enabling controllers on a subtree (e.g.
> cpu,memory vs memory,cpu) by the manager would yield different charging.
> This seems wrong^W confusing to me.

I agree it's confusing.

Thanks!