On Tue, Aug 11, 2020 at 04:47:54PM +0200, Michal Koutny wrote:
> On Thu, Aug 06, 2020 at 09:37:17PM -0700, Roman Gushchin <guro@xxxxxx> wrote:
> > In general, yes. But in this case I think it wouldn't be a good idea:
> > most often cgroups are created by a centralized daemon (systemd),
> > which is usually located in the root cgroup. Even if it's not located in
> > the root cgroup, limiting its memory will likely affect the whole system,
> > even if only one specific limit was reached.
> The generic scheme would be (assuming the no-internal-process
> constraint, in the root too)
>
>   ` root or delegated root
>     ` manager-cgroup (systemd, docker, ...)
>     ` [aggregation group(s)]
>       ` job-group-1
>       ` ...
>       ` job-group-n
>
> > If there is a containerized workload, which creates sub-cgroups,
> > charging its parent cgroup is perfectly effective.
> No dispute about this in either approach.
>
> > And the opposite: if we charge the cgroup of the process that created
> > a cgroup, we won't cover the most common case: systemd creating
> > cgroups for all services in the system.
> What I mean is that systemd should be charged for the cgroup creation.
> Or more generally, any container/cgroup manager should be charged.
> Consider a leak where it wouldn't remove spent cgroups; IMO the effect
> (throttling, reclaim) should be exercised on such a culprit.

As I said, there are two problems with charging systemd (or a similar daemon):
1) It often belongs to the root cgroup.
2) OOMing or failing some random memory allocations is a bad way
   to "communicate" a memory shortage to the daemon.
What we really want is to prevent the creation of a huge number of cgroups
(including dying cgroups) in some specific sub-tree(s). OOMing the daemon or
returning -ENOMEM to some random syscalls will not help us reach that goal
and will likely result in a bad user experience.

In the generic case I don't see how we can charge the cgroup which creates
cgroups without solving these problems first.

And if there is a very special case where we have to limit it,
we can just add an additional layer:

  ` root or delegated root
    ` manager-parent-cgroup-with-a-limit
      ` manager-cgroup (systemd, docker, ...)
    ` [aggregation group(s)]
      ` job-group-1
      ` ...
      ` job-group-n

> I don't think the existing workload (job-group-i above) should
> unnecessarily suffer when only the manager is acting up. Is that different
> from your idea?
>
> > Right, but it's quite unusual for tasks from one cgroup to create sub-cgroups
> > in a completely different cgroup. In this particular case there are tons of other
> > ways how a task from C00 can hurt C1.
> I agree with that.
>
> If I haven't overlooked anything, this should be the first case where
> cgroup-related structures are accounted (please correct me).
> So this is setting a precedent, in case others prove useful to be accounted
> in the future too.

Right.

> I'm thinking about cpu_cgroup_css_alloc() that can
> also allocate a lot (with a big CPU count). The current approach would lead
> to situations where matching cpu and memory csses needn't exist and that
> would need special handling.

I'd definitely charge the parent cgroup in all similar cases.

> > On Thu, Aug 06, 2020 at 09:16:03PM -0700, Andrew Morton wrote:
> > > These week-old issues appear to be significant. Roman? Or someone
> > > else?
> Despite my concerns, I don't think this is fundamental and can't be
> changed later, so it doesn't prevent the inclusion in 5.9 RC1.

Thank you!
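
For illustration, a minimal sketch of the "additional layer" approach described
above, created through the cgroup v2 filesystem. The mount point
(/sys/fs/cgroup), the cgroup names and the 100M limit are assumptions made for
the example, not part of the discussion; the manager would afterwards be moved
in by writing its PID to manager-cgroup/cgroup.procs.

	/*
	 * Sketch: create an extra parent cgroup with a memory limit and a
	 * manager cgroup underneath it, using the cgroup v2 filesystem.
	 * Paths, names and the 100M limit are illustrative assumptions.
	 */
	#include <errno.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/stat.h>
	#include <sys/types.h>
	#include <unistd.h>

	static int write_file(const char *path, const char *val)
	{
		int fd = open(path, O_WRONLY);

		if (fd < 0)
			return -1;
		if (write(fd, val, strlen(val)) != (ssize_t)strlen(val)) {
			close(fd);
			return -1;
		}
		return close(fd);
	}

	int main(void)
	{
		const char *root = "/sys/fs/cgroup";	/* cgroup v2 mount point (assumed) */
		char parent[256], manager[512], path[768];

		snprintf(parent, sizeof(parent),
			 "%s/manager-parent-cgroup-with-a-limit", root);
		snprintf(manager, sizeof(manager), "%s/manager-cgroup", parent);

		/* Enable the memory controller for children of the root. */
		snprintf(path, sizeof(path), "%s/cgroup.subtree_control", root);
		if (write_file(path, "+memory"))
			perror("cgroup.subtree_control");

		/* The extra layer that takes the limit. */
		if (mkdir(parent, 0755) && errno != EEXIST)
			perror("mkdir parent");

		/*
		 * Memory used by the manager, and by anything it creates under
		 * this layer, is charged against this limit (100M, illustrative).
		 */
		snprintf(path, sizeof(path), "%s/memory.max", parent);
		if (write_file(path, "104857600"))
			perror("memory.max");

		/* The manager (systemd, docker, ...) lives in this child cgroup. */
		if (mkdir(manager, 0755) && errno != EEXIST)
			perror("mkdir manager");

		return 0;
	}

With such a layer in place, a manager that leaks (dying) cgroups under it runs
into the layer's limit and is throttled or reclaimed there, instead of OOMing
or starving the job groups next to it.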