On Mon, Feb 10, 2025 at 05:24:17PM +0100, Michal Koutný wrote:
> Hello.
>
> On Thu, Feb 06, 2025 at 11:09:05AM -0800, Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
> > Oh I totally forgot about your series. In my use-case, it is not about
> > dynamically knowing how much they can expand and adjust themselves but
> > rather knowing statically upfront what resources they have been given.
>
> From the memcg PoV, the effective value doesn't tell how much they were
> given (because of sharing).

It's definitely true that if you have an ancestral limit over several otherwise unlimited siblings, then interpreting this number as "this is how much memory I have available" will be completely misleading.

I would also say that sharing a limit with several siblings requires a certain degree of awareness and cooperation between them. From that POV, IMO it would be fine to provide a metric with contextual caveats.

The problem is: what do we do with canned, unaware, maybe untrusted applications? And they don't necessarily know which they are. It depends heavily on the judgement of the administrator of any given deployment.

Some workloads might be completely untrusted and hard limited. Another deployment might consider the same workload predictable enough that it's configured only with a failsafe max limit set much higher than where the workload is *expected* to operate. The allotment might happen entirely through min/low protections, with no max limit. Or there could be a combination of a protection slightly below and a limit slightly above the expected workload size.

It seems basically impossible to write portable code against this without knowing the intent of the person setting it up. But how do we communicate intent down to the container? The two broad options are to do it implicitly or explicitly:

a) Provide a cgroup file that automatically derives the intended target size from how min/low/high/max are set up. Right now those can be set up super loosely depending on what the administrator thinks about the application. In order for this to work, we'd likely have to define an idiomatic way of configuring the controller. E.g. if you set max by itself, we assume this is the target size. If you set low, with or without max, then low is the target size. Or if you set both, the target is somewhere in between. (A rough sketch of such a derivation follows at the end of this mail.)

I'm not completely convinced this is workable. It might require settings beyond what's actually needed for the safe containment of the workload, which carries the risk of excluding otherwise useful configurations. I don't mean enforced configuration rules, but rather the case where a configuration is reasonable and effective given the workload and environment, but the target file now shows nonsense.

b) Provide a cgroup file that is freely configurable by the administrator with the target size of the container.

This has obvious drawbacks as well. What's the default value? Also, a lot of setups are dead simple: set a hard limit and expect the workload to adhere to it, period. Nobody is going to reliably set yet another cgroup file that a workload may or may not consume.

The third option is to wash our hands of all of this, provide the static hierarchy settings to the leaves (like this patch does, extended to the other knobs as well), and let userspace figure it out. (Also sketched below.)

Thoughts?
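
P.S. To make option a) a bit more concrete, here is a rough, untested sketch of one possible derivation idiom. The function name, the "unlimited" encoding, and the precedence rules (including the arbitrary midpoint for "in between") are all made up for illustration, not a proposal for actual kernel logic:

#include <stdio.h>

#define MEM_UNLIMITED (~0ULL)

/*
 * Hypothetical idiom: max alone means target == max, low alone means
 * target == low, both set means the target sits somewhere in between
 * (midpoint chosen arbitrarily here). 0 means low is unset.
 */
static unsigned long long derive_target(unsigned long long low,
					unsigned long long max)
{
	if (low && max != MEM_UNLIMITED)
		return low + (max - low) / 2;	/* both set: in between */
	if (low)
		return low;			/* low is the target */
	if (max != MEM_UNLIMITED)
		return max;			/* max alone is the target */
	return MEM_UNLIMITED;			/* no intent expressed */
}

int main(void)
{
	/* e.g. memory.low = 4G, memory.max = 8G -> 6G target */
	printf("%llu\n", derive_target(4ULL << 30, 8ULL << 30));
	return 0;
}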
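And for the third option, a similarly rough sketch of what "let userspace figure it out" could look like for the hard limit: walk the cgroupfs ancestry and take the minimum memory.max. The mount point and leaf path are assumptions; a real implementation would resolve its own cgroup from /proc/self/cgroup and do proper error handling:

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read memory.max from one cgroup dir; ~0ULL means "max" or absent. */
static unsigned long long read_max(const char *cgpath)
{
	unsigned long long val = ~0ULL;
	char file[PATH_MAX + 16], buf[32];
	FILE *f;

	snprintf(file, sizeof(file), "%s/memory.max", cgpath);
	f = fopen(file, "r");
	if (!f)
		return val;
	if (fgets(buf, sizeof(buf), f) && strncmp(buf, "max", 3))
		val = strtoull(buf, NULL, 10);
	fclose(f);
	return val;
}

int main(void)
{
	char path[PATH_MAX] = "/sys/fs/cgroup/foo/bar"; /* example leaf */
	unsigned long long eff = ~0ULL;

	/* Minimum of memory.max over the ancestry (root has no file). */
	while (strlen(path) > strlen("/sys/fs/cgroup")) {
		unsigned long long m = read_max(path);

		if (m < eff)
			eff = m;
		*strrchr(path, '/') = '\0';
	}
	printf("effective memory.max: %llu\n", eff);
	return 0;
}

Note this only works this simply for max (and high); for min/low a plain minimum over ancestors isn't the right aggregation, since effective protection depends on siblings and usage, which is part of what makes this messy for userspace in the first place.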