On Thu 19-12-19 15:07:17, Johannes Weiner wrote: > The effective protection of any given cgroup is a somewhat complicated > construct that depends on the ancestor's configuration, siblings' > configurations, as well as current memory utilization in all these > groups. It's done this way to satisfy hierarchical delegation > requirements while also making the configuration semantics flexible > and expressive in complex real life scenarios. > > Unfortunately, all the rules and requirements are sparsely documented, > and the code is a little too clever in merging different scenarios > into a single min() expression. This makes it hard to reason about the > implementation and avoid breaking semantics when making changes to it. > > This patch documents each semantic rule individually and splits out > the handling of the overcommit case from the regular case. > > Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx> It took me a while to double check that the new code is indeed equivalent but the result is easier to grasp indeed. Acked-by: Michal Hocko <mhocko@xxxxxxxx> > --- > mm/memcontrol.c | 190 ++++++++++++++++++++++++++---------------------- > 1 file changed, 104 insertions(+), 86 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 874a0b00f89b..9c771c4d6339 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -6204,65 +6204,55 @@ struct cgroup_subsys memory_cgrp_subsys = { > .early_init = 0, > }; > > -/** > - * mem_cgroup_protected - check if memory consumption is in the normal range > - * @root: the top ancestor of the sub-tree being checked > - * @memcg: the memory cgroup to check > - * > - * WARNING: This function is not stateless! It can only be used as part > - * of a top-down tree iteration, not for isolated queries. > - * > - * Returns one of the following: > - * MEMCG_PROT_NONE: cgroup memory is not protected > - * MEMCG_PROT_LOW: cgroup memory is protected as long there is > - * an unprotected supply of reclaimable memory from other cgroups. > - * MEMCG_PROT_MIN: cgroup memory is protected > - * > - * @root is exclusive; it is never protected when looked at directly > +/* > + * This function calculates an individual cgroup's effective > + * protection which is derived from its own memory.min/low, its > + * parent's and siblings' settings, as well as the actual memory > + * distribution in the tree. > * > - * To provide a proper hierarchical behavior, effective memory.min/low values > - * are used. Below is the description of how effective memory.low is calculated. > - * Effective memory.min values is calculated in the same way. > + * The following rules apply to the effective protection values: > * > - * Effective memory.low is always equal or less than the original memory.low. > - * If there is no memory.low overcommittment (which is always true for > - * top-level memory cgroups), these two values are equal. > - * Otherwise, it's a part of parent's effective memory.low, > - * calculated as a cgroup's memory.low usage divided by sum of sibling's > - * memory.low usages, where memory.low usage is the size of actually > - * protected memory. > + * 1. At the first level of reclaim, effective protection is equal to > + * the declared protection in memory.min and memory.low. > * > - * low_usage > - * elow = min( memory.low, parent->elow * ------------------ ), > - * siblings_low_usage > + * 2. To enable safe delegation of the protection configuration, at > + * subsequent levels the effective protection is capped to the > + * parent's effective protection. > * > - * low_usage = min(memory.low, memory.current) > + * 3. To make complex and dynamic subtrees easier to configure, the > + * user is allowed to overcommit the declared protection at a given > + * level. If that is the case, the parent's effective protection is > + * distributed to the children in proportion to how much protection > + * they have declared and how much of it they are utilizing. > * > + * This makes distribution proportional, but also work-conserving: > + * if one cgroup claims much more protection than it uses memory, > + * the unused remainder is available to its siblings. > * > - * Such definition of the effective memory.low provides the expected > - * hierarchical behavior: parent's memory.low value is limiting > - * children, unprotected memory is reclaimed first and cgroups, > - * which are not using their guarantee do not affect actual memory > - * distribution. > + * Consider the following example tree: > * > - * For example, if there are memcgs A, A/B, A/C, A/D and A/E: > + * A A/memory.low = 2G, A/memory.current = 6G > + * //\\ > + * BC DE B/memory.low = 3G B/memory.current = 2G > + * C/memory.low = 1G C/memory.current = 2G > + * D/memory.low = 0 D/memory.current = 2G > + * E/memory.low = 10G E/memory.current = 0 > * > - * A A/memory.low = 2G, A/memory.current = 6G > - * //\\ > - * BC DE B/memory.low = 3G B/memory.current = 2G > - * C/memory.low = 1G C/memory.current = 2G > - * D/memory.low = 0 D/memory.current = 2G > - * E/memory.low = 10G E/memory.current = 0 > + * and memory pressure is applied, the following memory > + * distribution is expected (approximately*): > * > - * and the memory pressure is applied, the following memory distribution > - * is expected (approximately): > + * A/memory.current = 2G > + * B/memory.current = 1.3G > + * C/memory.current = 0.6G > + * D/memory.current = 0 > + * E/memory.current = 0 > * > - * A/memory.current = 2G > + * *assuming equal allocation rate and reclaimability > * > - * B/memory.current = 1.3G > - * C/memory.current = 0.6G > - * D/memory.current = 0 > - * E/memory.current = 0 > + * 4. Conversely, when the declared protection is undercommitted at a > + * given level, the distribution of the larger parental protection > + * budget is NOT proportional. A cgroup's protection from a sibling > + * is capped to its own memory.min/low setting. > * > * These calculations require constant tracking of the actual low usages > * (see propagate_protected_usage()), as well as recursive calculation of > @@ -6272,12 +6262,63 @@ struct cgroup_subsys memory_cgrp_subsys = { > * for next usage. This part is intentionally racy, but it's ok, > * as memory.low is a best-effort mechanism. > */ > +static unsigned long effective_protection(unsigned long usage, > + unsigned long setting, > + unsigned long parent_effective, > + unsigned long siblings_protected) > +{ > + unsigned long protected; > + > + protected = min(usage, setting); > + /* > + * If all cgroups at this level combined claim and use more > + * protection then what the parent affords them, distribute > + * shares in proportion to utilization. > + * > + * We are using actual utilization rather than the statically > + * claimed protection in order to be work-conserving: claimed > + * but unused protection is available to siblings that would > + * otherwise get a smaller chunk than what they claimed. > + */ > + if (siblings_protected > parent_effective) > + return protected * parent_effective / siblings_protected; > + > + /* > + * Ok, utilized protection of all children is within what the > + * parent affords them, so we know whatever this child claims > + * and utilizes is effectively protected. > + * > + * If there is unprotected usage beyond this value, reclaim > + * will apply pressure in proportion to that amount. > + * > + * If there is unutilized protection, the cgroup will be fully > + * shielded from reclaim, but we do return a smaller value for > + * protection than what the group could enjoy in theory. This > + * is okay. With the overcommit distribution above, effective > + * protection is always dependent on how memory is actually > + * consumed among the siblings anyway. > + */ > + return protected; > +} > + > +/** > + * mem_cgroup_protected - check if memory consumption is in the normal range > + * @root: the top ancestor of the sub-tree being checked > + * @memcg: the memory cgroup to check > + * > + * WARNING: This function is not stateless! It can only be used as part > + * of a top-down tree iteration, not for isolated queries. > + * > + * Returns one of the following: > + * MEMCG_PROT_NONE: cgroup memory is not protected > + * MEMCG_PROT_LOW: cgroup memory is protected as long there is > + * an unprotected supply of reclaimable memory from other cgroups. > + * MEMCG_PROT_MIN: cgroup memory is protected > + */ > enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root, > struct mem_cgroup *memcg) > { > struct mem_cgroup *parent; > - unsigned long emin, parent_emin; > - unsigned long elow, parent_elow; > unsigned long usage; > > if (mem_cgroup_disabled()) > @@ -6292,52 +6333,29 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root, > if (!usage) > return MEMCG_PROT_NONE; > > - emin = memcg->memory.min; > - elow = memcg->memory.low; > - > parent = parent_mem_cgroup(memcg); > /* No parent means a non-hierarchical mode on v1 memcg */ > if (!parent) > return MEMCG_PROT_NONE; > > - if (parent == root) > - goto exit; > - > - parent_emin = READ_ONCE(parent->memory.emin); > - emin = min(emin, parent_emin); > - if (emin && parent_emin) { > - unsigned long min_usage, siblings_min_usage; > - > - min_usage = min(usage, memcg->memory.min); > - siblings_min_usage = atomic_long_read( > - &parent->memory.children_min_usage); > - > - if (min_usage && siblings_min_usage) > - emin = min(emin, parent_emin * min_usage / > - siblings_min_usage); > + if (parent == root) { > + memcg->memory.emin = memcg->memory.min; > + memcg->memory.elow = memcg->memory.low; > + goto out; > } > > - parent_elow = READ_ONCE(parent->memory.elow); > - elow = min(elow, parent_elow); > - if (elow && parent_elow) { > - unsigned long low_usage, siblings_low_usage; > - > - low_usage = min(usage, memcg->memory.low); > - siblings_low_usage = atomic_long_read( > - &parent->memory.children_low_usage); > + memcg->memory.emin = effective_protection(usage, > + memcg->memory.min, READ_ONCE(parent->memory.emin), > + atomic_long_read(&parent->memory.children_min_usage)); > > - if (low_usage && siblings_low_usage) > - elow = min(elow, parent_elow * low_usage / > - siblings_low_usage); > - } > - > -exit: > - memcg->memory.emin = emin; > - memcg->memory.elow = elow; > + memcg->memory.elow = effective_protection(usage, > + memcg->memory.low, READ_ONCE(parent->memory.elow), > + atomic_long_read(&parent->memory.children_low_usage)); > > - if (usage <= emin) > +out: > + if (usage <= memcg->memory.emin) > return MEMCG_PROT_MIN; > - else if (usage <= elow) > + else if (usage <= memcg->memory.elow) > return MEMCG_PROT_LOW; > else > return MEMCG_PROT_NONE; > -- > 2.24.1 > -- Michal Hocko SUSE Labs