On Fri, Nov 10, 2023 at 10:05:57AM +0100, Michal Hocko wrote:
> On Thu 09-11-23 11:34:01, Gregory Price wrote:
> [...]
> > Anyway, summarizing: After a bit of reading, this does seem to map
> > better to the "accounting consumption" subsystem than the "constrain"
> > subsystem. However, if you think it's better suited for cpuset, I'm
> > happy to push in that direction.
>
> Maybe others see it differently but I stick with my previous position.
> Memcg is not a great fit for reasons already mentioned - most notably
> that the controller doesn't control the allocation but accounting what
> has been already allocated. Cpusets on the other hand constrains the
> allocations and that is exactly what you want to achieve.
> --
> Michal Hocko
> SUSE Labs

Digging in a bit, placing it in cpusets has locking requirements that
concern me.  Maybe I'm being a bit over-cautious, so if none of this
matters, I'll go ahead and swap the code over to cpusets.  Otherwise,
this is just more food for thought on cpusets vs memcg.

cpuset.c states that acquiring read-only access to cpuset structures
requires taking the (global) callback lock:

https://github.com/torvalds/linux/blob/master/kernel/cgroup/cpuset.c#L391

 * There are two global locks guarding cpuset structures - cpuset_mutex and
 * callback_lock. We also require taking task_lock() when dereferencing a
 * task's cpuset pointer. See "The task_lock() exception", at the end of this
 * comment.

Examples:

cpuset_node_allowed:
https://github.com/torvalds/linux/blob/master/kernel/cgroup/cpuset.c#L4780

        spin_lock_irqsave(&callback_lock, flags);
        rcu_read_lock();
        cs = nearest_hardwall_ancestor(task_cs(current));  <-- walks parents
        allowed = node_isset(node, cs->mems_allowed);
        rcu_read_unlock();
        spin_unlock_irqrestore(&callback_lock, flags);

cpuset_mems_allowed:
https://github.com/torvalds/linux/blob/master/kernel/cgroup/cpuset.c#L4679

        spin_lock_irqsave(&callback_lock, flags);
        rcu_read_lock();
        guarantee_online_mems(task_cs(tsk), &mask);  <-- walks parents
        rcu_read_unlock();
        spin_unlock_irqrestore(&callback_lock, flags);

It seems apparent that any form of parent walk in cpusets requires
acquiring &callback_lock.  This does not appear to be true of memcg.
Implementing an inheritance structure like the one described in this
patch set would therefore acquire the callback lock during node
selection.  So if we want this in cpuset, we're going to eat that lock
acquisition despite not really needing it.

I was not intending to do checks against cpusets.mems_allowed when
acquiring weights, as this is already enforced between cpusets and
mempolicy on hotplug and mask changes, as well as in the allocators
via read_mems_allowed_begin/retry.  This is why I said this was *not*
a constraining feature.

Additionally, if the node selected by mpol is exhausted, the allocator
will simply acquire memory from another (allowed) node, disregarding
the weights entirely - which is the correct / expected behavior.
Another example of "this is more of a suggestion" rather than a
constraint.

So I'm contending here that putting it in cpusets is overkill.  But if
it likewise doesn't fit in memcg, is it insane to suggest that maybe
we should consider adding cgroup.mpol, and consider migrating features
from mempolicy.c into cgroups (while keeping mpol the way it is)?

~Gregory
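
P.S. To make the locking concern concrete, here is a rough sketch of
what a cpuset-side weight lookup might look like if it followed the
same read-side pattern as cpuset_node_allowed() above.  This is not
code from the patch set: cpuset_node_weight(), cs->weighted_nodes and
cs->node_weights are made-up names purely for illustration; only the
locking pattern is taken from the existing helpers.

        /*
         * Hypothetical sketch: weight lookup with a parent walk, using
         * the same read-side locking as cpuset_node_allowed().  The
         * helper and the weight fields do not exist today.
         */
        static unsigned char cpuset_node_weight(int node)
        {
                struct cpuset *cs;
                unsigned char weight = 1;
                unsigned long flags;

                /* global lock, taken on every node selection */
                spin_lock_irqsave(&callback_lock, flags);
                rcu_read_lock();

                /* walk parents until one defines a weight for this node */
                for (cs = task_cs(current); cs; cs = parent_cs(cs)) {
                        if (node_isset(node, cs->weighted_nodes)) {
                                weight = cs->node_weights[node];
                                break;
                        }
                }

                rcu_read_unlock();
                spin_unlock_irqrestore(&callback_lock, flags);
                return weight;
        }

The point being that the spin_lock_irqsave() on the global
callback_lock would land on the allocation path for every weighted
node selection, whereas an RCU-only hierarchy walk (as memcg does for
its read side) would not need it.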