On Tue, Nov 14, 2023 at 06:01:13PM +0100, Michal Hocko wrote:
> On Tue 14-11-23 10:50:51, Gregory Price wrote:
> > On Tue, Nov 14, 2023 at 10:43:13AM +0100, Michal Hocko wrote:
> [...]
> > > That being said, I still believe that a cgroup based interface is a much
> > > better choice over a global one. Cpusets seem to be a good fit as the
> > > controller does control memory placement wrt NUMA interfaces.
> > 
> > I think cpusets is a non-starter due to the global spinlock required when
> > reading information from it:
> > 
> > https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L391
> 
> Right, our current cpuset implementation indeed requires callback lock
> from the page allocator. But that is an implementation detail. I do not
> remember bug reports about the lock being a bottleneck though. If
> anything, cpusets lock optimizations would be a win also for users who do
> not want to use the weighted interleave interface.

Definitely agree, but that's a rather large increase of scope :[

We could consider a push-model similar to how cpuset nodemasks are pushed
down to mempolicies, rather than a pull-model of having mempolicy read
directly from cpusets, at least until the cpusets lock optimization is
undertaken.

This pattern looks like a wart to me, which is why I avoided it, but the
locking implications of the pull-model make me sad.

I would like to point out that Tejun pushed back on implementing weights
in cgroups (regardless of subcomponent), so I think we need to come to a
consensus on where this data should live in a "more global" context
(cpusets, memcg, nodes, etc.) before I go mucking around further.

So far we have:

* mempolicy: updating weights is a very complicated undertaking, and
  there is no (good) way to do this from outside the task. It would be
  better to have a coarser-grained control. A new syscall is likely
  needed to add/set weights in the per-task mempolicy, or we bite the
  bullet on set_mempolicy2 and make the syscall extensible for the
  future.
* memtiers: tier=node when devices are already interleaved or when all
  devices are different, so why add yet another layer of complexity if
  other constructs already exist? Additionally, you lose task-placement
  relative weighting (or it becomes very complex to implement).

* cgroups: "this doesn't involve dynamic resource accounting /
  enforcement at all" and "these aren't resource allocations, it's
  unclear what the hierarchical relationship mean".

* node: too global; explore a smaller scope first, then expand.

For now I think there is consensus that mempolicy should have weights
per-task regardless of how the more-global mechanism is defined, so I'll
go ahead and put up another RFC with some options on that in the next
week or so.

The limitation of the first pass will be that only the task itself is
capable of re-weighting should cpusets.mems or the nodemask change.

~Gregory