Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control

Gregory Price <gregory.price@xxxxxxxxxxxx> · Sun, 3 Dec 2023 22:33:08 -0500

On Wed, Nov 15, 2023 at 01:56:53PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@xxxxxxxxxxxx> writes:
> 
> Because we usually have multiple nodes in one mem-tier, I still think
> mem-tier-based interface is simpler than node-based.  But, it seems more
> complex to introduce mem-tier into mempolicy.  Especially if we have
> per-task weights.  So, I am fine to go with node-based interface.
> 
> > * cgroups: "this doesn't involve dynamic resource accounting /
> >             enforcement at all" and "these aren't resource
> > 	    allocations, it's unclear what the hierarchical
> > 	    relationship mean".
> >
> > * node: too global, explore smaller scope first then expand.
> 
> Why is it too global?  I understand that it doesn't cover all possible
> use cases (although I don't know whether these use cases are practical
> or not).  But it can provide a reasonable default per-node weight based
> on available node performance information (such as, HMAT, CDAT, etc.).
> And, quite some workloads can just use it.  I think this is an useful
> feature.
>

Have been sharing notes with more folks.  Michal thinks a global set of
weights is unintuitive and not useful, and would prefer to see the
per-task weights first.

That objection, though, may have been a response to adding the weights
as an attribute of the nodes directly.

Another proposal suggested adding a new sysfs setting instead:
https://github.com/skhynix/linux/commit/61d2fcc7a880185df186fa2544edcd2f8785952a

  $ tree /sys/kernel/mm/interleave_weight/
  /sys/kernel/mm/interleave_weight/
  ├── enabled [1]
  ├── possible [2]
  └── node
      ├── node0
      │   └── interleave_weight [3]
      └── node1
          └── interleave_weight [3]

(this could be changed to /sys/kernel/mm/mempolicy/...)

I think the internal representation of this can be greatly simplified
compared to what the patch provides now, but maybe this solves the "it
doesn't belong in these other components" issue.

One answer: simply leave it as a static global kobject in mempolicy,
which also deals with many of the race-condition issues.

If a user provides weights, use those.  If they do not, use globals.
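
A minimal sketch of that fallback, written as a Python model rather than
kernel code.  The names GLOBAL_WEIGHTS and effective_weight are
hypothetical, and both the "None means no task weights" representation
and the default of 1 for nodes the task did not list are assumptions:

```python
# Hypothetical global per-node weights, indexed by node id.
GLOBAL_WEIGHTS = [3, 2, 1, 1, 2, 3]


def effective_weight(node, task_weights=None):
    """Return the interleave weight to use for `node`.

    task_weights is a dict of node id -> weight supplied by the task,
    or None if the task never supplied any (assumed representation).
    """
    if task_weights is not None:
        # Task-supplied weights win.  Nodes the task did not list fall
        # back to a weight of 1 (an assumption for this sketch).
        return task_weights.get(node, 1)
    # No task weights at all: use the global table.
    return GLOBAL_WEIGHTS[node]
```

With task weights {0: 5}, node 0 resolves to 5 and node 3 to 1; with no
task weights, node 0 resolves to the global value 3.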

On a cpuset rebind event (container migration, mems_allowed changes),
manually set weights would have to remain as they are, so in a bad case
the weights would be badly out of line with the real distribution of
memory.

Example: if your nodemask is (0,1,2) and a migration changes it to
(3,4,5), then unfortunately your weights will likely revert to [1,1,1].

If the weights come from the global settings, they could adjust
automatically.  It would not be perfect, but it would be better than the
potential worst case above.  If that same migration occurs, the next
allocation would simply use whatever the target node weights are in the
global config.

So if globally you have weights [3,2,1,1,2,3], and you move from
nodemask (0,1,2) to (3,4,5), your weights change from [3,2,1] to
[1,2,3].  If the structure is built as a matrix of (cpu_node,mem_nodes),
then you can also optimize based on the node the task is running on.

That feels very intuitive, deals with many race condition issues, and
the global setting can actually be implemented without the need for
set_mempolicy2 at all - which is certainly a bonus.
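
The nodemask example above can be sketched as a simple lookup, again as
a Python model rather than kernel code.  The function names and the
matrix values are made up for illustration; only the flat weight list
[3,2,1,1,2,3] comes from the example:

```python
# Global weights indexed by node id (values from the example above).
GLOBAL_WEIGHTS = [3, 2, 1, 1, 2, 3]


def weights_for(nodemask, table=GLOBAL_WEIGHTS):
    """Look up the weight of each node in `nodemask` at allocation time."""
    return [table[node] for node in nodemask]


# Matrix variant: one row per CPU node, one column per memory node.
# These values are hypothetical: each CPU node favors its local memory.
WEIGHT_MATRIX = [
    [3, 1],  # weights seen by a task running on node 0
    [1, 3],  # weights seen by a task running on node 1
]


def weights_from(cpu_node, nodemask, matrix=WEIGHT_MATRIX):
    """Look up weights for `nodemask` as seen from `cpu_node`."""
    return [matrix[cpu_node][node] for node in nodemask]
```

Because nothing is cached per task, weights_for((0, 1, 2)) and
weights_for((3, 4, 5)) give the pre- and post-migration weights without
any fixup step on rebind.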

Would love more thoughts here.  Will have a new RFC with set_mempolicy2,
mbind2, and MPOL_WEIGHTED_INTERLEAVE soon that demonstrates the above.

Regards
~Gregory