Re: [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving

"Huang, Ying" <ying.huang@xxxxxxxxx> · Thu, 19 Oct 2023 14:28:42 +0800

Gregory Price <gregory.price@xxxxxxxxxxxx> writes:

> On Wed, Oct 18, 2023 at 04:29:02PM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@xxxxxxxxxxxx> writes:
>> 
>> > There are at least 5 proposals that I know of at the moment
>> >
>> > 1) mempolicy
>> > 2) memory-tiers
>> > 3) memory-block interleaving? (weighting among blocks inside a node)
>> >    Maybe relevant if Dynamic Capacity devices arrive, but it seems
>> >    like the wrong place to do this.
>> > 4) multi-device nodes (e.g. cxl create-region ... mem0 mem1...)
>> > 5) "just do it in hardware"
>> 
>> It may be easier to start with the use case.  What are the practical use
>> cases you have in mind that cannot be satisfied with a simple
>> per-memory-tier weight?  Can you compare the memory layouts under the
>> different proposals?
>>
>
> Before I delve in, one clarifying question:  When you asked whether
> weights should be part of node or memory-tiers, I took that to mean
> whether it should be part of mempolicy or memory-tiers.
>
> Were you suggesting that weights should actually be part of
> drivers/base/node.c?

Yes.  drivers/base/node.c vs. memory tiers.

> Because I had not considered that, and this seems reasonable, easy to
> implement, and would not require tying mempolicy.c to memory-tiers.c
>
>
>
> Beyond this, I think there have been 3 imagined use cases (now,
> including this one).
>
> a)
> numactl --weighted-interleave=Node:weight,0:16,1:4,...
>
> b)
> echo weight > /sys/.../memory-tiers/memtier/access0/interleave_weight
> numactl --interleave=0,1
>
> c)
> echo weight > /sys/bus/node/node0/access0/interleave_weight
> numactl --interleave=0,1
>
> d)
> options b or c, but with --weighted-interleave=0,1 instead
> this requires libnuma changes to pick up, but it retains --interleave
> as-is to avoid user confusion.
>
> The downside of an approach like A (which was my original approach) is
> that the weights cannot really change should a node be hotplugged. Tasks
> would need to detect this and change the policy themselves.  That's not
> a good solution.
>
> However, in both the B and C designs, weights can be rebalanced in
> response to any number of events.  Ultimately B and C are equivalent,
> but the placement in nodes is cleaner and more intuitive.  If
> memory-tiers wants to use/change this information, there's nothing that
> prevents it.
>
> Assuming this is your meaning, I agree and I will pivot to this.

Can you give a more concrete example?  For example, on a system with
nodes 0, 1, 2, 3 and memory tiers 4 (0, 1), 22 (2, 3), ...., a workload
runs on the CPUs of node 0, ...., and interleaves memory on nodes 0, 1,
...  Then compare the behavior (including memory bandwidth) of the
node-based and memory-tier-based solutions.
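
To make the comparison concrete, here is a rough sketch (not kernel code)
of how page placement could differ between per-tier and per-node weights
on the layout above.  It assumes a simple round-robin scheme in which each
node receives "weight" consecutive pages per cycle.  The weight values and
all function names are hypothetical, chosen only for illustration; nothing
here is measured data or the patch set's actual implementation.

```python
from itertools import islice


def weighted_interleave(weights):
    """Yield node ids forever: weights[n] consecutive pages on node n,
    then move on to the next node (assumed round-robin scheme)."""
    if not any(w > 0 for w in weights.values()):
        raise ValueError("at least one node needs a positive weight")
    nodes = sorted(weights)
    while True:
        for node in nodes:
            for _ in range(weights[node]):
                yield node


def page_counts(weights, npages):
    """Count how many of the first npages allocations land on each node."""
    counts = {node: 0 for node in weights}
    for node in islice(weighted_interleave(weights), npages):
        counts[node] += 1
    return counts


def expand_tier_weights(tiers, tier_weights):
    """Per-tier weights: every node inherits the weight of its tier."""
    return {node: tier_weights[tier]
            for tier, nodes in tiers.items()
            for node in nodes}


# Layout from the example: tier 4 holds nodes 0, 1; tier 22 holds nodes 2, 3.
tiers = {4: [0, 1], 22: [2, 3]}

# Hypothetical weights, for illustration only.
tier_weights = {4: 4, 22: 1}             # one weight per memory tier
node_weights = {0: 8, 1: 4, 2: 2, 3: 1}  # one weight per node

per_tier = page_counts(expand_tier_weights(tiers, tier_weights), 1500)
per_node = page_counts(node_weights, 1500)

print("per-tier weights:", per_tier)
print("per-node weights:", per_node)
```

With per-tier weights, nodes in the same tier necessarily get the same
share of pages; per-node weights can additionally distinguish, for
example, a local node from a remote node in the same tier.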

--
Best Regards,
Huang, Ying