Gregory Price <gregory.price@xxxxxxxxxxxx> writes:

> On Wed, Oct 18, 2023 at 04:29:02PM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@xxxxxxxxxxxx> writes:
>>
>> > There are at least 5 proposals that i know of at the moment
>> >
>> > 1) mempolicy
>> > 2) memory-tiers
>> > 3) memory-block interleaving? (weighting among blocks inside a node)
>> >    Maybe relevant if Dynamic Capacity devices arrive, but it seems
>> >    like the wrong place to do this.
>> > 4) multi-device nodes (e.g. cxl create-region ... mem0 mem1...)
>> > 5) "just do it in hardware"
>>
>> It may be easier to start with the use case. What is the practical use
>> cases in your mind that can not be satisfied with simple per-memory-tier
>> weight? Can you compare the memory layout with different proposals?
>>
>
> Before I delve in, one clarifying question: When you asked whether
> weights should be part of node or memory-tiers, i took that to mean
> whether it should be part of mempolicy or memory-tiers.
>
> Were you suggesting that weights should actually be part of
> drivers/base/node.c?

Yes. drivers/base/node.c vs. memory tiers.

> Because I had not considered that, and this seems reasonable, easy to
> implement, and would not require tying mempolicy.c to memory-tiers.c
>
> Beyond this, i think there's been 3 imagined use cases (now, including
> this).
>
> a)
> numactl --weighted-interleave=Node:weight,0:16,1:4,...
>
> b)
> echo weight > /sys/.../memory-tiers/memtier/access0/interleave_weight
> numactl --interleave=0,1
>
> c)
> echo weight > /sys/bus/node/node0/access0/interleave_weight
> numactl --interleave=0,1
>
> d)
> options b or c, but with --weighted-interleave=0,1 instead
> this requires libnuma changes to pick up, but it retains --interleave
> as-is to avoid user confusion.
>
> The downside of an approach like A (which was my original approach), was
> that the weights cannot really change should a node be hotplugged. Tasks
> would need to detect this and change the policy themselves. That's not
> a good solution.
>
> However in both B and C's design, weights can be rebalanced in response
> to any number of events. Ultimately B and C are equivalent, but
> the placement in nodes is cleaner and more intuitive. If memory-tiers
> wants to use/change this information, there's nothing that prevents it.
>
> Assuming this is your meaning, I agree and I will pivot to this.

Can you give a not-so-abstract example? For example, on a system with
node 0, 1, 2, 3, memory tiers 4 (0, 1), 22 (2, 3), .... A workload runs
on CPU of node 0, ...., interleaves memory on node 0, 1, ... Then
compare the different behavior (including memory bandwidth) with node
and memory-tier based solution.

--
Best Regards,
Huang, Ying
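
To make the node-vs-tier comparison requested above a little more concrete, here is a rough user-space sketch, not kernel code. It takes the layout from the example (tier 4 holding nodes 0 and 1, tier 22 holding nodes 2 and 3) and counts how pages would be spread under per-tier weights versus per-node weights. Every weight value, the helper names (`weighted_interleave`, `page_counts`), and the "weight-many consecutive pages per node per round" policy are assumptions made for illustration only; none of them come from the thread or from an existing kernel interface.

```python
from collections import Counter
from itertools import islice


def weighted_interleave(weights):
    """Yield node ids forever; each node receives `weight` pages per round.

    This round-robin shape is an assumed policy for illustration only.
    """
    if not any(w > 0 for w in weights.values()):
        raise ValueError("at least one node needs a positive weight")
    while True:
        for node, weight in weights.items():
            for _ in range(weight):
                yield node


def page_counts(weights, pages):
    """Count how many of the first `pages` allocations land on each node."""
    return Counter(islice(weighted_interleave(weights), pages))


# Layout from the example: tier 4 holds nodes 0 and 1, tier 22 holds nodes 2 and 3.
tiers = {4: [0, 1], 22: [2, 3]}

# Per-tier weights (hypothetical values): all nodes in a tier share one weight.
tier_weights = {4: 4, 22: 1}
per_tier = {node: tier_weights[tier]
            for tier, nodes in tiers.items()
            for node in nodes}

# Per-node weights (hypothetical values): a task on node 0 can weight the
# remote same-tier node 1 lower than its local node 0.
per_node = {0: 4, 1: 2, 2: 1, 3: 1}

tier_result = page_counts(per_tier, 100)   # 10 full rounds of 10 pages
node_result = page_counts(per_node, 80)    # 10 full rounds of 8 pages

print("per-tier:", dict(tier_result))
print("per-node:", dict(node_result))
```

With these made-up numbers, the per-tier variant cannot tell node 0 from node 1, while the per-node variant can express that difference. Whether that extra resolution matters for real bandwidth is exactly what the requested example would need to show.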