Re: [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving

On Mon, Oct 16, 2023 at 03:57:52PM +0800, Huang, Ying wrote:
> Gregory Price <gourry.memverge@xxxxxxxxx> writes:
> 
> > == Mutex to Semaphore change:
> >
> > Since it is expected that many threads will be accessing this data
> > during allocations, a mutex is not appropriate.
> 
> IIUC, this is a change for performance.  If so, please show some
> performance data.
>

This change will be dropped in v3 in favor of the existing
RCU mechanism in memory-tiers.c, as pointed out by Matthew.
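
For reference, the idea is that allocation-path readers can look the
weights up under rcu_read_lock() without contending with writers.  A
rough sketch of what such a lookup might look like (node_tier_map and
the interleave_weight field are placeholder names for illustration,
not the actual memory-tiers.c structures):

/*
 * Sketch only: RCU read-side lookup of a per-source-node interleave
 * weight.  node_tier_map[] and interleave_weight[] are hypothetical.
 */
static unsigned int get_interleave_weight(int src_nid, int dst_nid)
{
        struct memory_tier *tier;
        unsigned int weight = 1;        /* 1 == plain round-robin */

        rcu_read_lock();
        tier = rcu_dereference(node_tier_map[dst_nid]);
        if (tier)
                weight = READ_ONCE(tier->interleave_weight[src_nid]);
        rcu_read_unlock();

        return weight;
}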

> > == Source-node relative weighting:
> >
> > 1. Set weights for DDR (tier4) and CXL (tier22) tiers.
> >    echo source_node:weight > /path/to/interleave_weight
> 
> If source_node is considered, why not consider target_node too?  On a
> system with only 1 tier (DRAM), do you want weighted interleaving among
> NUMA nodes?  If so, why tie weighted interleaving with memory tiers?
> Why not just introduce weighted interleaving for NUMA nodes?
>

The short answer: practicality and ease of use.

The long answer: we have been discussing how to make this more flexible.

Personally, I agree with you.  If Task A is on Socket 0, the weight on
Socket 0 DRAM should not be the same as the weight on Socket 1 DRAM.
However, right now, DRAM nodes are lumped into the same tier together,
resulting in them having the same weight.

If you scroll back through the list, you'll find an RFC I posted for
set_mempolicy2 which implements weighted interleave in mm/mempolicy.
However, mm/mempolicy is extremely `current-centric` at the moment,
so that makes changing weights at runtime (in response to a hotplug
event, for example) very difficult.

I still think there is room to extend set_mempolicy to allow
task-defined weights to take precedence over tier-defined weights.
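
Something like the sketch below, where all of the names are
hypothetical and set_mempolicy2 is still only at the RFC stage:

/*
 * Hypothetical precedence: a weight supplied by the task (via a
 * set_mempolicy2-style interface) wins over the tier default.
 * pol->il_weights is a made-up field for illustration.
 */
static unsigned int effective_weight(struct mempolicy *pol,
                                     int src_nid, int dst_nid)
{
        if (pol && pol->il_weights && pol->il_weights[dst_nid])
                return pol->il_weights[dst_nid];

        return get_interleave_weight(src_nid, dst_nid);
}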

We have discussed adding the following features to memory-tiers:

1) breaking up tiers to allow 1 tier per node, as opposed to defaulting
   to lumping all nodes of a similar quality into the same tier

2) enabling movement of nodes between tiers (for the purpose of
   reconfiguring due to hotplug and other situations)

For users that require fine-grained control over each individual node,
this would allow weights to be applied per-node, because each node
would be its own tier.  For the majority of use cases, it would allow
clumping nodes into tiers based on physical topology and performance
class, and then applying the general tier weighting.  This seems like
the most common use case, and also the easiest to set up in the short
term.

That said, there are probably 3 or 4 different ways/places to implement
this feature.  The question is: which one is the clear and obvious way?
I don't have a definitive answer for that, hence the RFC.

There are at least 5 proposals that I know of at the moment:

1) mempolicy
2) memory-tiers
3) memory-block interleaving? (weighting among blocks inside a node)
   Maybe relevant if Dynamic Capacity devices arrive, but it seems
   like the wrong place to do this.
4) multi-device nodes (e.g. cxl create-region ... mem0 mem1...)
5) "just do it in hardware"

> > # Set tier4 weight from node 0 to 85
> > echo 0:85 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
> > # Set tier4 weight from node 1 to 65
> > echo 1:65 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
> > # Set tier22 weight from node 0 to 15
> > echo 0:15 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight
> > # Set tier22 weight from node 1 to 10
> > echo 1:10 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight
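
As a quick illustration of what those numbers mean: a task on node 0
would have its interleaved pages split roughly 85:15 between tier4
(DRAM) and tier22 (CXL), and roughly 65:10 (~87:13) when running on
node 1.  A toy selection function, with made-up names, might look
like:

/*
 * Toy illustration only -- not the patch's implementation.  With the
 * weights above, a task on node 0 sends roughly 85 of every 100
 * interleaved pages to tier4 and 15 to tier22.  TIER4/TIER22 are
 * made-up identifiers for this example.
 */
static int pick_tier_node0(unsigned long nr_interleaved)
{
        return (nr_interleaved % 100 < 85) ? TIER4 : TIER22;
}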
> 
> --
> Best Regards,
> Huang, Ying



