Re: [External Mail] [RFC PATCH v2] Weighted interleave auto-tuning

Gregory Price <gourry@xxxxxxxxxx> · Sun, 22 Dec 2024 12:03:22 -0500

On Sun, Dec 22, 2024 at 03:21:32PM +0800, Huang, Ying wrote:
> Hyeonggon Yoo <hyeonggon.yoo@xxxxxx> writes:
> 
> > On this server, ideally weighted interleaving should be configured
> > within a socket (e.g. local NUMA node + local CXL node) because
> > weighted interleaving does not consider the bandwidth when accessed
> > from a remote socket.
> 
> If multiple sockets are considered, what is the best behavior?
> 
> The process may be cross-socket too.  So, we will need to use
> set_mempolicy() to bind tasks to sockets firstly.  Then, it may be
> better to use per-task weights.
>

If we want to revisit this, we might be able to make task-local weights
work without a new syscall, but the use case was not clear enough, which
is why it was soft-NAK'd originally.

VMA-local weights are arguably more usable, but they require the task to
be NUMA-aware and probably require a new mempolicy syscall, because
mbind() has no remaining arguments.
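
For anyone skimming the thread, here is a toy sketch of what a set of
weights means for page placement. It is a userspace model, not the
kernel implementation: the function name and the 3:1 weights are
hypothetical, and it assumes a plain weighted round-robin over nodes.

```python
def weighted_interleave(weights, npages):
    """Return the node chosen for each of npages allocations.

    weights: dict of node id -> integer weight (>= 1). Hypothetical
    stand-in for a global, per-task, or per-VMA weight table.
    """
    if not weights or any(w < 1 for w in weights.values()):
        raise ValueError("every node needs an integer weight >= 1")

    placement = []
    nodes = sorted(weights)
    while len(placement) < npages:
        # One round: each node receives `weight` consecutive pages.
        for node in nodes:
            for _ in range(weights[node]):
                if len(placement) == npages:
                    return placement
                placement.append(node)
    return placement


# Hypothetical example: node 0 = DRAM (weight 3), node 1 = CXL (weight 1).
pages = weighted_interleave({0: 3, 1: 1}, 8)
print(pages)
```

With a single global table every task gets the same ratio regardless of
which socket it runs on. Task-local or VMA-local weights would amount
to passing a different table per task or per mapping.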

Recall my original test results from STREAM:
https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@xxxxxxxxxxxx/

Stream Benchmark (vs DRAM, 1 Socket + 1 CXL Device)
Default interleave : -78% (slower than DRAM)
Global weighting   : -6% to +4% (workload dependent)
Targeted weights   : +2.5% to +4% (consistently better than DRAM)

Just some context
~Gregory