Re: [External Mail] [RFC PATCH v2] Weighted interleave auto-tuning

Gregory Price <gourry@xxxxxxxxxx> · Fri, 20 Dec 2024 09:13:50 -0500

On Fri, Dec 20, 2024 at 05:25:28PM +0900, Hyeonggon Yoo wrote:
> On 2024-12-20 4:18 AM, Joshua Hahn wrote:
... snip ...
> 
> By the way, this might be out of scope, but let me ask for my own
> learning.
> 
> We have a server with 2 sockets, each attached with local DRAM and CXL
> memory (and thus 4 NUMA nodes). When accessing remote socket's memory
> (either CXL or not), the bandwidth is limited by the interconnect's
> bandwidth.
> 
> On this server, ideally weighted interleaving should be configured
> within a socket (e.g. local NUMA node + local CXL node) because
> weighted interleaving does not consider the bandwidth when accessed
> from a remote socket.
> 
> So, the question is: On systems with multiple sockets (and CXL mem
> attached to each socket), do you always assume the admin must bind to
> a specific socket for optimal performance or is there any plan to
> mitigate this problem without binding tasks to a socket?
>

There was a long discussion about this when initially implementing the
weighted interleave mechanism.

The answer is basically that interleave/weighted-interleave is
suboptimal for this scenario for a few reasons.

1) The "effective bandwidth" of a given node is relative *to the task*

   Imagine:
          A----B
          |    |
          C    D

   Task 1 running on A sees a different effective bandwidth to D than
   Task 2 running on B does.  There's no good way for us to capture
   this information in global weights because...

2) We initially explored implementing a matrix of weights (cpu-relative).
   This had little support - so it was simplified to a single array.

3) We also explored task-local weights to allow capturing this info. 
   This required new syscalls, and likewise had little support.

4) It's unclear how we can actually acquire cross-connect bandwidth
   information anyway, and it's further unclear how this would be used
   in an automated fashion to do "something reasonable" for the user.

5) The actual use cases for weighted-interleave on multi-socket systems
   were questionable due to the above - so we more or less discarded the
   idea as untenable at best (or at least in need of much more thought).
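
To make points 1 and 2 concrete, here is a rough Python sketch of the
A/B/C/D picture above. Every number in it is made up, the helper names
are hypothetical, and the "cap remote access at the link speed" model is
a simplification (in practice all remote traffic shares the link). It
only shows that the per-socket rows differ, which is what a single
global array cannot express.

```python
# Hypothetical bandwidths in GB/s, chosen only for illustration.
LOCAL_BW = {"A": 300, "B": 300, "C": 60, "D": 60}

# Which socket each memory node hangs off: C is attached to A, D to B.
SOCKET_OF = {"A": "A", "B": "B", "C": "A", "D": "B"}

# Assumed bandwidth of the A <-> B interconnect.
INTERCONNECT_BW = 100


def effective_bw(cpu_socket, node):
    """Bandwidth a task on cpu_socket sees to node.

    Remote accesses are capped by the interconnect (simplified model).
    """
    bw = LOCAL_BW[node]
    if SOCKET_OF[node] != cpu_socket:
        bw = min(bw, INTERCONNECT_BW)
    return bw


def row(cpu_socket):
    """One row of the cpu-relative matrix: bandwidth to every node."""
    return {node: effective_bw(cpu_socket, node) for node in sorted(LOCAL_BW)}


# The cpu-relative "matrix" from point 2, one row per socket.
matrix = {socket: row(socket) for socket in ("A", "B")}

for socket, bandwidths in matrix.items():
    print(socket, bandwidths)
```

The row for A and the row for B disagree on the A and B columns, so no
single array of weights is right for tasks on both sockets.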

So in short, yes, if the admin wants to make good use of (weighted)
interleave, they should bind to one socket and its attached CXL memory
only - otherwise the hidden chokepoint of the cross-socket interconnect
may bite them.

For now the best we can do is create global-relative weights, which
mathematically reduce according to bandwidth within a nodemask if the
task binds itself to a single socket.
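
As a rough illustration of that reduction, here is a Python sketch. The
bandwidth figures and node numbering are invented, and dividing by the
GCD is just a stand-in for whatever reduction the kernel ends up using;
the function names are hypothetical as well. The point is only that
restricting global weights to one socket's nodemask leaves a ratio that
matches the bandwidth ratio within that socket.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

# Hypothetical per-node bandwidths in GB/s (illustrative only).
# Nodes 0 and 1 are DRAM on sockets 0 and 1; nodes 2 and 3 are their CXL memory.
BANDWIDTH = {0: 300, 1: 300, 2: 60, 3: 60}


def global_weights(bandwidth):
    """Reduce bandwidths to small integer weights by dividing out the GCD.

    This is an assumed reduction, not necessarily what the kernel does.
    """
    divisor = reduce(gcd, bandwidth.values())
    return {node: bw // divisor for node, bw in bandwidth.items()}


def page_shares(weights, nodemask):
    """Fraction of pages each node receives when interleaving over nodemask."""
    total = sum(weights[node] for node in nodemask)
    return {node: Fraction(weights[node], total) for node in sorted(nodemask)}


weights = global_weights(BANDWIDTH)

# A task bound to socket 0 interleaves over its DRAM and CXL nodes only.
socket0 = page_shares(weights, {0, 2})

print(weights)
print(socket0)
```

With these numbers the task bound to nodes 0 and 2 places pages 5:1,
the same as the 300:60 bandwidth ratio. Interleaving over all four
nodes would use the same weights and ignore the interconnect entirely.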

~Gregory