On Fri, Dec 20, 2024 at 05:25:28PM +0900, Hyeonggon Yoo wrote:
> On 2024-12-20 4:18 AM, Joshua Hahn wrote:
... snip ...
>
> By the way, this might be out of scope, but let me ask for my own
> learning.
>
> We have a server with 2 sockets, each attached with local DRAM and CXL
> memory (and thus 4 NUMA nodes). When accessing remote socket's memory
> (either CXL or not), the bandwidth is limited by the interconnect's
> bandwidth.
>
> On this server, ideally weighted interleaving should be configured
> within a socket (e.g. local NUMA node + local CXL node) because
> weighted interleaving does not consider the bandwidth when accessed
> from a remote socket.
>
> So, the question is: On systems with multiple sockets (and CXL mem
> attached to each socket), do you always assume the admin must bind to
> a specific socket for optimal performance or is there any plan to
> mitigate this problem without binding tasks to a socket?
>

There was a long discussion about this when initially implementing the
weighted interleave mechanism.

The answer is basically that interleave/weighted-interleave is
suboptimal for this scenario for a few reasons.

1) The "effective bandwidth" of a given node is relative *to the task*

   Imagine:

      A----B
      |    |
      C    D

   Task 1 on A has a different effective bandwidth from A->D than
   Task 2 running on B.  There's no good way for us to capture this
   information in global weights because...

2) We initially explored implementing a matrix of weights (cpu-relative)
   This had little support - so it was simplified to a single array.

3) We also explored task-local weights to allow capturing this info.
   This required new syscalls, and likewise had little support.

4) It's unclear how we can actually acquire cross-connect bandwidth
   information anyway, and it's further unclear how this would be used
   in an automated fashion to do "something reasonable" for the user.
5) The actual use cases for weighted-interleave on multi-socket systems
   were questionable due to the above - so we more or less discarded
   the idea as untenable at best (or at least in need of much more
   thought).

So in short, yes, if the admin wants to make good use of (weighted)
interleave, they should bind to one socket and its attached CXL memory
only - otherwise the hidden chokepoint of the cross-socket interconnect
may bite them.

For now the best we can do is create global-relative weights, which
mathematically reduce according to bandwidth within a nodemask if
the task binds itself to a single socket.

~Gregory
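
A minimal sketch of that last point, for illustration only. The node
numbering and weight values are hypothetical (nodes 0 and 2 stand for
socket 0's DRAM and CXL memory, nodes 1 and 3 for socket 1's), and the
helper names are made up here; none of this is kernel code. It only
shows the arithmetic: restricting a single global weight array to a
one-socket nodemask leaves the ratio between that socket's nodes intact.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

# Hypothetical global weights, one per NUMA node (not measured values).
# Nodes 0/1 are DRAM on sockets 0/1; nodes 2/3 are CXL on sockets 0/1.
GLOBAL_WEIGHTS = {0: 6, 1: 6, 2: 2, 3: 2}


def select_weights(global_weights, nodemask):
    """Keep only the weights of nodes present in the task's nodemask."""
    selected = {node: global_weights[node] for node in sorted(nodemask)}
    if not selected or any(w <= 0 for w in selected.values()):
        raise ValueError("nodemask must select nodes with positive weights")
    return selected


def reduced_weights(global_weights, nodemask):
    """Reduce the selected weights to their smallest integer ratio."""
    selected = select_weights(global_weights, nodemask)
    divisor = reduce(gcd, selected.values())
    return {node: w // divisor for node, w in selected.items()}


def page_shares(global_weights, nodemask):
    """Fraction of interleaved pages each node in the nodemask receives."""
    selected = select_weights(global_weights, nodemask)
    total = sum(selected.values())
    return {node: Fraction(w, total) for node, w in selected.items()}


if __name__ == "__main__":
    socket0 = {0, 2}  # task bound to socket 0: local DRAM + local CXL
    print(reduced_weights(GLOBAL_WEIGHTS, socket0))  # {0: 3, 2: 1}
    print(page_shares(GLOBAL_WEIGHTS, socket0))
```

With these assumed numbers, a task bound to socket 0 interleaves 3:1
between its DRAM and CXL nodes, and a task bound to socket 1 gets the
same 3:1 split from the same global array. An unbound task would spread
pages over all four nodes, with nothing in the weights accounting for
the cross-socket interconnect.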