On Fri, Oct 20, 2023 at 02:11:40PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@xxxxxxxxxxxx> writes:
>
> > [...snip...]
> > Example 2: A dual-socket system with 1 CXL device per socket
> > ===
> > CPU Nodes: node0, node1
> > CXL Nodes: node2, node3 (on sockets 0 and 1 respectively)
> > [...snip...]
> > This is similar to example #1, but with one difference: A task running
> > on node 0 should not treat nodes 0 and 1 the same, nor nodes 2 and 3.
> > [...snip...]
> > This leaves us with weights of:
> >
> > node0 - 57%
> > node1 - 26%
> > node2 - 12%
> > node3 - 5%
> >
>
> Does the workload run on CPU of node 0 only? This appears unreasonable.

Depends. If a user explicitly launches with `numactl --cpunodebind=0`,
then yes, you can force a task (and all of its children) to run on node0.

If a workload is multi-threaded enough to run on both sockets, then you
are right that you'd want to limit cross-socket traffic by binding
individual threads to nodes that don't cross sockets - if that's
feasible at all (it may not be). But at that point we're getting into
the area of numa-aware software. That's a bit beyond the scope of this
work, which is to enable a coarse-grained interleaving solution that can
easily be accessed with something like `numactl --interleave` or
`numactl --weighted-interleave`.

> If the memory bandwidth requirement of the workload is so large that CXL
> is used to expand bandwidth, why not run workload on CPU of node 1 and
> use the full memory bandwidth of node 1?

Settings are NOT one-size-fits-all. You can certainly come up with
another scenario in which these weights are not optimal.

If we're running enough threads that we need multiple sockets to run
them concurrently, then the memory distribution weights become much more
complex. Without more precise control over task placement, and without
preventing task migration, you can't really get an "optimal" placement.

What I'm really saying is "task placement is a more powerful function
for predicting performance than memory placement". However, user
software would need to implement a pseudo-scheduler and explicit data
placement to be fully optimized, and beyond that there is only so much
we can do from a `numactl` perspective.

tl;dr: We can't get a perfect system here, because finding the best case
for all possible scenarios is probably an undecidable problem. You will
always be able to generate an example wherein the system is not optimal.

> If the workload run on CPU of node 0 and node 1, then the cross-socket
> traffic should be minimized if possible. That is, threads/processes on
> node 0 should interleave memory of node 0 and node 2, while that on node
> 1 should interleave memory of node 1 and node 3.

This can be done with set_mempolicy() using MPOL_INTERLEAVE and a
nodemask set to what you describe (a rough sketch is at the bottom of
this mail). Those tasks also need to prevent themselves from being
migrated, but this can absolutely be done.

In this scenario, the weights need to be recalculated based on the
bandwidth of the nodes in the mempolicy nodemask, which is what I
described in the last email.

~Gregory
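
For concreteness on the recalculation - and assuming the same bandwidth
ratios that produced the 57/26/12/5 split above - a thread bound to
node 0's CPUs and interleaving only across nodes {0,2} would renormalize
over just those two nodes: node0 gets 57/(57+12) ~= 83% of its pages and
node2 gets 12/(57+12) ~= 17%. Likewise a thread bound to node 1 with a
{1,3} nodemask would land at roughly 26/(26+5) ~= 84% and 5/(26+5) ~= 16%.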
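
And here is the rough, untested sketch of the set_mempolicy() approach
mentioned above, for a task meant to stay on socket 0. The node numbers
(0 and 2) just follow the Example 2 topology; I'm using libnuma's
numa_run_on_node() for the pinning piece purely for brevity
(sched_setaffinity() would do the same job), and plain MPOL_INTERLEAVE
spreads pages evenly across the nodemask rather than by bandwidth:

    /* Sketch only: pin to socket 0 and interleave across node0 + node2.
     * Build with: gcc sketch.c -lnuma
     */
    #include <numa.h>       /* numa_available(), numa_run_on_node() */
    #include <numaif.h>     /* set_mempolicy(), MPOL_INTERLEAVE */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
            /* Bits 0 and 2 set: interleave across nodes 0 and 2 only. */
            unsigned long nodemask = (1UL << 0) | (1UL << 2);

            if (numa_available() < 0) {
                    fprintf(stderr, "no NUMA support\n");
                    return 1;
            }

            /*
             * Keep this task (and its children) on the CPUs of node 0 so
             * migration to socket 1 doesn't turn the node0/node2
             * interleave into cross-socket traffic.
             */
            if (numa_run_on_node(0) != 0) {
                    perror("numa_run_on_node");
                    return 1;
            }

            /* Interleave all future allocations across the nodemask. */
            if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
                              8 * sizeof(nodemask)) != 0) {
                    perror("set_mempolicy");
                    return 1;
            }

            /* Pages now fault in round-robin across node0 and node2. */
            size_t sz = 64UL << 20;
            char *buf = malloc(sz);
            for (size_t i = 0; i < sz; i += 4096)
                    buf[i] = 1;

            free(buf);
            return 0;
    }

The coarse-grained equivalent is just something like
`numactl --cpunodebind=0 --interleave=0,2 ./workload` (with ./workload
standing in for the real binary), with the obvious caveat that plain
--interleave weights node0 and node2 equally rather than by bandwidth -
which is exactly the gap weighted interleave is meant to fill.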