Gregory Price <gregory.price@xxxxxxxxxxxx> writes:

> On Fri, Oct 20, 2023 at 02:11:40PM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@xxxxxxxxxxxx> writes:
>>
> [...snip...]
>
>> > Example 2: A dual-socket system with 1 CXL device per socket
>> > ===
>> > CPU Nodes: node0, node1
>> > CXL Nodes: node2, node3 (on sockets 0 and 1 respectively)
>
> [...snip...]
>
>> > This is similar to example #1, but with one difference: A task running
>> > on node 0 should not treat nodes 0 and 1 the same, nor nodes 2 and 3.
>
> [...snip...]
>
>> > This leaves us with weights of:
>> >
>> > node0 - 57%
>> > node1 - 26%
>> > node2 - 12%
>> > node3 - 5%
>>
>> Does the workload run on the CPUs of node 0 only?  This appears
>> unreasonable.
>
> Depends.  If a user explicitly launches with `numactl --cpunodebind=0`
> then yes, you can force a task (and all its children) to run on node0.

IIUC, in your example, the `numactl` command line will be

  numactl --cpunodebind=0 --weighted-interleave=0,1,2,3

That is, the CPU is restricted to node 0, while memory is distributed to
all nodes.  This doesn't sound reasonable to me.

> If a workload is multi-threaded enough to run on both sockets, then you
> are right that you'd want to basically limit cross-socket traffic by
> binding individual threads to nodes that don't cross sockets - if at all
> feasible (this may not be feasible).
>
> But at that point, we're getting into the area of numa-aware software.
> That's a bit beyond the scope of this - which is to enable a coarse
> grained interleaving solution that can easily be accessed with something
> like `numactl --interleave` or `numactl --weighted-interleave`.
>
>> If the memory bandwidth requirement of the workload is so large that
>> CXL is used to expand bandwidth, why not run the workload on the CPUs
>> of node 1 and use the full memory bandwidth of node 1?
>
> Settings are NOT one size fits all.  You can certainly come up with
> another scenario in which these weights are not optimal.
>
> If we're running enough threads that we need multiple sockets to run
> them concurrently, then the memory distribution weights become much
> more complex.  Without more precise control over task placement and
> preventing task migration, you can't really get an "optimal" placement.
>
> What I'm really saying is "Task placement is a more powerful function
> for predicting performance than memory placement".  However, user
> software would need to implement a pseudo-scheduler and explicit data
> placement to be the most optimized.  Beyond this, there is only so much
> we can do from a `numactl` perspective.
>
> tl;dr: We can't get a perfect system here, because getting a best case
> for all possible scenarios is probably an undecidable problem.  You
> will always be able to generate an example wherein the system is not
> optimal.
>
>> If the workload runs on the CPUs of node 0 and node 1, then the
>> cross-socket traffic should be minimized if possible.  That is,
>> threads/processes on node 0 should interleave memory of node 0 and
>> node 2, while those on node 1 should interleave memory of node 1 and
>> node 3.
>
> This can be done with set_mempolicy() with MPOL_INTERLEAVE, setting the
> nodemask to what you describe.  Those tasks also need to prevent
> themselves from being migrated.  But this can absolutely be done.
>
> In this scenario, the weights need to be re-calculated based on the
> bandwidth of the nodes in the mempolicy nodemask, which is what I
> described in the last email.
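For concreteness, the per-socket scheme described above can already be
expressed with the existing interface.  Below is a minimal sketch (my
own illustration, not from the patches), assuming libnuma is installed
and using the node numbers of example 2: a task restricted to the CPUs
of node 0 that interleaves its allocations across nodes 0 and 2.  Build
with `gcc demo.c -lnuma`.

/*
 * Illustrative sketch only: pin a task to the CPUs of node 0 and
 * interleave its memory across nodes 0 (local DRAM) and 2 (local CXL)
 * with the existing MPOL_INTERLEAVE.  Node numbers follow example 2
 * above and are hypothetical.
 */
#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	if (numa_available() < 0) {
		fprintf(stderr, "NUMA is not available on this system\n");
		return 1;
	}

	struct bitmask *cpu_nodes = numa_parse_nodestring("0");
	struct bitmask *mem_nodes = numa_parse_nodestring("0,2");
	if (!cpu_nodes || !mem_nodes) {
		fprintf(stderr, "failed to parse node strings\n");
		return 1;
	}

	/* Keep the task on socket 0 so it is not migrated away from
	 * the memory it interleaves over. */
	if (numa_run_on_node_mask(cpu_nodes) < 0) {
		perror("numa_run_on_node_mask");
		return 1;
	}

	/* Plain 1:1 interleave across nodes 0 and 2; the weighted
	 * interleave discussed in this thread would change the ratio. */
	if (set_mempolicy(MPOL_INTERLEAVE, mem_nodes->maskp,
			  mem_nodes->size + 1) < 0) {
		perror("set_mempolicy");
		return 1;
	}

	/* Fault in 64 MiB; pages are distributed across nodes 0 and 2. */
	size_t len = 64UL << 20;
	char *buf = malloc(len);
	if (!buf)
		return 1;
	memset(buf, 0, len);

	numa_bitmask_free(mem_nodes);
	numa_bitmask_free(cpu_nodes);
	free(buf);
	return 0;
}

A weighted interleave would only change the distribution ratio here;
the binding and migration concerns are the same.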
IMHO, we should keep things as simple as possible, and only add
complexity if necessary.

--
Best Regards,
Huang, Ying