Gregory Price <gregory.price@xxxxxxxxxxxx> writes:

[snip]

> Example 1: A single-socket system with multiple CXL memory devices
> ===
> CPU Node:  node0
> CXL Nodes: node1, node2
>
> Bandwidth attributes (in theory):
> node0 - 8 channels - ~307GB/s
> node1 - x16 link   -   64GB/s
> node2 - x8 link    -   32GB/s
>
> In a system like this, the optimal distribution of memory on an
> interleave for maximizing bandwidth is about 76%/16%/8%.
>
> For the sake of simplicity: --weighted-interleave=0:76,1:16,2:8
> but realistically we could make the weights sysfs values in the node.
>
> Regardless of the mechanism used to engage this, the most effective way
> to capture this in the system is by applying weights to nodes, not
> tiers.  If done in tiers, each node would be assigned to its own tier,
> making the mechanism equivalent.  So you might as well simplify the
> whole thing and chop the memtier component out.
>
> Is this configuration realistic?  *shrug* - technically possible.  And
> in fact most hardware- or driver-based interleaving mechanisms would
> not really be able to manage an interleave region across these nodes,
> at least not without placing the x16 link in x8 mode, or just having
> the wrong distribution percentages.
>
>
> Example 2: A dual-socket system with 1 CXL device per socket
> ===
> CPU Nodes: node0, node1
> CXL Nodes: node2, node3 (on sockets 0 and 1 respectively)
>
> Bandwidth attributes (in theory):
> nodes 0 & 1 - 8 channels - ~307GB/s ea.
> nodes 2 & 3 - x16 link   -   64GB/s ea.
>
> This is similar to example #1, but with one difference: a task running
> on node 0 should not treat nodes 0 and 1 the same, nor nodes 2 and 3.
> This is because on access to nodes 1 and 3, the cross-socket link (UPI,
> or whatever AMD calls it) becomes a bandwidth chokepoint.
>
> So from the perspective of node 0, the "real total" available bandwidth
> is about 307GB + 64GB + (41.6GB * UPI links) in the case of Intel, so
> the best result you could get is around 307+64+164 = ~535GB/s if you
> have the full 4 links.
>
> You'd want to distribute the cross-socket traffic proportionally to the
> UPI bandwidth, not to the total.
>
> This leaves us with weights of:
>
> node0 - 57%
> node1 - 26%
> node2 - 12%
> node3 -  5%
>
> Again, nodes are naturally the place to carry the weights here.  In
> this scenario, placing them in memory-tiers would require one tier per
> node.

Does the workload run on the CPUs of node 0 only?  That appears
unreasonable.  If the memory bandwidth requirement of the workload is so
large that CXL is needed to expand bandwidth, why not also run the
workload on the CPUs of node 1 and use the full memory bandwidth of
node 1?

If the workload runs on the CPUs of both node 0 and node 1, then the
cross-socket traffic should be minimized if possible.  That is,
threads/processes on node 0 should interleave the memory of node 0 and
node 2, while those on node 1 should interleave the memory of node 1 and
node 3.

But TBH, I lack knowledge about real-life workloads, so my understanding
may be wrong.  Please correct me for any mistakes.

--
Best Regards,
Huang, Ying

[snip]
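
As a quick sanity check of the percentages in the two examples above,
here is a small C sketch that derives them from the quoted bandwidth
figures.  One assumption: the cross-socket UPI budget is split between
node1 and node3 in proportion to the bandwidth behind each (307 vs. 64
GB/s); that split reproduces the 57/26/12/5 figures but is not spelled
out explicitly above.

/*
 * Weights proportional to per-node bandwidth, with the cross-socket
 * share capped by the UPI links.  Figures are the theoretical numbers
 * quoted in the examples above.
 */
#include <stdio.h>

static void print_weights(const char *label, const double bw[], int n)
{
	double total = 0.0;
	int i;

	for (i = 0; i < n; i++)
		total += bw[i];
	printf("%s\n", label);
	for (i = 0; i < n; i++)
		printf("  node%d: %.0f%%\n", i, 100.0 * bw[i] / total);
}

int main(void)
{
	/* Example 1: one socket, two CXL devices (node0..node2). */
	const double ex1[] = { 307.0, 64.0, 32.0 };

	/*
	 * Example 2, seen from node 0: traffic to nodes 1 and 3 is
	 * capped by the UPI links (~41.6 GB/s * 4), split here in
	 * proportion to the bandwidth behind each remote node.
	 */
	const double upi = 41.6 * 4;
	const double ex2[] = {
		307.0,				/* node0: local DRAM  */
		upi * 307.0 / (307.0 + 64.0),	/* node1: remote DRAM */
		64.0,				/* node2: local CXL   */
		upi *  64.0 / (307.0 + 64.0),	/* node3: remote CXL  */
	};

	print_weights("Example 1:", ex1, 3);
	print_weights("Example 2:", ex2, 4);
	return 0;
}

Run as-is, this prints roughly 76/16/8 for Example 1 and 57/26/12/5 for
Example 2, matching the weights listed above.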
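
For the per-socket policy suggested in the reply (threads on node 0
using only node 0 and node 2, threads on node 1 using only node 1 and
node 3), a minimal sketch with the existing unweighted interleave
support in libnuma might look like the following.  It assumes the node
numbering of Example 2 and that libnuma is installed; it deliberately
does not use the weighted interleave discussed in this thread, since
that interface is still only a proposal.

/*
 * Run on node 0's CPUs and interleave allocations across node 0 (local
 * DRAM) and node 2 (local CXL) only, so no allocation crosses the UPI
 * link.  Build with: gcc -o sketch sketch.c -lnuma
 */
#include <numa.h>
#include <stdio.h>

int main(void)
{
	struct bitmask *cpus, *mems;

	if (numa_available() < 0) {
		fprintf(stderr, "NUMA is not available on this system\n");
		return 1;
	}

	/* Restrict this process to the CPUs of node 0. */
	cpus = numa_parse_nodestring("0");
	if (!cpus || numa_run_on_node_mask(cpus) < 0) {
		perror("numa_run_on_node_mask");
		return 1;
	}

	/* Interleave all further allocations across nodes 0 and 2. */
	mems = numa_parse_nodestring("0,2");
	if (!mems) {
		fprintf(stderr, "numa_parse_nodestring failed\n");
		return 1;
	}
	numa_set_interleave_mask(mems);

	/* ... run the node-0 half of the workload here ... */

	numa_free_nodemask(mems);
	numa_free_nodemask(cpus);
	return 0;
}

The same effect is available from the command line with something like
"numactl --cpunodebind=0 --interleave=0,2 <workload>", one invocation
per socket.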