On Fri, Feb 23, 2024 at 05:11:23PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@xxxxxxxxxxxx> writes:
>

(sorry for the re-send, error replying to list)

> >> > +	/* If node is not set or has < 1% of total bw, use minimum value of 1 */
> >> > +	for (i = 0; i < nr_node_ids; i++) {
> >> > +		if (new_bw[i])
> >> > +			new_iw[i] = max((100 * new_bw[i] / ttl_bw), 1);
>
> IIUC, the sum of interleave weights of all nodes will be 100.  If there
> are more than 100 nodes in the system, this doesn't work properly.  How
> about use some fixed number like "16" for DRAM node?
>

I suppose we could add a "type" value into the interface that says
what approximate "tier" a node is in, or we could ask the tiering
component for that information.

But what does this actually change? You still calculate the percentage
of bandwidth provided by each node, and then just apply that to the
larger default number.

I don't see the point in that - if each node provides less than 1% of
the overall system bandwidth, then larger numbers won't do much. In
fact, we want smaller numbers to spread spatially local data out more
aggressively.

More important question: In what world is a large numa system liable
to use this interface to any real benefit?

I'd briefly considered this, but I strayed away from supporting that
case. Probably worth documenting, at the very least.

We had the cross-socket interleave discussion previously in the prior
series. The question above simplifies (complicates?) to:

How useful is interleave (weighted or not) in cross-socket workloads?

Consider the following configuration:

---------   A  --------    C    --------  D  ---------
| DRAM0 | ---- | cpu0 |---UPI---| cpu1 |---- | DRAM1 |
---------      --------         --------     ---------
                  | B              | E
               --------         --------
               | cxl0 |         | cxl1 |
               --------         --------

Theoretical throughputs
A&D: 512GB/s  (8 channel DDR5)
B&E: 64GB/s   (1 CXL/PCIe5 link)
C  : 62.4GB/s (3x UPI links)

Where are the 100 nodes coming from?

If it's across interconnects (UPI), then the throughput to remote DRAM
is better described by C, not A or D. However, we don't have that
information (maybe we should?).

More importantly... is interleaving across these links even useful?

I suppose if you did sub-numa clustering stuff and had an
ultra-super-numa-aware piece of software capable of keeping certain
chunks of memory in certain cores that might be useful.... but then you
probably actually want task-local weights as opposed to using the
system default.

Otherwise, does a UPI link actually get the full throughput? Probably
only if the remote memory bus is unloaded. If the remote bus is loaded,
then link C performance information is basically a lie.

I've been convinced (so far) that cross-socket interconnect
interleaving is not a real use-case unless you intend to only run your
software on a single socket and use the remote socket for whatever you
can swipe over the interconnect.

In that case, you're probably smart enough to set the interleave
weights manually.

So what if the nodes are coming from many memory sources down one or
more local CXL links (link B from cpu0)?

---------   A  --------
| DRAM0 | ---- | cpu0 |
---------      --------
                  | B
      ----------------------------
      |                          |
   --------                  --------
   | cxl0 |      ......      | cxlN |
   --------                  --------

In that case it would be better for many reasons to reconfigure the
system to combine those nodes into fewer nodes via a hardware
interleave set. This can be done in hardware (at a switch), in BIOS (at
the root complex), or by the CXL driver.

The result is fewer nodes, and the real performance of each resulting
node can be calculated by the drivers and reported accordingly.

So coming back to this code:

Then why am I doing GCD across all nodes, rather than taking the full
topology into account?

Mostly because the topological information is not easily available,
would be complex to communicate across components, and the full
reduction is a decent approximation anyway.
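
For illustration, the percentage-then-GCD reduction described above can
be sketched in userspace C. This is only a sketch under assumptions:
bw_to_weights() and gcd_u64() are hypothetical names, not functions
from the patch, and treating a zero bandwidth as "no data, use weight
1" follows the comment in the quoted hunk rather than the actual kernel
code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Euclid's algorithm; gcd_u64(0, x) == x, so 0 is a neutral seed. */
static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
	while (b) {
		uint64_t t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Hypothetical helper (not from the patch): turn per-node bandwidth
 * into interleave weights.
 *
 *   1. weight = integer percentage of total bandwidth, minimum 1
 *      (nodes with bw == 0 are assumed to mean "no data" and get 1)
 *   2. divide every weight by the GCD taken across all nodes
 */
static void bw_to_weights(const uint64_t *bw, uint8_t *iw, size_t n)
{
	uint64_t total = 0, g = 0;
	size_t i;

	assert(n == 0 || (bw && iw));

	for (i = 0; i < n; i++)
		total += bw[i];

	for (i = 0; i < n; i++) {
		/* pct is at most 100, so it always fits in a uint8_t */
		uint64_t pct = total ? (100 * bw[i]) / total : 0;

		iw[i] = (uint8_t)(pct ? pct : 1);
		g = gcd_u64(g, iw[i]);
	}

	/* g >= 1 whenever n > 0, because every weight is at least 1 */
	for (i = 0; i < n; i++)
		iw[i] /= (uint8_t)g;
}
```

Calling bw_to_weights() on { 176100, 60000, 176100 } yields 3:1:3, and
on two nodes at 176100 plus two at 60000 it yields 37:37:12:12, which
lines up with the HMAT-based numbers worked through by hand below.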

Example from above using real HMAT reported numbers:

A&D: 176100
B&E: 60000
C:   Not a node, no information available.

Produces Node Weights

Calculating total system weighted average
A:37  D:37  B:12  E:12   (37 is prime so no reductions possible)

Calculating local-node relationships only
A:74--B:25  D:74--E:25   (GCD is 1, so no reductions possible)

Notice that 12+37 = 49, and 12/49 = 24%

So the ratios end up working out basically the same, but the smaller
numbers produced by averaging over the entire system are preferable to
the "topologically aware" numbers anyway.

Obviously this breaks in a "large numa system" - but again... is this
even useful for those systems anyway? I contend: No.

This is still reasonably accurate in non-homogeneous systems:

---------   A  --------    C    --------  D  ---------
| DRAM0 | ---- | cpu0 |---UPI---| cpu1 |---- | DRAM1 |
---------      --------         --------     ---------
                  | B
               --------
               | cxl0 |
               --------

In this system the numbers work out to:
Global: A:42 B:14 D:42 (GCD: 14)
Reduce: A:3  B:1  D:3

A user doing `-w --interleave=A,B` will get a ratio of 3:1, which is
pretty much spot on.

So, long winded way of saying:
- Could we use a larger default number? Yes.
- Does that actually help us? Not really, we want smaller numbers.
- Does this reduce to normal-interleave under large-numa systems? Yes.
- Does that matter? Probably not. It doesn't seem like a real use case.
- What if it is? The workloads probably want task-local weights anyway.

> >
> > In this scenario, I'm not sure what to do. We must have a non-0 value
> > for that device (to avoid div-by-0), but setting an abitrarily large
> > value also seems bad.
>
> I think that it's kind of reasonable to use DRAM bandwidth for device
> without data.  If there are only DRAM nodes and nodes without data, this
> will make interleave weight to "1".
>

Yes, those nodes would reduce to 1.
Which is pretty much the best we can do without accounting for
interconnects - which as discussed above is not really useful anyway.

I think I'll draft up an LSF/MM chat to see if we can garner more
input. If large-numa systems are a real issue, then yes we need to
address it.

~Gregory