Re: [RFC 1/1] mm/mempolicy: introduce system default interleave weights

Gregory Price <gregory.price@xxxxxxxxxxxx> · Mon, 26 Feb 2024 09:29:59 -0500

On Fri, Feb 23, 2024 at 05:11:23PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@xxxxxxxxxxxx> writes:
>

(sorry for the re-send, error replying to list)

> >> > +	/* If node is not set or has < 1% of total bw, use minimum value of 1 */
> >> > +	for (i = 0; i < nr_node_ids; i++) {
> >> > +		if (new_bw[i])
> >> > +			new_iw[i] = max((100 * new_bw[i] / ttl_bw), 1);
> 
> IIUC, the sum of interleave weights of all nodes will be 100.  If there
> are more than 100 nodes in the system, this doesn't work properly.  How
> about use some fixed number like "16" for DRAM node?
>

I suppose we could add a "type" value into the interface that says
what approximate "tier" a node is in, or we could ask the tiering
component for that information.  But what does this actually change?

You still calculate the percentage of bandwidth provided by each node,
and then just apply that to the larger default number. I don't see the
point in that - if each node provides less than 1% of the overall system
bandwidth, then larger numbers won't do much. In fact, we want smaller
numbers to spread spatially local data out more aggressively.
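
To make the >100 node case concrete, here is a minimal userspace sketch of
the quoted loop. It is not the patch code: `percent_weights` and
`is_plain_interleave` are hypothetical names, and the bandwidth values are
made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace mirror of the quoted loop: a node with bandwidth
 * data gets its integer percentage of the total, floored at 1. Nodes with
 * no data (bw == 0) are left untouched, as in the quoted hunk.
 */
static void percent_weights(const uint64_t *bw, unsigned int *iw, size_t n)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += bw[i];
	assert(total > 0);	/* at least one node must report bandwidth */

	for (i = 0; i < n; i++) {
		if (bw[i]) {
			uint64_t pct = 100 * bw[i] / total;

			iw[i] = pct ? (unsigned int)pct : 1;
		}
	}
}

/* Returns 1 if every weight is 1, i.e. plain round-robin interleave. */
static int is_plain_interleave(const unsigned int *iw, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (iw[i] != 1)
			return 0;
	return 1;
}
```

With 128 nodes of equal bandwidth, every node holds under 1% of the total,
so each weight floors to 1 and the result is ordinary 1:1 interleave.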

More important question: In what world is a large numa system liable
to use this interface to any real benefit?

I'd briefly considered this, but I strayed away from supporting that
case.  Probably worth documenting, at the very least.

We had the cross-socket interleave discussion previously in the prior
series.  The question above simplifies (complicates?) to:  How useful
is interleave (weighted or not) in cross-socket workloads?

Consider the following configuration:

 ---------   A  --------    C    -------- D  ---------
 | DRAM0 | ---- | cpu0 |---UPI---| cpu1 |----| DRAM1 |
 ---------      --------         --------    ---------
                   | B              | E
                --------         --------
                | cxl0 |         | cxl1 |
                --------         --------

Theoretical throughputs

A&D: 512GB/s  (8 channel DDR5)
B&E: 64GB/s   (1 CXL/PCIe5 link)
C  : 62.4GB/s (3x UPI links)

Where are the 100 nodes coming from?

If it's across interconnects (UPI), then the throughput to remote
DRAM is better described by C, not A or D. However, we don't have
that information (maybe we should?).  More importantly... is
interleaving across these links even useful?  I suppose if you did
sub-numa clustering stuff and had an ultra-super-numa-aware piece
of software capable of keeping certain chunks of memory in certain
cores that might be useful.... but then you probably actually want
task-local weights as opposed to using the system default.

Otherwise, does a UPI link actually get the full throughput? Probably
only if the remote memory bus is unloaded.  If the remote bus is
loaded, then link C performance information is basically a lie.

I've been convinced (so far) that cross-socket interconnect
interleaving is not a real use-case unless you intend to only run
your software on a single socket and use the remote socket for
whatever you can swipe over the interconnect. In that case, you're
probably smart enough to set the interleave weights manually.

So what if the nodes are coming from many memory sources down one
or more local CXL links (link B from cpu0)?

 ---------   A  --------
 | DRAM0 | ---- | cpu0 |
 ---------      --------
                   | B
      ----------------------------
      |                          |
  --------                    --------
  | cxl0 |       ......       | cxlN |
  --------                    --------

In that case it would be better for many reasons to reconfigure the
system to combine those nodes into fewer nodes via a hardware interleave
set.  This can be done in hardware (at a switch), in BIOS (at the root
complex), or by the CXL Driver.  The result is fewer nodes, and the real
performance of that node can be calculated by the drivers and reported
accordingly.

So, coming back to this code:  why am I doing GCD across all
nodes, rather than taking the full topology into account?  Mostly
because the topological information is not easily available, would
be complex to communicate across components, and the full reduction
is a decent approximation anyway.

Example from above using real HMAT reported numbers

A&D: 176100
B&E: 60000
C:   Not a node, no information available.

Produces Node Weights

Calculating total system weighted average
A:37  D:37  B:12  E:12  (37 is prime so no reductions possible)

Calculating local-node relationships only
A:74--B:25  D:74--E:25  (GCD is 1, so no reductions possible)

Notice that 12+37 = 49, and 12/49 = 24%, versus 25% in the local-only case

So the ratios end up working out basically the same, and the
smaller numbers produced by averaging over the entire system
are preferable to the "topologically aware" numbers anyway.
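
As a sanity check on the arithmetic above, here is a minimal userspace
sketch (not the patch code) that computes both the global and the
local-only weights from the HMAT numbers. `percent_weights` and
`cxl_share_pct` are hypothetical helper names, and the node order
A, B, D, E in the arrays is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: integer percent of total bandwidth, floored at 1. */
static void percent_weights(const uint64_t *bw, unsigned int *iw, size_t n)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += bw[i];
	assert(total > 0);	/* at least one node must report bandwidth */

	for (i = 0; i < n; i++) {
		if (bw[i]) {
			uint64_t pct = 100 * bw[i] / total;

			iw[i] = pct ? (unsigned int)pct : 1;
		}
	}
}

/* Percent of interleaved pages that land on the CXL node (truncated). */
static unsigned int cxl_share_pct(unsigned int dram_w, unsigned int cxl_w)
{
	assert(dram_w + cxl_w > 0);
	return 100 * cxl_w / (dram_w + cxl_w);
}
```

Feeding in { 176100, 60000, 176100, 60000 } gives 37/12/37/12, and feeding
in only { 176100, 60000 } gives 74/25.  cxl_share_pct(37, 12) is 24 and
cxl_share_pct(74, 25) is 25, so a task interleaving across one DRAM node
and its local CXL node sees nearly the same split either way.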

Obviously this breaks in a "large numa system" - but again...
is this even useful for those systems anyway? I contend: No.

This is still reasonably accurate in non-homogeneous systems:

 ---------   A  --------    C    -------- D  ---------
 | DRAM0 | ---- | cpu0 |---UPI---| cpu1 |----| DRAM1 |
 ---------      --------         --------    ---------
                   | B
                --------
                | cxl0 |
                --------

In this system the numbers work out to:

Global:  A:42  B:14  D: 42  (GCD: 14)
Reduce:  A:3   B:1   D: 3

A user doing `-w --interleave=A,B` will get a ratio of 3:1, which
is pretty much spot on.
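
The reduction step can be sketched in userspace as below.  This is an
illustration only, not the patch code: `gcd_u` and `reduce_weights` are
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Euclid's algorithm; gcd_u(0, x) == x, which makes 0 a neutral start value. */
static unsigned int gcd_u(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/* Divide all weights by their common GCD so the ratios stay the same. */
static void reduce_weights(unsigned int *iw, size_t n)
{
	unsigned int g = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		assert(iw[i] > 0);	/* weights are floored at 1 upstream */
		g = gcd_u(g, iw[i]);
	}

	if (g > 1)
		for (i = 0; i < n; i++)
			iw[i] /= g;
}
```

reduce_weights() turns { 42, 14, 42 } into { 3, 1, 3 }, while the
two-socket case { 37, 12, 37, 12 } is left alone because its GCD is 1.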

So, long-winded way of saying:
- Could we use a larger default number? Yes.
- Does that actually help us? Not really, we want smaller numbers.
- Does this reduce to normal-interleave under large-numa systems? Yes.
- Does that matter? Probably not. It doesn't seem like a real use case.
- What if it is?  The workloads probably want task-local weights anyway.

> >
> > In this scenario, I'm not sure what to do.  We must have a non-0 value
> > for that device (to avoid div-by-0), but setting an arbitrarily large
> > value also seems bad.
> 
> I think that it's kind of reasonable to use DRAM bandwidth for device
> without data.  If there are only DRAM nodes and nodes without data, this
> will make interleave weight to "1".
>

Yes, those nodes would reduce to 1.  Which is pretty much the best we can
do without accounting for interconnects - which as discussed above is not
really useful anyway.
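
A rough userspace sketch of that fallback, under stated assumptions: the
caller passes in a single DRAM bandwidth figure (`dram_bw`) to substitute
for nodes that report nothing, and `default_weights` and `gcd_u` are
hypothetical names, not functions from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Euclid's algorithm; gcd_u(0, x) == x. */
static unsigned int gcd_u(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Nodes with no bandwidth data (bw == 0) borrow dram_bw, then every node
 * gets its integer percent of the total (floored at 1), and the weights
 * are reduced by their common GCD. Note: bw[] is modified in place.
 */
static void default_weights(uint64_t *bw, unsigned int *iw, size_t n,
			    uint64_t dram_bw)
{
	uint64_t total = 0;
	unsigned int g = 0;
	size_t i;

	assert(n > 0 && dram_bw > 0);

	for (i = 0; i < n; i++) {
		if (!bw[i])
			bw[i] = dram_bw;	/* assumed fallback for missing data */
		total += bw[i];
	}

	for (i = 0; i < n; i++) {
		uint64_t pct = 100 * bw[i] / total;

		iw[i] = pct ? (unsigned int)pct : 1;
		g = gcd_u(g, iw[i]);
	}

	if (g > 1)
		for (i = 0; i < n; i++)
			iw[i] /= g;
}
```

With two DRAM nodes at 176100 and two nodes without data, every node ends
up at 25% and the GCD step reduces all four weights to 1.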

I think I'll draft up an LSF/MM chat to see if we can garner more input.
If large-numa systems are a real issue, then yes we need to address it.

~Gregory