On Thu, Oct 19, 2023 at 02:28:42PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@xxxxxxxxxxxx> writes:
>
> > Were you suggesting that weights should actually be part of
> > drivers/base/node.c?
>
> Yes.  drivers/base/node.c vs. memory tiers.
>

Then yes, I agree this can and probably should be placed there, especially
since I see accessor details are now being exposed at that level, which can
be used to auto-generate weights (assuming the HMAT/CDAT data exposed by
devices is actually accurate).

> > Assuming this is your meaning, I agree and I will pivot to this.
>
> Can you give a not-so-abstract example?  For example, on a system with
> node 0, 1, 2, 3, memory tiers 4 (0, 1), 22 (2, 3), ....  A workload runs
> on CPU of node 0, ...., interleaves memory on node 0, 1, ...  Then
> compare the different behavior (including memory bandwidth) with node
> and memory-tier based solution.

ah, I see.

Example 1: A single-socket system with multiple CXL memory devices
===
CPU Node:  node0
CXL Nodes: node1, node2

Bandwidth attributes (in theory):
node0 - 8 channels - ~307GB/s
node1 - x16 link   -   64GB/s
node2 - x8 link    -   32GB/s

In a system like this, the optimal distribution of memory on an interleave
for maximizing bandwidth is about 76%/16%/8%.  For the sake of simplicity:

--weighted-interleave=0:76,1:16,2:8

but realistically we could make the weights sysfs values in the node.

Regardless of the mechanism used to engage this, the most effective way to
capture this in the system is by applying weights to nodes, not tiers.  If
done in tiers, each node would have to be assigned to its own tier, making
the mechanism equivalent.  So you might as well simplify the whole thing
and chop the memtier component out.

Is this configuration realistic?  *shrug* - technically possible.  And in
fact most hardware- or driver-based interleaving mechanisms would not
really be able to manage an interleave region across these nodes, at least
not without placing the x16 device in x8 mode, or just having the wrong
distribution percentages.

Example 2: A dual-socket system with 1 CXL device per socket
===
CPU Nodes: node0, node1
CXL Nodes: node2, node3 (on sockets 0 and 1 respectively)

Bandwidth attributes (in theory):
nodes 0 & 1 - 8 channels - ~307GB/s each
nodes 2 & 3 - x16 link   -   64GB/s each

This is similar to example #1, but with one difference:  a task running on
node 0 should not treat nodes 0 and 1 the same, nor nodes 2 and 3.  This is
because on access to nodes 1 and 3, the cross-socket link (UPI, or whatever
AMD calls it) becomes a bandwidth chokepoint.

So from the perspective of node 0, the "real total" available bandwidth is
about 307GB/s + 64GB/s + (41.6GB/s * UPI links) in the case of Intel, so
the best result you could get is around 307+64+166 ~= 537GB/s if you have
the full 4 links.  You'd want to distribute the cross-socket traffic in
proportion to the UPI bandwidth, not to the remote nodes' native bandwidth.

This leaves us with weights of:

node0 - 57%
node1 - 26%
node2 - 12%
node3 -  5%

Again, nodes are naturally the place to carry the weights here.  In this
scenario, placing it in memory-tiers would require one tier per node.
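For clarity, the weights in these two examples are just each node's share
of the total achievable bandwidth from the perspective of the task's node.
Here is a throwaway userspace sketch of that arithmetic, using the same
rough GB/s figures as above - this is illustration only, not kernel code or
a proposed interface:

#include <stdio.h>

static void print_weights(const char *label, const double bw[], int n)
{
	double total = 0.0;
	int i;

	for (i = 0; i < n; i++)
		total += bw[i];

	printf("%s:", label);
	for (i = 0; i < n; i++)
		printf("  node%d=%.0f%%", i, 100.0 * bw[i] / total);
	printf("\n");
}

int main(void)
{
	/* Example 1: node0 DRAM (~307GB/s), node1 x16 CXL (64), node2 x8 CXL (32) */
	double ex1[] = { 307.0, 64.0, 32.0 };

	/*
	 * Example 2, from node0's point of view: local DRAM (node0) and local
	 * CXL (node2) at native speed; remote DRAM (node1) and remote CXL
	 * (node3) share the ~166GB/s of UPI proportionally to their native
	 * bandwidth (307:64).
	 */
	double upi = 4 * 41.6;
	double ex2[] = { 307.0, upi * 307.0 / 371.0, 64.0, upi * 64.0 / 371.0 };

	print_weights("example 1", ex1, 3);	/* ~76/16/8     */
	print_weights("example 2", ex2, 4);	/* ~57/26/12/5  */
	return 0;
}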
Example 3: A single-socket system with 2 CXL devices
===
Different from example 1: both devices are the same.

CPU Node:  node0
CXL Nodes: node1, node2

Bandwidth attributes (in theory):
node0   - 8 channels - ~307GB/s
node1/2 - x16 link   -   64GB/s

In this case you want the weights to be about 70%/15%/15% respectively.

Caveat: A user may, in fact, use the CXL driver to implement a single node
which naturally interleaves the 2 devices.  In that case it's the same as a
1-socket, 1-device setup, which is trivially 1-node-per-tier, and therefore
the weights should live with the nodes.  In the case of a single memory
tier covering both CXL nodes, you could simply make this 70/30.

However, and this is the argument against placing it in the memory-tier:
the user is free to hack off any of the chosen numa nodes via mempolicy,
which makes the aggregated weight meaningless.

Example:  --weighted-interleave=0,1

Under my current code, if I set the weights to 70/30 in the memory-tiers
code, the result is that node1 inherits the full 30% defined in the tier,
which leads to a suboptimal distribution.  What you actually want in this
case is about an 83/17% split.

However, this also presents an issue for placing the weights in the nodes:
a node weight is meaningless outside the context of the active nodemask.
If I have 2 nodes and I set their weights to 70/30, and I hack off node1, I
can't have 70% of all memory go to node0; I have to send 100% of the memory
to node0 - making the weight functionally meaningless.

So this would imply a single global weight set on the nodes is ALSO a bad
solution, and instead it would be better to extend set_mempolicy to have a
--weighted-interleave option that reads HMAT/CDAT-provided bandwidth data
and generates the weights for the selected nodes as part of the policy.

The downside of this is that if the HMAT/CDAT data is garbage, the policy
becomes garbage.  To mitigate this, we should consider allowing userland to
override those values explicitly for the purpose of weighted interleave,
should the HMAT/CDAT information be garbage/lies.

Another downside is that nodemask changes require recalculation of the
weights, which may introduce some racy conditions, but that can probably
be managed.
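To show what that recalculation amounts to, here is another throwaway
userspace sketch (again illustration only - the mask handling and numbers
are made up, reusing the rough Example 3 bandwidths).  Recomputing the
weights over only the nodes left in the policy's nodemask naturally yields
the ~83/17 split when node2 is hacked off, instead of node1 inheriting a
tier-level 30%:

#include <stdio.h>

#define NR_NODES 3

/* rough per-node bandwidth from Example 3: DRAM, x16 CXL, x16 CXL */
static const double node_bw[NR_NODES] = { 307.0, 64.0, 64.0 };

/* recompute weights over only the nodes present in the given mask */
static void weights_for_mask(unsigned int mask)
{
	double total = 0.0;
	int n;

	for (n = 0; n < NR_NODES; n++)
		if (mask & (1u << n))
			total += node_bw[n];

	printf("mask 0x%x:", mask);
	for (n = 0; n < NR_NODES; n++)
		if (mask & (1u << n))
			printf("  node%d=%.0f%%", n,
			       100.0 * node_bw[n] / total);
	printf("\n");
}

int main(void)
{
	weights_for_mask(0x7);	/* nodes 0-2: roughly 70/15/15 */
	weights_for_mask(0x3);	/* nodes 0-1: roughly 83/17    */
	return 0;
}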
Should we carry weights in node, memtier, or mempolicy?
===
The current analysis suggests that carrying it in memory-tier would simply
require memory-tier to make 1 tier per node - which may or may not be
consistent with other goals of the memtier subsystem.

The advantage of placing a weight in the node is that the "effective"
weight in the context of the entire system can be calculated at the time
nodes are created.  If, at interleave time, the interface required a
node+nodemask, then it's probably preferable to forego manual weighting and
simply calculate based on HMAT/CDAT data.

The downside of placing it in nodes is that mempolicy is free to set the
interleave set to some combination of nodes, and this would prevent any
nodes created after process launch from being used in the interleave set
unless the software detected the hotplug event.  I don't know how much of a
real concern this is, but it is a limitation.

The other option is to add --weighted-interleave, but have mempolicy
generate the weights based on node-provided CDAT/HMAT data (or overrides),
which keeps almost everything inside of mempolicy except for a couple of
interfaces to drivers/base/node.c that allow querying of that data.

Summarize:
===
The weights are actually a function of bandwidth, and can probably be
calculated on the fly - rather than being manually set.

However, we may want to consider allowing the bandwidth attributes exposed
by CDAT/HMAT to be overridden should the user discover they are
functionally incorrect.  (For reference: I have seen this myself, where a
device published 5GB/s but actually achieves 22GB/s.)

For reference, those attributes are presently RO:

static DEVICE_ATTR_RO(property)

ACCESS_ATTR(read_bandwidth);
ACCESS_ATTR(read_latency);
ACCESS_ATTR(write_bandwidth);
ACCESS_ATTR(write_latency);

~Gregory