Weighted interleave added a sysfs interface for users to change the
interleave weights - with a default value of `1` until reasonable
system-default code could be agreed upon. This RFC series suggests and
solicits ideas for how to generate these system defaults, and lays out
some of the challenges in generating them.

Future work on the CXL driver (drivers/cxl) will introduce additional
code which registers HMAT information for hotplug memory provided by
CXL devices. This RFC does not presently provide that integration, but
will after it is upstream.

Interfaces introduced:
- mempolicy_set_node_perf
  Called when HMAT data for a node is reported to the system
  (a sketch of this hook follows at the end of this mail)

Integration points:
- node_set_perf_attrs - for reporting bandwidth info to mempolicy
- get_il_weight and the weighted interleave allocation interfaces,
  to provide system defaults when applying weighted interleave

New data in mempolicy:
- node_bw_table - cached bandwidth information about each node
- default_iw_table - the system default interleave weights

Note that because there are now multiple tables (default and sysfs),
the allocators fetch each weight individually rather than via memcpy
(see the fetch sketch at the end of this mail). This means that if
weights change at runtime (extremely unlikely), the allocators may
temporarily see an "incorrect" distribution while the system is being
reweighted. This is not harmful (merely inaccurate) and is the cost of
providing a clean way to revert to the system defaults.

v1: Simple GCD reduction of the basic bandwidth distribution.

Approach:
- whenever new coordinates are reported, recalculate all weights
- cache each node's min(read, write) bandwidth
- calculate the percentage of whole-system bandwidth each node provides
- use GCD to reduce all percentages down to the minimum possible
  (a standalone demonstration follows at the end of this mail)

This approach is simple and fast, and works reasonably well when the
numbers reported by HMAT for each node happen to land on easily
reducible percentages. For example, a system presenting 88% of its
bandwidth on DRAM and 11% on CXL (floored for simplicity) ends up with
default weights of (8:1), which keeps each individual weight
preferably small.

The downside of this approach is that it is susceptible to prime and
co-prime numbers keeping interleave weights large (e.g. 89:11 vs 8:1).
We prefer finer-grained interleaves, to prevent large swaths of
contiguous memory from landing on the same device.

Additionally, this approach hides the fact that multi-socket systems
experience chokepoints across sockets. For example, a 2-socket system
with 200GB/s of DDR bandwidth on each socket does not give a given
socket an aggregate of 400GB/s: interconnects between sockets provide
less aggregate bandwidth than the DDR they provide access to (e.g. 3
UPI lanes vs 8 DDR channels). So this approach will reduce
multi-socket interleave weights to (1:1) by default if all sockets
provide the same bandwidth.

Signed-off-by: Gregory Price <gregory.price@xxxxxxxxxxxx>

Gregory Price (1):
  mm/mempolicy: introduce system default interleave weights

 drivers/acpi/numa/hmat.c  |   1 +
 drivers/base/node.c       |   7 +++
 include/linux/mempolicy.h |   4 ++
 mm/mempolicy.c            | 129 ++++++++++++++++++++++++++++++--------
 4 files changed, 116 insertions(+), 25 deletions(-)

-- 
2.39.1
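For anyone skimming, a few illustrative sketches of the pieces
described above; none of these are the patch contents.

First, the reporting path. This is roughly the shape the new
mempolicy_set_node_perf() hook could take when node_set_perf_attrs()
reports HMAT coordinates for a node; struct access_coordinate is the
kernel's existing type, but the recalc helper name is hypothetical:

/*
 * Sketch only - not the patch. Called when HMAT access coordinates
 * are reported for a node (e.g. from node_set_perf_attrs()).
 */
#include <linux/node.h>		/* struct access_coordinate */
#include <linux/numa.h>		/* MAX_NUMNODES */
#include <linux/minmax.h>	/* min() */

static unsigned int node_bw_table[MAX_NUMNODES];

static void mempolicy_recalc_default_weights(void); /* hypothetical */

void mempolicy_set_node_perf(unsigned int node,
			     struct access_coordinate *coords)
{
	/* cache the node's min(read, write) bandwidth... */
	node_bw_table[node] = min(coords->read_bandwidth,
				  coords->write_bandwidth);

	/* ...then recompute all of the system default weights */
	mempolicy_recalc_default_weights();
}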
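Second, a self-contained userspace demonstration of the v1 GCD
reduction itself. The bandwidth figures are invented to reproduce the
88%/11% example above:

/* Standalone demo of the GCD reduction - not the kernel code. */
#include <stdio.h>

static unsigned int gcd(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;
		a = b;
		b = t;
	}
	return a;
}

int main(void)
{
	/* cached min(read, write) bandwidth per node, MB/s (invented) */
	unsigned int bw[] = { 176000, 22000 };	/* node0 DRAM, node1 CXL */
	unsigned int n = sizeof(bw) / sizeof(bw[0]);
	unsigned int total = 0, pct[2], g = 0, i;

	for (i = 0; i < n; i++)
		total += bw[i];

	for (i = 0; i < n; i++) {
		/* percentage of whole-system bandwidth, floored */
		pct[i] = bw[i] * 100 / total;	/* 88, 11 */
		if (!pct[i])
			pct[i] = 1;		/* never hand out weight 0 */
		g = gcd(g, pct[i]);		/* gcd(88, 11) = 11 */
	}

	for (i = 0; i < n; i++)
		printf("node%u: weight %u\n", i, pct[i] / g); /* 8, 1 */
	return 0;
}

Swap the figures so the percentages floor to 89% and 11% and the
result stays (89:11), since gcd(89, 11) = 1 - the co-prime problem
called out above.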
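Finally, the per-weight fetch. Treating a sysfs value of 0 as "unset,
fall back to the system default" is my reading of the revert behavior,
and the real tables are RCU-protected, so take this as shape rather
than substance:

/*
 * Sketch only: fetch one node's effective weight at allocation time.
 * Locking/RCU elided; "0 means unset" is an assumption.
 */
#include <linux/types.h>	/* u8 */
#include <linux/numa.h>		/* MAX_NUMNODES */

static u8 iw_table[MAX_NUMNODES];	  /* sysfs-provided weights */
static u8 default_iw_table[MAX_NUMNODES]; /* bandwidth-derived defaults */

static u8 get_il_weight(int node)
{
	u8 weight = iw_table[node];	/* user override, if any */

	if (!weight)
		weight = default_iw_table[node];
	return weight ? weight : 1;	/* floor of 1, never 0 */
}

Fetching one weight at a time like this, instead of memcpy()ing a
whole table, is what allows a momentarily stale mix of old and new
weights during a reweight, as noted above.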