Weighted interleave added a sysfs interface for users to change the
interleave weights - with a default value of `1` until reasonable
system-default code could be agreed upon. This RFC series suggests and
solicits ideas for how to generate these system defaults, and lays out
some of the challenges in generating them.

Future work on the CXL driver (drivers/cxl) will introduce additional
code which registers HMAT information for hotplug memory provided by
CXL devices. This RFC does not presently provide that integration, but
will after it is upstream.

Interfaces introduced:
- mempolicy_set_node_perf
  Called when HMAT data for a node is reported to the system
  (a sketch of this hook follows at the end of this mail)

Integration points:
- node_set_perf_attrs - for reporting bandwidth info to mempolicy
- get_il_weight and the weighted interleave allocation interfaces,
  to provide system defaults when applying weighted interleave

New data in mempolicy:
- node_bw_table - cached bandwidth information about each node
- default_iw_table - the system default interleave weights

Note that because there are now multiple tables (default and sysfs),
the allocators fetch each weight individually rather than via memcpy
(see the fetch sketch at the end of this mail). This means that if
weights change at runtime (extremely unlikely), the allocators may
temporarily see an "incorrect" distribution while the system is being
reweighted. This is not harmful (merely inaccurate) and is the cost of
providing a clean way to revert to the system defaults.

v1: Simple GCD reduction of the basic bandwidth distribution.

Approach:
- whenever new coordinates are reported, recalculate all weights
- cache each node's min(read, write) bandwidth
- calculate the percentage of whole-system bandwidth each node provides
- use GCD to reduce all percentages down to the minimum possible
  (a standalone demonstration follows at the end of this mail)

This approach is simple and fast, and works reasonably well when the
numbers reported by HMAT for each node happen to land on easily
reducible percentages. For example, a system presenting 88% of its
bandwidth on DRAM and 11% on CXL (floored for simplicity) ends up with
default weights of (8:1), which keeps each individual weight
preferably small.

The downside of this approach is that it is susceptible to prime and
co-prime numbers keeping interleave weights large (e.g. 89:11 vs 8:1).
We prefer finer-grained interleaves, to prevent large swaths of
contiguous memory from landing on the same device.

Additionally, this approach hides the fact that multi-socket systems
experience chokepoints across sockets. For example, a 2-socket system
with 200GB/s of DDR bandwidth on each socket does not give a given
socket an aggregate of 400GB/s: interconnects between sockets provide
less aggregate bandwidth than the DDR they provide access to (e.g. 3
UPI lanes vs 8 DDR channels). So this approach will reduce
multi-socket interleave weights to (1:1) by default if all sockets
provide the same bandwidth.

Signed-off-by: Gregory Price <gregory.price@xxxxxxxxxxxx>

Gregory Price (1):
  mm/mempolicy: introduce system default interleave weights

 drivers/acpi/numa/hmat.c  |   1 +
 drivers/base/node.c       |   7 +++
 include/linux/mempolicy.h |   4 ++
 mm/mempolicy.c            | 129 ++++++++++++++++++++++++++++++--------
 4 files changed, 116 insertions(+), 25 deletions(-)

-- 
2.39.1
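For anyone skimming, a few illustrative sketches of the pieces
described above; none of these are the patch contents.

First, the reporting path. This is roughly the shape the new
mempolicy_set_node_perf() hook could take when node_set_perf_attrs()
reports HMAT coordinates for a node; struct access_coordinate is the
kernel's existing type, but the recalc helper name is hypothetical:

/*
 * Sketch only - not the patch. Called when HMAT access coordinates
 * are reported for a node (e.g. from node_set_perf_attrs()).
 */
#include <linux/node.h>		/* struct access_coordinate */
#include <linux/numa.h>		/* MAX_NUMNODES */
#include <linux/minmax.h>	/* min() */

static unsigned int node_bw_table[MAX_NUMNODES];

static void mempolicy_recalc_default_weights(void); /* hypothetical */

void mempolicy_set_node_perf(unsigned int node,
			     struct access_coordinate *coords)
{
	/* cache the node's min(read, write) bandwidth... */
	node_bw_table[node] = min(coords->read_bandwidth,
				  coords->write_bandwidth);

	/* ...then recompute all of the system default weights */
	mempolicy_recalc_default_weights();
}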
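Second, a self-contained userspace demonstration of the v1 GCD
reduction itself. The bandwidth figures are invented to reproduce the
88%/11% example above:

/* Standalone demo of the GCD reduction - not the kernel code. */
#include <stdio.h>

static unsigned int gcd(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;
		a = b;
		b = t;
	}
	return a;
}

int main(void)
{
	/* cached min(read, write) bandwidth per node, MB/s (invented) */
	unsigned int bw[] = { 176000, 22000 };	/* node0 DRAM, node1 CXL */
	unsigned int n = sizeof(bw) / sizeof(bw[0]);
	unsigned int total = 0, pct[2], g = 0, i;

	for (i = 0; i < n; i++)
		total += bw[i];

	for (i = 0; i < n; i++) {
		/* percentage of whole-system bandwidth, floored */
		pct[i] = bw[i] * 100 / total;	/* 88, 11 */
		if (!pct[i])
			pct[i] = 1;		/* never hand out weight 0 */
		g = gcd(g, pct[i]);		/* gcd(88, 11) = 11 */
	}

	for (i = 0; i < n; i++)
		printf("node%u: weight %u\n", i, pct[i] / g); /* 8, 1 */
	return 0;
}

Swap the figures so the percentages floor to 89% and 11% and the
result stays (89:11), since gcd(89, 11) = 1 - the co-prime problem
called out above.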
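Finally, the per-weight fetch. Treating a sysfs value of 0 as "unset,
fall back to the system default" is my reading of the revert behavior,
and the real tables are RCU-protected, so take this as shape rather
than substance:

/*
 * Sketch only: fetch one node's effective weight at allocation time.
 * Locking/RCU elided; "0 means unset" is an assumption.
 */
#include <linux/types.h>	/* u8 */
#include <linux/numa.h>		/* MAX_NUMNODES */

static u8 iw_table[MAX_NUMNODES];	  /* sysfs-provided weights */
static u8 default_iw_table[MAX_NUMNODES]; /* bandwidth-derived defaults */

static u8 get_il_weight(int node)
{
	u8 weight = iw_table[node];	/* user override, if any */

	if (!weight)
		weight = default_iw_table[node];
	return weight ? weight : 1;	/* floor of 1, never 0 */
}

Fetching one weight at a time like this, instead of memcpy()ing a
whole table, is what allows a momentarily stale mix of old and new
weights during a reweight, as noted above.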