This patchset implements weighted interleave and adds a new cgroup
sysfs entry: cgroup/memory.interleave_weights (excluded from root).

The il_weight of a node is used by mempolicy to implement weighted
interleave when `numactl --interleave=...` is invoked. By default,
il_weight for a node is always 1, which preserves the default
round-robin interleave behavior.

Interleave weights denote the number of pages that should be
allocated from the node when interleaving occurs and have a range
of 1-255. The weight of a node can never be 0; the preferred way
to prevent allocation is to remove the node from the cpuset or
mempolicy altogether.

For example, if a node's interleave weight is set to 5, 5 pages
will be allocated from that node before the next node is scheduled
for allocations.

  # Set node weight for node 0 to 5
  echo 0:5 > /sys/fs/cgroup/user.slice/memory.interleave_weights

  # Set node weight for node 1 to 3
  echo 1:3 > /sys/fs/cgroup/user.slice/memory.interleave_weights

  # View the currently set weights
  cat /sys/fs/cgroup/user.slice/memory.interleave_weights
  0:5,1:3

Weights will only be displayed for possible nodes.

With this it becomes possible to set an interleaving strategy
that fits the available bandwidth for the devices available on
the system. An example system:

Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
Node 1 - CXL Memory, 64GB/s BW, on Node 0 root complex

In this setup, the effective weights for a node set of [0,1]
may be [86, 14] (86% of memory on Node 0, 14% on Node 1)
or some smaller fraction thereof to encourage quicker rounds for
better overall distribution.

This spreads memory out across devices which all have different
latency and bandwidth attributes in a way that can maximize the
available resources.

~Gregory

=============
Version Notes:

= v4 notes

Moved interleave weights to cgroups from nodes.

Omitted them from the root cgroup for initial testing/comment, but
it seems like it may be a reasonable idea to place them there too.
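
For readers who want to experiment with candidate weights before
writing them to memory.interleave_weights, the following is a minimal
userspace sketch in Python. It is not the kernel code from this series.
It only models the behavior described in the cover text: weights in the
range 1-255, and N pages taken from a node before moving to the next.
The function names (weights_from_bandwidth, reduce_weights,
weighted_interleave) are hypothetical and exist only for illustration;
visiting nodes in ascending node-id order is also an assumption.

```python
from collections import Counter
from functools import reduce
from math import gcd

# Weight range stated in the cover letter.
MIN_WEIGHT, MAX_WEIGHT = 1, 255


def weights_from_bandwidth(bw_gbps):
    """Hypothetical helper: percentage weights proportional to bandwidth."""
    total = sum(bw_gbps.values())
    weights = {}
    for node, bw in bw_gbps.items():
        w = round(100 * bw / total)
        # A weight may never be 0 and may not exceed 255.
        weights[node] = max(MIN_WEIGHT, min(MAX_WEIGHT, w))
    return weights


def reduce_weights(weights):
    """Divide all weights by their GCD so each round is shorter."""
    g = reduce(gcd, weights.values())
    return {node: w // g for node, w in weights.items()}


def weighted_interleave(weights, npages):
    """Yield the node chosen for each of npages page allocations.

    Assumption: nodes are visited in ascending id order, and each node
    receives `weight` consecutive pages before the next node is used.
    """
    nodes = sorted(weights)
    allocated = 0
    while allocated < npages:
        for node in nodes:
            for _ in range(weights[node]):
                if allocated == npages:
                    return
                yield node
                allocated += 1


if __name__ == "__main__":
    # Example system from the text: 400GB/s DRAM, 64GB/s CXL.
    w = weights_from_bandwidth({0: 400, 1: 64})
    print("percent weights:", w)
    print("reduced weights:", reduce_weights(w))
    # Weights 0:5,1:3 as set in the shell example above.
    print(Counter(weighted_interleave({0: 5, 1: 3}, 16)))
```

With the example bandwidths this yields the [86, 14] split mentioned
above, and the GCD reduction gives one possible "smaller fraction
thereof" with the same ratio and a shorter round.
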
== Weighted interleave

mm/mempolicy: modify interleave mempolicy to use node weights

The mempolicy MPOL_INTERLEAVE utilizes the node weights defined in
the cgroup memory.interleave_weights interface to implement
weighted interleave. By default, since all nodes default to a
weight of 1, the original interleave behavior is retained.

============
RFC History

Node based weights
By: Gregory Price
https://lore.kernel.org/linux-mm/20231031003810.4532-1-gregory.price@xxxxxxxxxxxx/

Memory-tier based weights
By: Ravi Shankar
https://lore.kernel.org/all/20230927095002.10245-1-ravis.opensrc@xxxxxxxxxx/

Mempolicy multi-node weighting w/ set_mempolicy2:
By: Gregory Price
https://lore.kernel.org/all/20231003002156.740595-1-gregory.price@xxxxxxxxxxxx/

Hasan Al Maruf: N:M weighting in mempolicy
https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@xxxxxxxxxxx/T/

Huang, Ying's presentation in lpc22, 16th slide in
https://lpc.events/event/16/contributions/1209/attachments/1042/1995/\
Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf

===================

Gregory Price (3):
  mm/memcontrol: implement memcg.interleave_weights
  mm/mempolicy: implement weighted interleave
  Documentation: sysfs entries for cgroup.memory.interleave_weights

 Documentation/admin-guide/cgroup-v2.rst       |  45 +++++
 .../admin-guide/mm/numa_memory_policy.rst     |  11 ++
 include/linux/memcontrol.h                    |  31 ++++
 include/linux/mempolicy.h                     |   3 +
 mm/memcontrol.c                               | 172 ++++++++++++++++++
 mm/mempolicy.c                                | 153 +++++++++++++---
 6 files changed, 387 insertions(+), 28 deletions(-)

-- 
2.39.1