This patchset implements weighted interleave and adds a new sysfs
entry: /sys/devices/system/node/nodeN/accessM/il_weight.

The il_weight of a node is used by mempolicy to implement weighted
interleave when `numactl --interleave=...` is invoked.  By default
il_weight for a node is always 1, which preserves the default round
robin interleave behavior.

Interleave weights may be set from 0-100, and denote the number of
pages that should be allocated from the node when interleaving
occurs.

For example, if a node's interleave weight is set to 5, 5 pages
will be allocated from that node before the next node is scheduled
for allocations.

Additionally, "node accessors" (synonymous with CPU nodes) are used
to allow for accessor-relative weighting.  The "accessor" for a task
is defined as the node the task is presently running on.

# Set node weight for node0 accessed by tasks on node0 to 5
echo 5 > /sys/devices/system/node/node0/access0/il_weight

# Set node weight for node0 accessed by tasks on node1 to 3
echo 3 > /sys/devices/system/node/node0/access1/il_weight

In this way it becomes possible to set an interleaving strategy
that fits the available bandwidth for the devices available on
the system. An example system:

Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex
Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex

In this setup, the effective weights for nodes 0-3 for a task
running on Node 0 may be [60, 20, 10, 10].

This spreads memory out across devices which all have different
latency and bandwidth attributes in a way that can maximize the
available resources.
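
To make the page distribution concrete, here is a small user-space
sketch (not kernel code) of the weighted round-robin described above.
The function name weighted_interleave and the dict-based weight table
are hypothetical, chosen only for illustration; the actual logic lives
in mm/mempolicy.c and may differ in detail. Treating a weight of 0 as
"skip this node" is an assumption of the sketch.

```python
from itertools import islice


def weighted_interleave(nodemask, weights):
    """Yield node ids forever, in weighted round-robin order.

    nodemask: node ids in interleave order.
    weights:  maps node id -> il_weight as seen from one accessor node.
    Each node is chosen weights[node] times in a row before the next
    node is scheduled. A weight of 0 skips the node (assumption).
    """
    if not any(weights[node] > 0 for node in nodemask):
        raise ValueError("at least one node needs a non-zero weight")
    while True:
        for node in nodemask:
            for _ in range(weights[node]):
                yield node


# Weight 5 on node 0 and 1 on node 1: five pages from node 0, then one
# page from node 1, then the cycle repeats.
first_pages = list(islice(weighted_interleave([0, 1], {0: 5, 1: 1}), 12))

# Effective weights for a task running on Node 0 in the example system.
weights_from_node0 = {0: 60, 1: 20, 2: 10, 3: 10}
pages = list(islice(weighted_interleave([0, 1, 2, 3], weights_from_node0), 1000))
per_node = {node: pages.count(node) for node in weights_from_node0}

if __name__ == "__main__":
    print(first_pages)
    print(per_node)
```

Over any whole number of 100-page cycles, the share of pages placed on
each node matches its weight, so with [60, 20, 10, 10] the two CXL
nodes each receive 10% of the interleaved pages.
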
~Gregory

(sorry for the repeat send, automation failure)

================================================================

Version Notes:

v3: move weights into node rather than memtiers
    some additional fixes to node.c to support this
v1/v2: add weighted-interleave support to mempolicy

= v3 notes

This update effectively removes the connection between mempolicy
and memory-tiers by simply placing the interleave weights directly
in the node accessor information structure.

Node was recommended by Huang, Ying
Accessor was recommended by Ravi Shankar

== Move weights into node

Originally this work was done by placing weights in the memory tier.
In this patch set we changed the weights to live in the numa node
accessor structure, which allows for a more natural weighting scheme
and also supports source-node relative weighting.

Interleave weight is located in:
/sys/devices/system/node/nodeN/accessM/il_weight

and is set with a value between 1 and 100:

# Set node weight for node0 accessed by node0 to 5
echo 5 > /sys/devices/system/node/node0/access0/il_weight

By default, il_weight is always set to 1, which mimics the default
interleave behavior (simple round-robin).

== Other Node fixes

2 other updates to node.c were required to support this:
1) The access list must be initialized prior to the node struct
   pointer being registered in the node array
2) The accessors in the list must be registered regardless of
   whether HMAT/HMEM information is reported. Presently this
   results in 0-value information being present in the various
   access subgroups

== Weighted interleave

mm/mempolicy: modify interleave mempolicy to use node weights

The node subsystem implements interleave weighting for the purpose
of bandwidth optimization.  Each node may have different weights in
relation to each compute node ("access node").

The mempolicy MPOL_INTERLEAVE utilizes the node weights to implement
weighted interleave.
By default, since all nodes default to a weight of 1, the original
interleave behavior is retained.

Examples

Weight settings:
  echo 4 > node0/access0/il_weight
  echo 3 > node1/access0/il_weight

  echo 2 > node1/access1/il_weight
  echo 1 > node0/access1/il_weight

Results:

Task A:
  cpunode:  0
  nodemask: [0,1]
  weights:  [4,3]
  allocation result: [0,0,0,0,1,1,1 repeat]

Task B:
  cpunode:  1
  nodemask: [0,1]
  weights:  [1,2]
  allocation result: [0,1,1 repeat]

=== original RFCs ====

Memory-tier based weights
By: Ravi Shankar
https://lore.kernel.org/all/20230927095002.10245-1-ravis.opensrc@xxxxxxxxxx/

Mempolicy multi-node weighting w/ set_mempolicy2:
By: Gregory Price
https://lore.kernel.org/all/20231003002156.740595-1-gregory.price@xxxxxxxxxxxx/

N:M weighting in mempolicy
By: Hasan Al Maruf
https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@xxxxxxxxxxx/T/

Ying Huang's presentation in lpc22, 16th slide in
https://lpc.events/event/16/contributions/1209/attachments/1042/1995/\
Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf

Gregory Price (4):
  base/node.c: initialize the accessor list before registering
  node: add accessors to sysfs when nodes are created
  node: add interleave weights to node accessor
  mm/mempolicy: modify interleave mempolicy to use node weights

 drivers/base/node.c       | 120 ++++++++++++++++++++++++++++++++-
 include/linux/mempolicy.h |   4 ++
 include/linux/node.h      |  17 +++++
 mm/mempolicy.c            | 138 +++++++++++++++++++++++++++++---------
 4 files changed, 246 insertions(+), 33 deletions(-)

-- 
2.39.1