This patch set extends the mempolicy interface to enable new
mempolicies which may require extended data to operate.

One such policy is included with this set as an example:
MPOL_WEIGHTED_INTERLEAVE

There are 3 major "phases" in the patch set:
1) Implement a "global weight" mechanism via sysfs, which allows
   set_mempolicy to implement MPOL_WEIGHTED_INTERLEAVE utilizing
   weights set by the administrator (or system daemon).

2) A refactor of the mempolicy creation mechanism to accept an
   extensible argument structure `struct mempolicy_args` to promote
   code re-use between the original mempolicy/mbind interfaces and
   the new extended mempolicy2/mbind2 interfaces.

3) Implementation of set_mempolicy2, get_mempolicy2, and mbind2,
   along with the addition of task-local weights so that per-task
   weights can be registered for MPOL_WEIGHTED_INTERLEAVE.

=====================================================================
(Patch 1) : sysfs addition - /sys/kernel/mm/mempolicy/

This feature provides a way to set interleave weight information under
sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN/nodeM/weight

    The sysfs structure is designed as follows.

      $ tree /sys/kernel/mm/mempolicy/
      /sys/kernel/mm/mempolicy/
      ├── cpu_nodes
      ├── possible_nodes
      └── weighted_interleave
          ├── nodeN
          │   ├── nodeM
          │   │   └── weight
          │   └── nodeM+X
          │       └── weight
          └── nodeN+X
              ├── nodeM
              │   └── weight
              └── nodeM+X
                  └── weight

'cpu_nodes' and 'possible_nodes' are added to 'mm/mempolicy' to help
describe the expected structures under the mempolicy directories.

For example, 'cpu_nodes' describes which 'nodeN' directories will
exist in 'weighted_interleave', while 'possible_nodes' describes
which 'nodeM' directories will exist under the 'nodeN' directories.

Internally, weights are represented as a matrix of [src,dst] nodes.

	struct interleave_weight_table {
		unsigned char weights[MAX_NUMNODES];
	};
	static struct interleave_weight_table *iw_table;

"Source Nodes" are nodes which have 1 or more CPUs, while "Destination
Nodes" include any possible node.  A "Possible" node is one which has
been reserved by the system, but which may or may not be online.

We present possible nodes, instead of online nodes, to simplify the
management interface, considering that a) the table of MAX_NUMNODES
size is allocated anyway to simplify fetching of weights, and b) it
simplifies the management of hotplug events, allowing weights to be
set prior to a node coming online, which may be beneficial for
immediate use of the memory.

The 'weight' of a node (an unsigned char of value 1-255) is the number
of pages that are allocated on that node during a "weighted interleave"
round. (See 'weighted interleave' for more details.)

The [src,dst] matrix is implemented to capture the complexity of
bandwidth distribution across a multi-socket or heterogeneous memory
environment. For example, consider a 2-socket Intel server with 1 CXL
Memory expander attached to each socket.

From the perspective of a task on a CPU in Socket 0, the bandwidth
distribution is as follows:

Socket 0 DRAM:        (# DDR Channels) * (DDR Bandwidth)       ~400GB/s
Socket 0 CXL :        (# CXL Lanes) * (CXL Lane Bandwidth)      128GB/s
Socket 1 DRAM + CXL:  (# UPI Lanes) * (UPI Bandwidth)           ~64GB/s

If the task is then migrated to Socket 1, the bandwidth distribution
flips to the following.

Socket 1 DRAM:        (# DDR Channels) * (DDR Bandwidth)       ~400GB/s
Socket 1 CXL :        (# CXL Lanes) * (CXL Lane Bandwidth)      128GB/s
Socket 0 DRAM + CXL:  (# UPI Lanes) * (UPI Bandwidth)           ~64GB/s

The matrix allows for a 'source node' perspective weighting strategy:
a migrated task "re-weights" new allocations immediately, simply by
changing the [src] index it uses to access the global interleave
weight table.
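
To make the [src,dst] lookup concrete, here is a minimal userspace
sketch, not the code from the patch. The small MAX_NUMNODES value, the
statically sized table, the example weights, and the helper name
iw_lookup() are all assumptions made for illustration; treating an
unset (zero) entry as weight 1 is likewise an assumption of this sketch.

```c
#include <assert.h>

#define MAX_NUMNODES 4	/* assumption: tiny value for illustration only */

struct interleave_weight_table {
	unsigned char weights[MAX_NUMNODES];
};

/*
 * One row per source node. Illustrative layout (hypothetical):
 * node 0 = socket 0 DRAM, node 1 = socket 1 DRAM,
 * node 2 = socket 0 CXL,  node 3 = socket 1 CXL.
 * Rows 2 and 3 are left zeroed because CXL nodes have no CPUs.
 */
static struct interleave_weight_table iw_table[MAX_NUMNODES] = {
	[0] = { .weights = { 6, 1, 2, 1 } },	/* as seen from socket 0 */
	[1] = { .weights = { 1, 6, 1, 2 } },	/* as seen from socket 1 */
};

/* Hypothetical helper: weight of 'dst' from the perspective of 'src'. */
static unsigned char iw_lookup(int src, int dst)
{
	unsigned char w;

	assert(src >= 0 && src < MAX_NUMNODES);
	assert(dst >= 0 && dst < MAX_NUMNODES);

	w = iw_table[src].weights[dst];
	return w ? w : 1;	/* assumption: unset entries behave as 1 */
}
```

With this table, a task that moves from a CPU on node 0 to a CPU on
node 1 sees the weight for node 0 drop from 6 to 1 without any table
update; only the row index changes.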
=====================================================================
(Patch 2) set_mempolicy: MPOL_WEIGHTED_INTERLEAVE

Weighted interleave is a new memory policy that interleaves memory
across NUMA nodes in the provided nodemask based on the weights
described in patch 1 (sysfs global weights).

When a system has multiple NUMA nodes and it becomes bandwidth hungry,
the current MPOL_INTERLEAVE could be a wise option.

However, if those NUMA nodes consist of different types of memory, such
as local DRAM and CXL memory together, the current round-robin
based interleaving policy doesn't maximize the overall bandwidth because
of their different bandwidth characteristics.

Instead, the interleaving can be more efficient when the allocation
policy follows each NUMA node's bandwidth weight rather than having 1:1
round-robin allocation.

This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
which enables weighted interleaving between NUMA nodes.  Weighted
interleave allows for a proportional distribution of memory across
multiple NUMA nodes, preferably apportioned to match the bandwidth
capacity of each node from the perspective of the accessing node.

For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
with a relative bandwidth of (100GB/s, 50GB/s) respectively, the
appropriate weight distribution is (2:1).

Weights will be acquired from the global weight matrix exposed by the
sysfs extension: /sys/kernel/mm/mempolicy/weighted_interleave/

The policy will then allocate the number of pages according to the
set weights.  For example, if the weights are (2,1), then 2 pages
will be allocated on node0 for every 1 page allocated on node1.

The new MPOL_WEIGHTED_INTERLEAVE mode can be used in set_mempolicy(2)
and mbind(2).
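
The (2,1) example can be walked through with a small userspace
simulation. This is only a sketch of the round-based behaviour
described above, not the allocator code from the patch: NR_NODES, the
struct wil_state bookkeeping, and the helpers wil_next_node() and
wil_count() are hypothetical names, and every node is assumed to be in
the policy nodemask.

```c
#include <assert.h>

#define NR_NODES 2	/* assumption: two memory nodes, both in the nodemask */

/* Hypothetical per-task interleave state. */
struct wil_state {
	int cur_node;			/* node currently receiving pages */
	unsigned char remaining;	/* pages left for cur_node this round */
};

/* Return the node that should receive the next page. */
static int wil_next_node(struct wil_state *st, const unsigned char *weights)
{
	int node;

	assert(weights);

	if (st->remaining == 0) {
		/* Weight exhausted: move to the next node and reload. */
		st->cur_node = (st->cur_node + 1) % NR_NODES;
		st->remaining = weights[st->cur_node] ?
				weights[st->cur_node] : 1;
	}

	node = st->cur_node;
	st->remaining--;
	return node;
}

/* Count how many of nr_pages allocations land on each node. */
static void wil_count(const unsigned char *weights, int nr_pages,
		      int *per_node)
{
	/* Start "before" node 0 so the first allocation reloads node 0. */
	struct wil_state st = { .cur_node = NR_NODES - 1, .remaining = 0 };
	int i;

	for (i = 0; i < NR_NODES; i++)
		per_node[i] = 0;
	for (i = 0; i < nr_pages; i++)
		per_node[wil_next_node(&st, weights)]++;
}
```

With weights (2,1), the allocation order produced by this sketch is
node0, node0, node1, repeating, so 6 pages end up as 4 on node0 and 2
on node1.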
=====================================================================
(Patches 3-6) Refactoring mempolicy for code-reuse

To avoid multiple paths of mempolicy creation, we should refactor the
existing code to enable the designed extensibility, and refactor
existing users to utilize the new interface (while retaining the
existing userland interface).

This set of patches introduces a new mempolicy_args structure, which
is used to more fully describe a requested mempolicy - to include
existing and future extensions.

/*
 * Describes settings of a mempolicy during set/get syscalls and
 * kernel internal calls to do_set_mempolicy()
 */
struct mempolicy_args {
    unsigned short mode;            /* policy mode */
    unsigned short mode_flags;      /* policy mode flags */
    nodemask_t *policy_nodes;       /* get/set/mbind */
    int policy_node;                /* get: policy node information */
    unsigned long addr;             /* get: vma address */
    int addr_node;                  /* get: node the address belongs to */
    int home_node;                  /* mbind: use MPOL_MF_HOME_NODE */
    unsigned char *il_weights;      /* for mode MPOL_WEIGHTED_INTERLEAVE */
};

This arg structure will eventually be utilized by the following
interfaces:
    mpol_new() - new mempolicy creation
    do_get_mempolicy() - acquiring information about mempolicy
    do_set_mempolicy() - setting the task mempolicy
    do_mbind()         - setting a vma mempolicy

do_get_mempolicy() is completely refactored to break it out into
separate functionality based on the flags provided by get_mempolicy(2)
    MPOL_F_MEMS_ALLOWED: acquires task->mems_allowed
    MPOL_F_ADDR: acquires information on vma policies
    MPOL_F_NODE: changes the output for the policy arg to node info

We refactor the get_mempolicy syscall to flatten the logic based on
these flags, and allow for get_mempolicy2() to re-use the underlying
logic.
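
The flag-based split can be pictured as a small dispatcher. The sketch
below is not taken from the patches: the enum get_path values and the
helper classify_get_flags() are hypothetical names. The flag values
mirror the existing uapi definitions, and the rule that
MPOL_F_MEMS_ALLOWED may not be combined with the other two flags
follows the documented get_mempolicy(2) behaviour.

```c
/* Flag values as defined in include/uapi/linux/mempolicy.h */
#define MPOL_F_NODE		(1 << 0)
#define MPOL_F_ADDR		(1 << 1)
#define MPOL_F_MEMS_ALLOWED	(1 << 2)

/* Hypothetical labels for the separate code paths. */
enum get_path {
	GET_INVALID,		/* unsupported flag combination */
	GET_MEMS_ALLOWED,	/* report task->mems_allowed */
	GET_TASK_POLICY,	/* report the task policy mode */
	GET_TASK_NODE,		/* report node info for the task policy */
	GET_VMA_POLICY,		/* report the policy covering 'addr' */
	GET_VMA_NODE,		/* report the node backing 'addr' */
};

/* Decide which path a get_mempolicy-style request should take. */
static enum get_path classify_get_flags(unsigned long flags)
{
	const unsigned long known =
		MPOL_F_NODE | MPOL_F_ADDR | MPOL_F_MEMS_ALLOWED;

	if (flags & ~known)
		return GET_INVALID;

	/* MPOL_F_MEMS_ALLOWED must be used on its own. */
	if (flags & MPOL_F_MEMS_ALLOWED)
		return flags == MPOL_F_MEMS_ALLOWED ?
			GET_MEMS_ALLOWED : GET_INVALID;

	if (flags & MPOL_F_ADDR)
		return (flags & MPOL_F_NODE) ? GET_VMA_NODE : GET_VMA_POLICY;

	return (flags & MPOL_F_NODE) ? GET_TASK_NODE : GET_TASK_POLICY;
}
```

Once each combination maps to exactly one path, each path can become
its own function that fills in the corresponding mempolicy_args
fields, which is the kind of flattening described above.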

The result of this refactor, and the new mempolicy_args structure, is
that extensions like 'sys_set_mempolicy_home_node' can now be directly
integrated into the initial call to 'set_mempolicy2', and that more
complete information about a mempolicy can be returned with a single
call to 'get_mempolicy2', rather than multiple calls to 'get_mempolicy'.

=====================================================================
(Patches 7-10) set_mempolicy2, get_mempolicy2, mbind2

These interfaces are the 'extended' counterparts to their relatives.
They use the userland 'struct mpol_args' structure to communicate a
complete mempolicy configuration to the kernel.  This structure
looks very much like the kernel-internal 'struct mempolicy_args':

struct mpol_args {
        /* Basic mempolicy settings */
        unsigned short mode;
        unsigned short mode_flags;
        unsigned long *pol_nodes;
        unsigned long pol_maxnodes;

        /* get_mempolicy: policy node information */
        int policy_node;

        /* get_mempolicy: memory range policy */
        unsigned long addr;
        int addr_node;

        /* mbind2: policy home node */
        int home_node;

        /* mbind2: address ranges to apply the policy */
        struct iovec *vec;
        size_t vlen;

        /* weighted interleave settings */
        unsigned char *il_weights;      /* of size pol_maxnodes */
};

The basic mempolicy settings which are shared across all interfaces
are captured at the top of the structure, while extensions such as
'policy_node' and 'addr' are collected beneath.

The syscalls are uniform and defined as follows:

long sys_mbind2(struct mpol_args *args, size_t size,
                unsigned long flags);

long sys_get_mempolicy2(struct mpol_args *args, size_t size,
                        unsigned long flags);

long sys_set_mempolicy2(struct mpol_args *args, size_t size,
                        unsigned long flags);

The 'flags' argument for mbind2 is the same as 'mbind', except with
the addition of MPOL_MF_HOME_NODE to denote whether the 'home_node'
field should be utilized.

The 'flags' argument for get_mempolicy2 is the same as get_mempolicy.

The 'flags' argument is not used by 'set_mempolicy2' at this time, but
may end up allowing the use of MPOL_MF_HOME_NODE if such functionality
is desired.

The extensions can be summed up as follows:

get_mempolicy2 extensions:
    'mode', 'policy_node', and 'addr_node' can now be fetched with
    a single call, rather than multiple calls with a combination of
    flags.
    - 'mode' will always return the policy mode
    - 'policy_node' will replace the functionality of MPOL_F_NODE
    - 'addr_node' will return the node for 'addr' w/ MPOL_F_ADDR

set_mempolicy2:
    - task-local interleave weights can be set via 'il_weights'
      (see next patch)

mbind2:
    - 'home_node' field sets policy home node w/ MPOL_MF_HOME_NODE
    - task-local interleave weights can be set via 'il_weights'
      (see next patch)
    - 'vec' and 'vlen' can be used to operate on multiple memory
      ranges, rather than a single memory range per syscall.

=====================================================================
(Patch 11) set_mempolicy2/mbind2: MPOL_WEIGHTED_INTERLEAVE

This patch shows the explicit extension pattern when adding new
policies to mempolicy2/mbind2.  This adds the 'il_weights' field
to mpol_args and adds the logic to fill in task-local weights.

There are now two ways to weight a mempolicy: global and local.
To denote which mode the task is in, we add the internal flag:
MPOL_F_GWEIGHT /* Utilize global weights */

When MPOL_F_GWEIGHT is set, the global weights are used, and
when it is not set, task-local weights are used.

Example logic:
	if (pol->flags & MPOL_F_GWEIGHT)
		pol_weights = iw_table[numa_node_id()].weights;
	else
		pol_weights = pol->wil.weights;

set_mempolicy is changed to always set MPOL_F_GWEIGHT, since this
syscall is incapable of passing weights via its interfaces, while
set_mempolicy2 sets MPOL_F_GWEIGHT if MPOL_WEIGHTED_INTERLEAVE
is requested but (*il_weights) in mpol_args is null.

The operation of task-local weighting is otherwise exactly the
same - except for what occurs on task migration.

On task migration, the system presently has no way of determining
what the new weights "should be", or what the user "intended".

For this reason, we default all weights to '1' and do not allow
weights to be '0'.  This means, should a migration occur where
one or more nodes appear in the nodemask, the effective weight
for that node will be '1'.  This avoids a potential allocation
failure condition if a migration occurs and introduces a node
which otherwise did not have a weight.

For this reason, users should use task-local weighting when
migrations are not expected, and global weighting when migrations
are expected or possible.

Suggested-by: Gregory Price <gregory.price@xxxxxxxxxxxx>
Suggested-by: Johannes Weiner <hannes@xxxxxxxxxxx>
Suggested-by: Hasan Al Maruf <hasanalmaruf@xxxxxx>
Suggested-by: Hao Wang <haowang3@xxxxxx>
Suggested-by: Ying Huang <ying.huang@xxxxxxxxx>
Suggested-by: Dan Williams <dan.j.williams@xxxxxxxxx>
Suggested-by: Michal Hocko <mhocko@xxxxxxxx>
Suggested-by: tj <tj@xxxxxxxxxx>
Suggested-by: Zhongkun He <hezhongkun.hzk@xxxxxxxxxxxxx>
Suggested-by: Frank van der Linden <fvdl@xxxxxxxxxx>
Suggested-by: John Groves <john@xxxxxxxxxxxxxx>
Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@xxxxxxxxxx>
Suggested-by: Srinivasulu Thanneeru <sthanneeru@xxxxxxxxxx>
Suggested-by: Ravi Jonnalagadda <ravis.opensrc@xxxxxxxxxx>
Suggested-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
Signed-off-by: Gregory Price <gregory.price@xxxxxxxxxxxx>

Gregory Price (9):
  mm/mempolicy: refactor sanitize_mpol_flags for reuse
  mm/mempolicy: create struct mempolicy_args for creating new
    mempolicies
  mm/mempolicy: refactor kernel_get_mempolicy for code re-use
  mm/mempolicy: allow home_node to be set by mpol_new
  mm/mempolicy: add userland mempolicy arg structure
  mm/mempolicy: add set_mempolicy2 syscall
  mm/mempolicy: add get_mempolicy2 syscall
  mm/mempolicy: add the mbind2 syscall
  mm/mempolicy: extend set_mempolicy2 and mbind2 to support weighted
    interleave

Rakie Kim (2):
  mm/mempolicy: implement the sysfs-based weighted_interleave interface
  mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted
    interleaving

 .../ABI/testing/sysfs-kernel-mm-mempolicy     |  33 +
 ...fs-kernel-mm-mempolicy-weighted-interleave |  35 +
 .../admin-guide/mm/numa_memory_policy.rst     |  85 ++
 arch/alpha/kernel/syscalls/syscall.tbl        |   3 +
 arch/arm/tools/syscall.tbl                    |   3 +
 arch/m68k/kernel/syscalls/syscall.tbl         |   3 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |   3 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |   3 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |   3 +
 arch/parisc/kernel/syscalls/syscall.tbl       |   3 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |   3 +
 arch/s390/kernel/syscalls/syscall.tbl         |   3 +
 arch/sh/kernel/syscalls/syscall.tbl           |   3 +
 arch/sparc/kernel/syscalls/syscall.tbl        |   3 +
 arch/x86/entry/syscalls/syscall_32.tbl        |   3 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   3 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |   3 +
 include/linux/mempolicy.h                     |  21 +
 include/linux/syscalls.h                      |   6 +
 include/uapi/asm-generic/unistd.h             |   8 +-
 include/uapi/linux/mempolicy.h                |  27 +-
 mm/mempolicy.c                                | 960 ++++++++++++++++--
 22 files changed, 1103 insertions(+), 114 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave

--
2.39.1