Hi all,

I am sending this to pull together notes from a few months of different
proposals on mempolicy extensions, and some of the feedback that has
been given.

There has been some consensus on the implementation of new mempolicy
syscalls, which also gives us an opportunity to clean up some warts in
the old ones. This summarizes the current plan.

Hoping to get some feedback on the interfaces before we get too far
into building out the whole thing. Ultimately the extensions proposed
here are not too complicated, but locking in syscall interfaces is
always contentious.

==========================================================

Background:

Presently there exist 4 syscall interfaces with which a user can
interact with the current task mempolicy, or the current task's vma
mempolicies:

set_mempolicy : set a task's mempolicy
get_mempolicy : 4 possible operations w/ some variants
    * fetch task->mems_allowed
    * fetch task->policy (mode, flag, nodemask)
    * fetch vma->policy for address
      * possibly the node for that address as well
    * fetch next interleave node for mempolicy
set_mempolicy_home_node : set VMA's home node and possibly migrate
mbind : set VMA mempolicy and possibly migrate

There have been a few proposals for extending this:
1) make mempolicies modifiable from external tasks (pidfd)
   - both task and vma policies
2) make mempolicy more extensible (weighted interleave)
   - requires new fields, and therefore new syscall
3) Clean up set_mempolicy flag warts

This RFC proposes the following interfaces:
    set_mempolicy2
    get_mempolicy2
    get_vma_mempolicy
    process_set_mempolicy
    process_get_mempolicy
    process_get_vma_mempolicy

And implies the eventual creation of the following:
    set_vma_mempolicy
    process_set_vma_mempolicy

This RFC captures 4 different, but related proposals:
    Proposal 1: new interfaces should reduce flags
    Proposal 2: making the mempolicies externally modifiable
    Proposal 3: new syscall interfaces
    Proposal 4: weighted interleave

==========================================================

Proposal 1: Reducing flags in favor of explicit interfaces

Right now, set_mempolicy and get_mempolicy are two very different
interfaces with very different behaviors based on mode flags and
syscall flags. Depending on the flags, they may not even access the
mempolicy at all!

For example, the behavior of get_mempolicy can be changed to retrieve
task->mems_allowed, or to replace the policy argument output with a
node.

Both of these are awful warts. One expects a get_mempolicy syscall
to... get the mempolicy. Instead, mems_allowed is more of a
cgroups/cpuset policy - which, while related to and affecting the
mempolicy, is not actually part of (task->mempolicy) or a vma
mempolicy.

The defines for these flags are also confusing and not intuitive at a
glance:

Mode flags: or'd into the *mode* in set_mempolicy(mode, ...)
    #define MPOL_F_STATIC_NODES     (1 << 15)
    #define MPOL_F_RELATIVE_NODES   (1 << 14)
    #define MPOL_F_NUMA_BALANCING   (1 << 13)

get_mempolicy syscall flags: passed into get_mempolicy as an arg
    #define MPOL_F_NODE             (1<<0)
    #define MPOL_F_ADDR             (1<<1)
    #define MPOL_F_MEMS_ALLOWED     (1<<2)

I'd like to do away with as many of the operation flags as possible,
if not all of them, and define new interfaces with explicit extensions
in an argument structure and/or separate syscalls altogether to
separate out the functionality.
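To make the mode overloading concrete, here is a minimal user-space
sketch of how the mode flags quoted above share one int with the mode,
and what separating them would look like. The flag values are the ones
listed above; everything prefixed ex_/EX_ is a hypothetical name made
up for illustration, and the numeric mode value is an assumption, not
taken from this RFC.

```c
#include <assert.h>

/* Mode flags as quoted above: today they are or'd into the mode int. */
#define MPOL_F_STATIC_NODES     (1 << 15)
#define MPOL_F_RELATIVE_NODES   (1 << 14)
#define MPOL_F_NUMA_BALANCING   (1 << 13)

/* Hypothetical helper mask covering all three mode flags. */
#define EX_MODE_FLAGS (MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES | \
                       MPOL_F_NUMA_BALANCING)

/* Assumption: an illustrative numeric value for an interleave mode. */
#define EX_MODE_INTERLEAVE 3

/* Hypothetical split representation, mirroring mode + mode_flags. */
struct ex_split_mode {
    int mode;
    unsigned short mode_flags;
};

/* Old style: the caller packs mode and flags into a single int. */
int ex_pack(int mode, unsigned short mode_flags)
{
    /* A mode value must not collide with the flag bits. */
    assert((mode & EX_MODE_FLAGS) == 0);
    return mode | (mode_flags & EX_MODE_FLAGS);
}

/* New style: the two pieces live in separate fields. */
struct ex_split_mode ex_split(int packed)
{
    struct ex_split_mode out;

    out.mode = packed & ~EX_MODE_FLAGS;
    out.mode_flags = (unsigned short)(packed & EX_MODE_FLAGS);
    return out;
}
```

With separate fields the caller never has to know which bits of the
mode are reserved for flags, which is the point of item (a) in the
argument structure discussion.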

Example of the proposed argument structure:

struct mpol_args {
    int mode;
    unsigned short mode_flags;      /* STATIC/RELATIVE_NODES, etc */
    unsigned long *policy_nodemask;
    unsigned long policy_maxnode;
    unsigned long *mems_allowed;    /* replace MPOL_F_MEMS_ALLOWED */
    unsigned long mems_maxnode;
    unsigned long policy_node;      /* replace MPOL_F_NODE */
    unsigned long vma_node;         /* replace MPOL_F_NODE */
    /* Replace MPOL_F_ADDR with a new syscall */
};

In this setup, the following is true:

a) mode and mode_flags operate the same as the original, except that
   the space is no longer shared. Internally, these fields are stored
   separately in `struct mempolicy` anyway... so why is the user
   interface overloading mode? Fix it.

b) MPOL_F_ADDR (op_flag) is deprecated; instead implement
   get_vma_mempolicy(addr, mpol_args, size) (see Proposal 3).

c) MPOL_F_MEMS_ALLOWED is deprecated. Instead, add the field
   (args->mems_allowed). If the pointer is set, fetch the mems
   allowed. If not, don't! (This also lets us refactor the old
   interface in terms of the new one, which is a bonus for
   maintainability).

d) MPOL_F_NODE is deprecated in favor of adding a policy_node field.
   policy_node is filled based on the policy. For example, if the
   active policy is MPOL_INTERLEAVE, we set policy_node to the next
   node.

   if (pol->mode == MPOL_INTERLEAVE)
       policy_node = next_numa_node(task->il_prev, pol->nodes);

e) vma_node is fetched if a vma policy is accessed. vma_node is
   otherwise ignored by get_mempolicy2, and is only filled by
   get_vma_mempolicy. Again, the inclusion in mpol_args allows us to
   refactor the old interface in terms of the new one to reduce
   maintenance issues - but that may come over time.

Ultimately this gives us a single extensible struct with flexibility
on how to develop new interfaces around it.

Questions:

1) Do we use 1 struct (mpol_args), or multiple? Specifically, do we
   define different structures for get/set, and do we define different
   structures for task vs vma mempolicies?

   Here I'm proposing 1 structure, because the core output (mode,
   flags, nodemask) is the same, but the location from where the data
   is fetched is different (task vs vma).

   Additionally, get/set can take the same arg structure and simply
   ignore the non-relevant fields. For example, the set_* interfaces
   would ignore (args->mems_allowed), because that's not mutable from
   this interface.

2) Having handled F_NODE, F_ADDR, and F_MEMS_ALLOWED like above, I
   don't see the need for (args->op_flags), or a flags argument as
   part of the syscall interface - but folks love their flag fields -
   should it stay or be omitted?

   Personally I think this should be omitted, as it can always be
   appended later if really needed.

==========================================================

Proposal 2: Making mempolicy externally modifiable

This has been proposed and suggested by a variety of sources:

pidfd_set_mempolicy
https://lore.kernel.org/linux-mm/20221010094842.4123037-1-hezhongkun.hzk@xxxxxxxxxxxxx/

process_mbind
https://lore.kernel.org/linux-mm/ZV50MX4STKRCohiB@xxxxxxxxxxxxxxxxxx/

*_task_mempolicy syscalls
https://lore.kernel.org/linux-mm/20231122211200.31620-1-gregory.price@xxxxxxxxxxxx/
https://lore.kernel.org/linux-mm/20231122211200.31620-8-gregory.price@xxxxxxxxxxxx/

Unfortunately the mbind/home_node interfaces have to be deferred to a
later patch set, due to the complexities of plumbing the task
reference through multiple sub-systems, but this does not majorly
affect the interface itself or its arguments, just the internal
plumbing.

Context on process_set_mempolicy_home_node and process_mbind:
https://lore.kernel.org/linux-mm/ZWYsth2CtC4Ilvoz@xxxxxxxxxxxx/

For this reason, I would forgo the implementation of
set_vma_mempolicy for the time being - especially since mbind and
set_mempolicy_home_node already exist. The real value here is the
process variant, which we can take time to implement separately.

So the initial proposal is for the following:

/* Set the target task mempolicy */
process_set_mempolicy(int pidfd, struct mpol_args *args, size_t size);

/* Get the target task mempolicy */
process_get_mempolicy(int pidfd, struct mpol_args *args, size_t size);

/* Get the target task's vma mempolicy for address */
process_get_vma_mempolicy(int pidfd, unsigned long addr,
                          struct mpol_args *args, size_t size);

==========================================================

Proposal 3: New Syscalls

We are proposing the addition of 6 total new syscall interfaces in
the initial RFC, with some implications that we may add 2 more in the
future.

Current task interfaces:
    set_mempolicy2(struct mpol_args *args, size_t usize);
    get_mempolicy2(struct mpol_args *args, size_t usize);
    get_vma_mempolicy(unsigned long addr, struct mpol_args *args,
                      size_t usize);

Remote task interfaces:
    process_set_mempolicy(int pidfd, struct mpol_args *args,
                          size_t usize);
    process_get_mempolicy(int pidfd, struct mpol_args *args,
                          size_t usize);
    process_get_vma_mempolicy(int pidfd, unsigned long addr,
                              struct mpol_args *args, size_t usize);

These interfaces allow for the retrieval of task or vma policies, but
only allow for the modification of *task* policies.

mbind/set_mempolicy_home_node allow for the mempolicy of VMAs to be
set for the current task, and the plumbing of remote-task vma policy
modification requires much deeper consideration (see proposal 2).

==========================================================

Proposal 4: Weighted Interleave

This proposal is to create a weighted interleave policy, either as an
extension of MPOL_INTERLEAVE, or separately
(MPOL_WEIGHTED_INTERLEAVE).

The default weight would be 1 for all possible nodes, so the default
behavior of MPOL_WEIGHTED_INTERLEAVE would be MPOL_INTERLEAVE, which
is why we eventually chose to implement this as just an extension of
MPOL_INTERLEAVE in some of the later RFCs below.

Basically this is summarized as two basic additions:

struct mempolicy {
    ...
    unsigned char cur_weight;           /* weight of current il node */
    unsigned char *interleave_weights;  /* size: MAX_NUMNODES */
};

and

struct mpol_args {
    ...
    unsigned char *interleave_weights;
};

By setting the weights of each node in the nodemask, it's possible to
distribute allocations across those nodes based on the available
bandwidth of those nodes.

For example, node 0 may provide 100GB/s of bandwidth, while node 1 may
only provide 50GB/s of bandwidth. In this case, traditional 1:1
interleave is a sub-optimal distribution of memory. Setting the
weights to 2:1 would match the bandwidth distribution between nodes
(100:50) and therefore be closer to the optimal distribution.

Original N:M implementation:
https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@xxxxxxxxxxx/T/

Original mempolicy RFC:
https://lore.kernel.org/linux-mm/20231003002156.740595-1-gregory.price@xxxxxxxxxxxx/

memtier implementation:
https://lore.kernel.org/linux-mm/20231009204259.875232-1-gregory.price@xxxxxxxxxxxx/

node-weight implementation:
https://lore.kernel.org/linux-mm/20231031003810.4532-1-gregory.price@xxxxxxxxxxxx/

cgroups/memcg implementation:
https://lore.kernel.org/linux-mm/20231109002517.106829-1-gregory.price@xxxxxxxxxxxx/

Summary of some of this work on LWN:
https://lwn.net/Articles/948037/

The general feedback across these RFCs has been that there is no
consensus on where/whether "global weights" should exist, but
consistent consensus that mempolicy->weights seems like a reasonable
idea.

So we are proposing this extension first, before exploring a global
setting more generally.

Open Questions:

1) MPOL_INTERLEAVE extension or new MPOL_WEIGHTED_INTERLEAVE?

2) Global weight location (not relevant for this set of proposals)
   cgroup/memcg    - "doesn't belong"
   cgroup/cpusets  - locking contention issue
   node            - "too broad"
   memtier         - "doesn't make sense / too broad"
   new sysfs entry entirely separate from the above?
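
As a rough illustration of the 2:1 example in this proposal, here is a
small user-space model of how cur_weight and interleave_weights could
drive node selection. This is only a sketch under stated assumptions:
the ex_/EX_ names, the 8-node limit, and the single-word nodemask are
hypothetical, and this is not the kernel implementation.

```c
#include <assert.h>

#define EX_MAX_NODES 8  /* hypothetical small limit for the model */

/* Hypothetical model of the two proposed mempolicy additions. */
struct ex_wil_state {
    unsigned long nodemask;                 /* bit n set => node n in policy */
    unsigned char weights[EX_MAX_NODES];    /* per-node interleave weight */
    int cur_node;                           /* current il node, -1 if none */
    unsigned char cur_weight;               /* allocations left on cur_node */
};

/* Find the next node set in mask after prev, wrapping around. */
static int ex_next_node(unsigned long mask, int prev)
{
    for (int i = 1; i <= EX_MAX_NODES; i++) {
        int n = (prev + i) % EX_MAX_NODES;

        if (mask & (1UL << n))
            return n;
    }
    return -1;
}

void ex_wil_init(struct ex_wil_state *st, unsigned long nodemask,
                 const unsigned char *weights)
{
    assert((nodemask & ((1UL << EX_MAX_NODES) - 1)) != 0);
    st->nodemask = nodemask;
    for (int i = 0; i < EX_MAX_NODES; i++)
        st->weights[i] = weights[i] ? weights[i] : 1;  /* default weight 1 */
    st->cur_node = -1;
    st->cur_weight = 0;
}

/* Pick the node for the next allocation. */
int ex_wil_next(struct ex_wil_state *st)
{
    if (st->cur_weight == 0) {
        /* Current node is used up: advance and reload its weight. */
        st->cur_node = ex_next_node(st->nodemask, st->cur_node);
        st->cur_weight = st->weights[st->cur_node];
    }
    st->cur_weight--;
    return st->cur_node;
}
```

With nodes 0 and 1 in the nodemask and weights 2:1, successive calls to
ex_wil_next() in this model yield 0, 0, 1, 0, 0, 1, so two thirds of
allocations land on the higher-bandwidth node. With all weights left at
1 it degenerates to plain round-robin interleave.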

==========================================================

Summarizing Open questions:

1) single arg struct (struct mpol_args) or multiple?

2) do away with operation flags (F_NODE, F_ADDR, F_MEMS_ALLOWED), or
   keep them? (retained for old interface, just deprecated on new
   one).

3) Should these syscalls take a flags argument (outside mode_flags)?

4) split task and vma mempolicy operations, or single interface for
   both?

5) Are you grossly offended by any of this? Do you have specific
   recommendations?

6) Any specific testing requirements you would like to see to make
   this quick and painless? (ktest, ltp somewhat implied).

==========================================================

Capturing all the suggested-by tags for folks who chimed in on prior
RFCs and Patches.

Suggested-by: Gregory Price <gregory.price@xxxxxxxxxxxx>
Suggested-by: Johannes Weiner <hannes@xxxxxxxxxxx>
Suggested-by: Hasan Al Maruf <hasanalmaruf@xxxxxx>
Suggested-by: Hao Wang <haowang3@xxxxxx>
Suggested-by: Ying Huang <ying.huang@xxxxxxxxx>
Suggested-by: Dan Williams <dan.j.williams@xxxxxxxxx>
Suggested-by: Michal Hocko <mhocko@xxxxxxxx>
Suggested-by: tj <tj@xxxxxxxxxx>
Suggested-by: Zhongkun He <hezhongkun.hzk@xxxxxxxxxxxxx>
Suggested-by: Frank van der Linden <fvdl@xxxxxxxxxx>
Suggested-by: John Groves <john@xxxxxxxxxxxxxx>
Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@xxxxxxxxxx>
Suggested-by: Srinivasulu Thanneeru <sthanneeru@xxxxxxxxxx>
Suggested-by: Ravi Jonnalagadda <ravis.opensrc@xxxxxxxxxx>
Suggested-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
Signed-off-by: Gregory Price <gregory.price@xxxxxxxxxxxx>

Kind Regards,
Gregory Price