On Wed, Apr 13, 2022 at 02:44:34PM -0700, Andrew Morton wrote:
> On Wed, 13 Apr 2022 14:52:01 +0530 Jagdish Gediya <jvgediya@xxxxxxxxxxxxx> wrote:
>
> > Current implementation to find the demotion targets works
> > based on node state N_MEMORY, however some systems may have
> > dram only memory numa node which are N_MEMORY but not the
> > right choices as demotion targets.
>
> Why are they not the right choice? Please describe this fully so we
> can understand the motivation and end-user benefit of the proposed
> change. And please more fully describe the end-user benefits of this
> change.

Some systems (e.g. PowerVM) have DRAM (fast memory) only NUMA nodes,
which are N_MEMORY, as well as slow memory (persistent memory) only
NUMA nodes, which are also N_MEMORY. As the current demotion target
finding algorithm works based on N_MEMORY and best distance, it will
choose a DRAM only NUMA node as the demotion target instead of a
persistent memory node on such systems. If the DRAM only NUMA node is
filled with demoted pages, then at some point new allocations can start
falling to persistent memory. Cold pages then sit in fast memory (due
to demotion) while new pages land in slow memory. This is why
persistent memory nodes should be utilized for demotion, and DRAM nodes
should be avoided for demotion so that they can be used for new
allocations.

The current implementation works fine on systems where memory only NUMA
nodes are possible only for persistent/slow memory, but it is not
suitable for systems like the ones I have mentioned above.

Introducing the new node state N_DEMOTION_TARGETS provides a way to
handle demotion on such systems without affecting the existing
behavior.

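To make the problem concrete, here is a minimal sketch of
nearest-by-distance target selection. The three-node topology, the
distance values and the helper name nearest_targets() are made up for
illustration; they are not taken from a real PowerVM system or from the
kernel code. Node 0 is cpu + DRAM, node 1 is DRAM only and node 2 is
persistent memory only.

```python
# Illustrative sketch only: hypothetical topology and helper, not kernel code.
# Node 0: cpu + DRAM, node 1: DRAM only, node 2: persistent memory only.
DISTANCE = [
    [10, 20, 40],
    [20, 10, 40],
    [40, 40, 10],
]


def nearest_targets(src, distance, candidates):
    """Return the candidate nodes closest to src, sorted by node id."""
    others = [n for n in candidates if n != src]
    if not others:
        return []
    best = min(distance[src][n] for n in others)
    return sorted(n for n in others if distance[src][n] == best)


# Every memory only node is a candidate (N_MEMORY-like behaviour):
# the DRAM only node wins on distance.
by_n_memory = nearest_targets(0, DISTANCE, {1, 2})

# Only the slow node is a candidate (N_DEMOTION_TARGETS-like behaviour).
by_demotion_targets = nearest_targets(0, DISTANCE, {2})

print(by_n_memory, by_demotion_targets)
```

With all memory only nodes as candidates, node 0 picks the DRAM only
node 1. Once the candidate set is limited to the slow node, it picks
node 2, which is the behavior the new node state is meant to allow.
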
> > This patch series introduces the new node state
> > N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> > can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> > is used to hold the list of nodes which can be used as demotion
> > targets, support is also added to set the demotion target
> > list from user space so that default behavior can be overridden.
>
> Permanently extending the kernel ABI is a fairly big deal. Please
> fully explain the end-user value, usage scenarios, etc.
>
> What would go wrong if we simply omitted this interface?

I am going to modify this interface according to review feedback in
the next version, but let me explain why it is needed with examples.

Based on the topology and the memory tiers available in the system,
users may not want to utilize all the demotion targets configured by
the kernel by default, e.g.,

1. To reduce cross socket traffic
2. To use only the slowest memory as demotion targets when there are
   multiple slow memory only nodes available

The current patch series handles option 2 above but doesn't handle
option 1, so the next version will add that support and might use a
different implementation to handle such scenarios.

Example 1
---------

With the NUMA topology below, where nodes 0 & 1 are cpu + dram nodes,
nodes 2 & 3 are equally slower memory only nodes, and node 4 is the
slowest memory only node,

available: 5 nodes (0-4)
node 0 cpus: 0 1
node 0 size: n MB
node 0 free: n MB
node 1 cpus: 2 3
node 1 size: n MB
node 1 free: n MB
node 2 cpus:
node 2 size: n MB
node 2 free: n MB
node 3 cpus:
node 3 size: n MB
node 3 free: n MB
node 4 cpus:
node 4 size: n MB
node 4 free: n MB
node distances:
node   0   1   2   3   4
  0:  10  20  40  40  80
  1:  20  10  40  40  80
  2:  40  40  10  40  80
  3:  40  40  40  10  80
  4:  80  80  80  80  10

this patch series by default prepares the demotion list below,

node    demotion_target
 0              3, 2
 1              3, 2
 2              4
 3              4
 4              X

but the user may want to utilize nodes 2 & 3 only for new allocations
and only node 4 for demotion.

Example 2
---------

With the NUMA topology below, where nodes 0 & 2 are cpu + dram nodes and
node 1 is a slow memory node near node 0,

available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: n MB
node 0 free: n MB
node 1 cpus:
node 1 size: n MB
node 1 free: n MB
node 2 cpus: 2 3
node 2 size: n MB
node 2 free: n MB
node distances:
node   0   1   2
  0:  10  40  20
  1:  40  10  80
  2:  20  80  10

this patch series by default prepares the demotion list below,

node    demotion_target
 0              1
 1              X
 2              1

However, the user may want to avoid node 1 as a demotion target for
node 2 to reduce cross socket traffic.

> > node state N_DEMOTION_TARGETS is also set from the dax kmem
> > driver, certain type of memory which registers through dax kmem
> > (e.g. HBM) may not be the right choices for demotion so in future
> > they should be distinguished based on certain attributes and dax
> > kmem driver should avoid setting them as N_DEMOTION_TARGETS,
> > however current implementation also doesn't distinguish any
> > such memory and it considers all N_MEMORY as demotion targets
> > so this patch series doesn't modify the current behavior.
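
Going back to the two example topologies above, the default lists can
be reproduced with a small pass-based sketch. This is only a simplified
model that I am assuming for illustration, not the code in mm/migrate.c:
the function name build_demotion_lists() and the "used" bookkeeping are
hypothetical. Each pass gives every source node the nearest nodes that
are not yet used, and the targets found in one pass become the sources
of the next pass.

```python
# Illustrative sketch only: a simplified pass-based selection that happens to
# reproduce the default lists in Example 1 and Example 2. Not the patch code.

def build_demotion_lists(distance, cpu_nodes):
    """Map each node to its nearest not-yet-used nodes, pass by pass."""
    nodes = range(len(distance))
    targets = {n: [] for n in nodes}
    used = set(cpu_nodes)          # nodes already placed in a faster tier
    sources = sorted(cpu_nodes)    # first pass starts from cpu + dram nodes
    while sources:
        candidates = [n for n in nodes if n not in used]
        found = set()
        for src in sources:
            if not candidates:
                break
            best = min(distance[src][n] for n in candidates)
            targets[src] = [n for n in candidates if distance[src][n] == best]
            found.update(targets[src])
        used |= found
        sources = sorted(found)    # targets of this pass demote in the next
    return targets


# Distance matrices copied from the two examples.
EXAMPLE_1 = [
    [10, 20, 40, 40, 80],
    [20, 10, 40, 40, 80],
    [40, 40, 10, 40, 80],
    [40, 40, 40, 10, 80],
    [80, 80, 80, 80, 10],
]
EXAMPLE_2 = [
    [10, 40, 20],
    [40, 10, 80],
    [20, 80, 10],
]

lists_1 = build_demotion_lists(EXAMPLE_1, cpu_nodes={0, 1})
lists_2 = build_demotion_lists(EXAMPLE_2, cpu_nodes={0, 2})
print(lists_1)
print(lists_2)
```

An empty list corresponds to "X" in the tables, and the order of equally
distant targets (2, 3 here versus "3, 2" in the table) carries no
meaning in this sketch. It also shows why a distance-only default cannot
express either of the two user preferences: nothing in the distance
matrix says that nodes 2 & 3 should be kept for new allocations, or that
node 2 should not demote across sockets to node 1.
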
> >
> > Current code which sets migration targets is modified in
> > this patch series to avoid some of the limitations on the demotion
> > target sharing and to use N_DEMOTION_TARGETS only nodes while
> > finding demotion targets.
> >
> > Changelog
> > ----------
> >
> > v2:
> > In v1, only 1st patch of this patch series was sent, which was
> > implemented to avoid some of the limitations on the demotion
> > target sharing, however for certain numa topology, the demotion
> > targets found by that patch was not most optimal, so 1st patch
> > in this series is modified according to suggestions from Huang
> > and Baolin. Different examples of demotion list comparison
> > between existing implementation and changed implementation can
> > be found in the commit message of 1st patch.
> >
> > Jagdish Gediya (5):
> >   mm: demotion: Set demotion list differently
> >   mm: demotion: Add new node state N_DEMOTION_TARGETS
> >   mm: demotion: Add support to set targets from userspace
> >   device-dax/kmem: Set node state as N_DEMOTION_TARGETS
> >   mm: demotion: Build demotion list based on N_DEMOTION_TARGETS
> >
> >  .../ABI/testing/sysfs-kernel-mm-numa          | 12 ++++
>
> This description is rather brief. Some additional user-facing material
> under Documentation/ would help. Describe the format for writing to the
> file, what is seen when reading from it, provide a bit of help to the
> user so they can understand how to use it, what effects they might see,
> etc.

Sure, will do in the next version.

> >  drivers/base/node.c                           |  4 ++
> >  drivers/dax/kmem.c                            |  2 +
> >  include/linux/nodemask.h                      |  1 +
> >  mm/migrate.c                                  | 67 +++++++++++++++----
> >  5 files changed, 72 insertions(+), 14 deletions(-)
>