Michal Hocko <mhocko@xxxxxxxx> writes:

> On Tue 16-11-21 12:12:37, Aneesh Kumar K.V wrote:
>> This syscall can be used to set a home node for the MPOL_BIND
>> and MPOL_PREFERRED_MANY memory policy. Users should use this
>> syscall after setting up a memory policy for the specified range
>> as shown below.
>>
>>   mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
>>         new_nodes->size + 1, 0);
>>   sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
>>                               home_node, 0);
>>
>> The syscall allows specifying a home node/preferred node from which the
>> kernel will fulfill memory allocation requests first.
>>
>> For an address range with the MPOL_BIND memory policy, if the nodemask
>> specifies more than one node, page allocations will come from the node in
>> the nodemask with sufficient free memory that is closest to the home
>> node/preferred node.
>>
>> For MPOL_PREFERRED_MANY, if the nodemask specifies more than one node,
>> page allocation will come from the node in the nodemask with sufficient
>> free memory that is closest to the home node/preferred node. If there is
>> not enough memory in all the nodes specified in the nodemask, the
>> allocation will be attempted from the NUMA node closest to the home node
>> in the system.
>>
>> This helps applications hint at a memory allocation preference node
>> and fall back to _only_ a set of nodes if the memory is not available
>> on the preferred node. Fallback allocation is attempted from the node
>> which is nearest to the preferred node.
>>
>> This gives applications control over which NUMA nodes memory is
>> allocated from and avoids the default fallback to slow memory NUMA nodes.
>> For example, on a system with NUMA nodes 1, 2 and 3 with DRAM memory
>> and nodes 10, 11 and 12 with slow memory:
>>
>>   new_nodes = numa_bitmask_alloc(nr_nodes);
>>
>>   numa_bitmask_setbit(new_nodes, 1);
>>   numa_bitmask_setbit(new_nodes, 2);
>>   numa_bitmask_setbit(new_nodes, 3);
>>
>>   p = mmap(NULL, nr_pages * page_size, protflag, mapflag, -1, 0);
>>   mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp, new_nodes->size + 1, 0);
>>
>>   sys_set_mempolicy_home_node(p, nr_pages * page_size, 2, 0);
>>
>> This will allocate from nodes closer to node 2 and will make sure the
>> kernel only allocates from nodes 1, 2 and 3. Memory will not be
>> allocated from slow memory nodes 10, 11 and 12.
>
> I think you are not really explaining why the home node is really needed
> for that usecase. You can limit memory access to those nodes even
> without the home node. Why the default local node is insufficient is
> really the missing part in the explanation.
>
> One usecase would be CPU-less nodes and their preference for the
> allocation. If there are others, make sure to mention them in the
> changelog.

Will add this.

>> With MPOL_PREFERRED_MANY, on the other hand, the kernel will first try
>> to allocate from the node closest to node 2 out of the node list 1, 2
>> and 3. If those nodes don't have enough memory, the kernel will allocate
>> from whichever of the slow memory nodes 10, 11 and 12 is closest to
>> node 2.
>>
>> Cc: Ben Widawsky <ben.widawsky@xxxxxxxxx>
>> Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
>> Cc: Feng Tang <feng.tang@xxxxxxxxx>
>> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
>> Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
>> Cc: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
>> Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
>> Cc: Randy Dunlap <rdunlap@xxxxxxxxxxxxx>
>> Cc: Vlastimil Babka <vbabka@xxxxxxx>
>> Cc: Andi Kleen <ak@xxxxxxxxxxxxxxx>
>> Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
>> Cc: Huang Ying <ying.huang@xxxxxxxxx>
>> Cc: linux-api@xxxxxxxxxxxxxxx
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxx>
>> ---
>>  .../admin-guide/mm/numa_memory_policy.rst | 14 ++++-
>>  include/linux/mempolicy.h                 |  1 +
>>  mm/mempolicy.c                            | 62 +++++++++++++++++++
>>  3 files changed, 76 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
>> index 64fd0ba0d057..6eab52d4c3b2 100644
>> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
>> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
>> @@ -408,7 +408,7 @@ follows:
>>  Memory Policy APIs
>>  ==================
>>
>> -Linux supports 3 system calls for controlling memory policy. These APIS
>> +Linux supports 4 system calls for controlling memory policy. These APIs
>>  always affect only the calling task, the calling task's address space, or
>>  some shared object mapped into the calling task's address space.
>>
>> @@ -460,6 +460,18 @@ requested via the 'flags' argument.
>>
>>  See the mbind(2) man page for more details.
>>
>> +Set home node for a Range of Task's Address Space::
>> +
>> +	long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
>> +					 unsigned long home_node,
>> +					 unsigned long flags);
>> +
>> +sys_set_mempolicy_home_node sets the home node for a VMA policy present in the
>> +task's address range. The system call updates the home node only for the existing
>> +mempolicy range.
>> +Other address ranges are ignored.
>
>> +A home node is the NUMA node
>> +closest to which page allocation will come from.
>
> I would rephrase:
>
>   The home node overrides the default allocation policy to allocate memory
>   close to the local node for an executing CPU.

ok

> [...]
>
>> +SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
>> +		unsigned long, home_node, unsigned long, flags)
>> +{
>> +	struct mm_struct *mm = current->mm;
>> +	struct vm_area_struct *vma;
>> +	struct mempolicy *new;
>> +	unsigned long vmstart;
>> +	unsigned long vmend;
>> +	unsigned long end;
>> +	int err = -ENOENT;
>> +
>> +	if (start & ~PAGE_MASK)
>> +		return -EINVAL;
>> +	/*
>> +	 * flags is used for future extension if any.
>> +	 */
>> +	if (flags != 0)
>> +		return -EINVAL;
>> +
>> +	if (!node_online(home_node))
>> +		return -EINVAL;
>
> You really want to check the home_node before dereferencing the mask.

Any reason why we want to check for the home node first?

>> +
>> +	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
>> +	end = start + len;
>> +
>> +	if (end < start)
>> +		return -EINVAL;
>> +	if (end == start)
>> +		return 0;
>> +	mmap_write_lock(mm);
>> +	vma = find_vma(mm, start);
>> +	for (; vma && vma->vm_start < end; vma = vma->vm_next) {
>> +
>> +		vmstart = max(start, vma->vm_start);
>> +		vmend = min(end, vma->vm_end);
>> +		new = mpol_dup(vma_policy(vma));
>> +		if (IS_ERR(new)) {
>> +			err = PTR_ERR(new);
>> +			break;
>> +		}
>> +		/*
>> +		 * Only update home node if there is an existing vma policy
>> +		 */
>> +		if (!new)
>> +			continue;
>
> Your changelog only mentions MPOL_BIND and MPOL_PREFERRED_MANY as
> supported but you seem to be applying the home node to all existing
> policies.

The restriction is done in policy_node.
@@ -1801,6 +1856,11 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
 		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
 	}

+	if ((policy->mode == MPOL_BIND ||
+	     policy->mode == MPOL_PREFERRED_MANY) &&
+	    policy->home_node != NUMA_NO_NODE)
+		return policy->home_node;
+
 	return nd;
 }

>
>> +		new->home_node = home_node;
>> +		err = mbind_range(mm, vmstart, vmend, new);
>> +		if (err)
>> +			break;
>> +	}
>> +	mmap_write_unlock(mm);
>> +	return err;
>> +}
>> +
>
> Other than that I do not see any major issues.
> --
> Michal Hocko
> SUSE Labs

-aneesh