Re: [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall

Michal Hocko <mhocko@xxxxxxxx> writes:

> On Tue 16-11-21 12:12:37, Aneesh Kumar K.V wrote:
>> This syscall can be used to set a home node for the MPOL_BIND
>> and MPOL_PREFERRED_MANY memory policy. Users should use this
>> syscall after setting up a memory policy for the specified range
>> as shown below.
>> 
>> mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
>> 	    new_nodes->size + 1, 0);
>> sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
>> 				  home_node, 0);
>> 
>> The syscall allows specifying a home node/preferred node from which kernel
>> will fulfill memory allocation requests first.
>> 
>> For address range with MPOL_BIND memory policy, if nodemask specifies more
>> than one node, page allocations will come from the node in the nodemask
>> with sufficient free memory that is closest to the home node/preferred node.
>> 
>> For MPOL_PREFERRED_MANY if the nodemask specifies more than one node,
>> page allocation will come from the node in the nodemask with sufficient
>> free memory that is closest to the home node/preferred node. If there is
>> not enough memory in all the nodes specified in the nodemask, the allocation
>> will be attempted from the closest numa node to the home node in the system.
>> 
>> This helps applications to hint at a memory allocation preference node
>> and fallback to _only_ a set of nodes if the memory is not available
>> on the preferred node.  Fallback allocation is attempted from the node which is
>> nearest to the preferred node.
>> 
>> This helps applications to have control on memory allocation numa nodes and
>> avoids default fallback to slow memory NUMA nodes. For example, consider a system
>> with NUMA nodes 1, 2 and 3 with DRAM memory and nodes 10, 11 and 12 with slow memory:
>> 
>>  new_nodes = numa_bitmask_alloc(nr_nodes);
>> 
>>  numa_bitmask_setbit(new_nodes, 1);
>>  numa_bitmask_setbit(new_nodes, 2);
>>  numa_bitmask_setbit(new_nodes, 3);
>> 
>>  p = mmap(NULL, nr_pages * page_size, protflag, mapflag, -1, 0);
>>  mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,  new_nodes->size + 1, 0);
>> 
>>  sys_set_mempolicy_home_node(p, nr_pages * page_size, 2, 0);
>> 
>> This will allocate from nodes closer to node 2 and will make sure the kernel will
>> only allocate from nodes 1, 2 and 3. Memory will not be allocated from slow memory
>> nodes 10, 11 and 12.
>
> I think you are not really explaining why the home node is really needed
> for that usecase. You can limit memory access to those nodes even
> without the home node. Why the default local node is insufficient is
> really a missing part in the explanation.
>
> One usecase would be cpu less nodes and their preference for the
> allocation. If there are others make sure to mention them in the
> changelog.

Will add this.
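
To make the CPU-less node case concrete, here is a minimal user-space sketch, not kernel code. It models "pick the allowed node with free memory that is nearest to a starting node". The distance table, NR_NODES and nearest_allowed_node() are hypothetical names made up for illustration; nodes 2 and 3 are assumed to be CPU-less memory nodes.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define NR_NODES 4

/*
 * Hypothetical distance table. Nodes 0 and 1 are assumed to have CPUs,
 * nodes 2 and 3 are assumed to be CPU-less memory nodes.
 */
static const int distance[NR_NODES][NR_NODES] = {
	{10, 20, 40, 80},
	{20, 10, 80, 40},
	{40, 80, 10, 20},
	{80, 40, 20, 10},
};

/*
 * Return the node in allowed_mask that has free memory and is nearest
 * to 'from', or -1 if no allowed node has free memory.
 */
static int nearest_allowed_node(int from, unsigned int allowed_mask,
				const bool has_free[NR_NODES])
{
	int best = -1;
	int best_dist = INT_MAX;

	assert(from >= 0 && from < NR_NODES);

	for (int n = 0; n < NR_NODES; n++) {
		if (!(allowed_mask & (1u << n)) || !has_free[n])
			continue;
		if (distance[from][n] < best_dist) {
			best_dist = distance[from][n];
			best = n;
		}
	}
	return best;
}
```

With a nodemask of {2, 3} and the task running on a CPU of node 0, starting the search from the local node always lands on node 2 first. A CPU-less node such as node 3 can never be the local node, so starting the search from a home node of 3 is what lets the application prefer it in this model.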

>
>> MPOL_PREFERRED_MANY, on the other hand, will first try to allocate from the
>> node closest to node 2 among nodes 1, 2 and 3. If those nodes don't have
>> enough memory, the kernel will allocate from whichever of the slow memory nodes
>> 10, 11 and 12 is closest to node 2.
>> 
>> Cc: Ben Widawsky <ben.widawsky@xxxxxxxxx>
>> Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
>> Cc: Feng Tang <feng.tang@xxxxxxxxx>
>> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
>> Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
>> Cc: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
>> Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
>> Cc: Randy Dunlap <rdunlap@xxxxxxxxxxxxx>
>> Cc: Vlastimil Babka <vbabka@xxxxxxx>
>> Cc: Andi Kleen <ak@xxxxxxxxxxxxxxx>
>> Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
>> Cc: Huang Ying <ying.huang@xxxxxxxxx>
>> Cc: linux-api@xxxxxxxxxxxxxxx
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxx>
>> ---
>>  .../admin-guide/mm/numa_memory_policy.rst     | 14 ++++-
>>  include/linux/mempolicy.h                     |  1 +
>>  mm/mempolicy.c                                | 62 +++++++++++++++++++
>>  3 files changed, 76 insertions(+), 1 deletion(-)
>> 
>> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
>> index 64fd0ba0d057..6eab52d4c3b2 100644
>> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
>> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
>> @@ -408,7 +408,7 @@ follows:
>>  Memory Policy APIs
>>  ==================
>>  
>> -Linux supports 3 system calls for controlling memory policy.  These APIS
>> +Linux supports 4 system calls for controlling memory policy.  These APIS
>>  always affect only the calling task, the calling task's address space, or
>>  some shared object mapped into the calling task's address space.
>>  
>> @@ -460,6 +460,18 @@ requested via the 'flags' argument.
>>  
>>  See the mbind(2) man page for more details.
>>  
>> +Set home node for a Range of Task's Address Space::
>> +
>> +	long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
>> +  					 unsigned long home_node,
>> +					 unsigned long flags);
>> +
>> +sys_set_mempolicy_home_node sets the home node for a VMA policy present in the
>> +task's address range. The system call updates the home node only for the existing
>> +mempolicy range. Other address ranges are ignored.
>
>> A home node is the NUMA node
>> +closest to which page allocation will come from.
>
> I would rephrase:
> The home node overrides the default allocation policy to allocate memory
> close to the local node for an executing CPU.
>

ok

> [...]
>
>> +SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
>> +		unsigned long, home_node, unsigned long, flags)
>> +{
>> +	struct mm_struct *mm = current->mm;
>> +	struct vm_area_struct *vma;
>> +	struct mempolicy *new;
>> +	unsigned long vmstart;
>> +	unsigned long vmend;
>> +	unsigned long end;
>> +	int err = -ENOENT;
>> +
>> +	if (start & ~PAGE_MASK)
>> +		return -EINVAL;
>> +	/*
>> +	 * flags is used for future extension if any.
>> +	 */
>> +	if (flags != 0)
>> +		return -EINVAL;
>> +
>> +	if (!node_online(home_node))
>> +		return -EINVAL;
>
> You really want to check the home_node before dereferencing the mask.
>

Any reason why we want to check for home node first?
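
One possible reading of the comment, not confirmed anywhere in this thread: home_node comes straight from user space, and node_online() ends up testing a bit in a fixed-size node mask, so a value beyond the mask's range would have to be rejected before the bit test. The sketch below is a user-space model of that ordering only. MODEL_MAX_NODES, struct model_nodemask and both helpers are hypothetical stand-ins, not kernel APIs.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MODEL_MAX_NODES 64	/* hypothetical stand-in for MAX_NUMNODES */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MODEL_WORDS ((MODEL_MAX_NODES + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct model_nodemask {
	unsigned long bits[MODEL_WORDS];
};

/*
 * Unchecked bit test: indexing with node >= MODEL_MAX_NODES would read
 * past the end of the array, so the precondition is asserted here.
 */
static bool model_node_isset(const struct model_nodemask *m,
			     unsigned long node)
{
	assert(node < MODEL_MAX_NODES);
	return (m->bits[node / BITS_PER_WORD] >> (node % BITS_PER_WORD)) & 1UL;
}

/* Range-check the user-supplied value first, then consult the mask. */
static bool model_home_node_valid(const struct model_nodemask *online,
				  unsigned long home_node)
{
	if (home_node >= MODEL_MAX_NODES)
		return false;
	return model_node_isset(online, home_node);
}
```

In this model an out-of-range home_node is simply reported as invalid, which would correspond to returning -EINVAL in the syscall.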

>> +
>> +	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
>> +	end = start + len;
>> +
>> +	if (end < start)
>> +		return -EINVAL;
>> +	if (end == start)
>> +		return 0;
>> +	mmap_write_lock(mm);
>> +	vma = find_vma(mm, start);
>> +	for (; vma && vma->vm_start < end;  vma = vma->vm_next) {
>> +
>> +		vmstart = max(start, vma->vm_start);
>> +		vmend   = min(end, vma->vm_end);
>> +		new = mpol_dup(vma_policy(vma));
>> +		if (IS_ERR(new)) {
>> +			err = PTR_ERR(new);
>> +			break;
>> +		}
>> +		/*
>> +		 * Only update home node if there is an existing vma policy
>> +		 */
>> +		if (!new)
>> +			continue;
>
> Your changelog only mentions MPOL_BIND and MPOL_PREFERRED_MANY as
> supported but you seem to be applying the home node to all existing
> policies.


The restriction is enforced in policy_node():

@@ -1801,6 +1856,11 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
	}

+	if ((policy->mode == MPOL_BIND ||
+	     policy->mode == MPOL_PREFERRED_MANY) &&
+	    policy->home_node != NUMA_NO_NODE)
+		return policy->home_node;
+
	return nd;
 }
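
The effect of the hunk above can be sketched in user space as follows. This models only the added branch, not the rest of policy_node(). The enum, struct model_policy and model_start_node() are hypothetical names for illustration, and MODEL_NO_NODE stands in for NUMA_NO_NODE.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NO_NODE (-1)	/* stand-in for NUMA_NO_NODE */

/* Simplified stand-ins for the MPOL_* modes. */
enum model_mode {
	MODEL_DEFAULT,
	MODEL_PREFERRED,
	MODEL_BIND,
	MODEL_INTERLEAVE,
	MODEL_PREFERRED_MANY,
};

struct model_policy {
	enum model_mode mode;
	int home_node;
};

/*
 * Return the node the allocation should start from. The home node is
 * honoured only for BIND and PREFERRED_MANY, and only when it is set.
 * For every other mode the caller's node 'nd' is returned unchanged.
 */
static int model_start_node(const struct model_policy *policy, int nd)
{
	assert(policy != NULL);

	if ((policy->mode == MODEL_BIND ||
	     policy->mode == MODEL_PREFERRED_MANY) &&
	    policy->home_node != MODEL_NO_NODE)
		return policy->home_node;

	return nd;
}
```

So even though the syscall stores home_node in any existing VMA policy, in this model an interleave policy with home_node set still starts from nd.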




>
>> +		new->home_node = home_node;
>> +		err = mbind_range(mm, vmstart, vmend, new);
>> +		if (err)
>> +			break;
>> +	}
>> +	mmap_write_unlock(mm);
>> +	return err;
>> +}
>> +
>
> Other than that I do not see any major issues.
> -- 
> Michal Hocko
> SUSE Labs


-aneesh


