On Thu, Jul 09, 2009 at 02:30:20PM +0100, Mel Gorman wrote:
> On Wed, Jul 08, 2009 at 03:24:46PM -0400, Lee Schermerhorn wrote:
> > [PATCH 2/3] hugetlb: derive huge pages nodes allowed from task mempolicy
> >
> > Against: 25jun09 mmotm atop the "hugetlb: balance freeing..." series
> >
> > V2:
> > + cleaned up comments, removed some deemed unnecessary,
> >   added some suggested by review
> > + removed check for !current in huge_mpol_nodes_allowed().
> > + added 'current->comm' to warning message in huge_mpol_nodes_allowed().
> > + added VM_BUG_ON() assertion in hugetlb.c next_node_allowed() to
> >   catch out of range node id.
> > + added examples to patch description
> >
> > This patch derives a "nodes_allowed" node mask from the NUMA
> > mempolicy of the task modifying the number of persistent huge
> > pages, and uses it to control the allocation, freeing and
> > adjusting of surplus huge pages.  This mask is derived as follows:
> >
> > * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
> >   is produced.  This will cause the hugetlb subsystem to use
> >   node_online_map as the "nodes_allowed".  This preserves the
> >   behavior before this patch.
> > * For "preferred" mempolicy, including explicit local allocation,
> >   a nodemask with the single preferred node will be produced.
> >   "local" policy will NOT track any internode migrations of the
> >   task adjusting nr_hugepages.
> > * For "bind" and "interleave" policy, the mempolicy's nodemask
> >   will be used.
> > * Other than to inform the construction of the nodes_allowed node
> >   mask, the actual mempolicy mode is ignored.  That is, all modes
> >   behave like interleave over the resulting nodes_allowed mask
> >   with no "fallback".
> >
> > Because we may have allocated or freed a huge page with a
> > different policy/nodes_allowed previously, we always need to
> > check that the next_node_to_{alloc|free} exists in the current
> > nodes_allowed mask.  To avoid duplication of code, this is done
> > in the hstate_next_node_to_{alloc|free}() functions.  So,
> > these functions have been modified to allow them to be called
> > to obtain the "start_nid".  Then, whereas prior to this patch
> > we unconditionally called hstate_next_node_to_{alloc|free}(),
> > whether or not we successfully allocated/freed a huge page on
> > the node, now we only call these functions on failure to alloc/free.
> >
> > Notes:
> >
> > 1) This patch introduces a subtle change in behavior: huge page
> >    allocation and freeing will be constrained by any mempolicy
> >    that the task adjusting the huge page pool inherits from its
> >    parent.  This policy could come from a distant ancestor.  The
> >    administrator adjusting the huge page pool without explicitly
> >    specifying a mempolicy via numactl might be surprised by this.
> >    Additionally, any mempolicy specified by numactl will be
> >    constrained by the cpuset in which numactl is invoked.
> >
> > 2) Hugepages allocated at boot time use the node_online_map.
> >    An additional patch could implement a temporary boot time
> >    huge pages nodes_allowed command line parameter.
> >
> > 3) Using mempolicy to control persistent huge page allocation
> >    and freeing requires no change to hugeadm when invoking
> >    it via numactl, as shown in the examples below.  However,
> >    hugeadm could be enhanced to take the allowed nodes as an
> >    argument and set its task mempolicy itself.  This would allow
> >    it to detect and warn about any non-default mempolicy that it
> >    inherited from its parent, thus alleviating the issue described
> >    in Note 1 above.
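
The change to the hstate_next_node_to_{alloc|free}() calling convention
described above is easier to see as code. The allocation side becomes,
roughly, the loop below. This is only a condensed sketch of
alloc_fresh_huge_page() from the hunk further down, with the vm-event
accounting omitted; alloc_one_persistent_page() is an illustrative name,
not a function in the patch:

    static int alloc_one_persistent_page(struct hstate *h,
						nodemask_t *nodes_allowed)
    {
	struct page *page;
	int start_nid, next_nid;

	/*
	 * Returns the first allowed node to try and advances the
	 * hstate's persistent next_nid_to_alloc pointer.
	 */
	start_nid = hstate_next_node_to_alloc(h, nodes_allowed);
	next_nid = start_nid;

	do {
		page = alloc_fresh_huge_page_node(h, next_nid);
		if (page)
			break;	/* success: stop; the nid already advanced */
		/* failure: only now ask for the next allowed node */
		next_nid = hstate_next_node_to_alloc(h, nodes_allowed);
	} while (next_nid != start_nid);	/* give up after one full pass */

	return page != NULL;
    }
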
> >
> > See the updated documentation [next patch] for more information
> > about the implications of this patch.
> >
> > Examples:
> >
> > Starting with:
> >
> > Node 0 HugePages_Total:  0
> > Node 1 HugePages_Total:  0
> > Node 2 HugePages_Total:  0
> > Node 3 HugePages_Total:  0
> >
> > Default behavior [with or without this patch] balances persistent
> > hugepage allocation across nodes [with sufficient contiguous memory]:
> >
> > hugeadm --pool-pages-min=2048Kb:32
> >
> > yields:
> >
> > Node 0 HugePages_Total:  8
> > Node 1 HugePages_Total:  8
> > Node 2 HugePages_Total:  8
> > Node 3 HugePages_Total:  8
> >
> > Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
> > '--membind' because it allows multiple nodes to be specified
> > and it's easy to type]--we can allocate huge pages on
> > individual nodes or sets of nodes.  So, starting from the
> > condition above, with 8 huge pages per node:
> >
> > numactl -m 2 hugeadm --pool-pages-min=2048Kb:+8
> >
> > yields:
> >
> > Node 0 HugePages_Total:  8
> > Node 1 HugePages_Total:  8
> > Node 2 HugePages_Total: 16
> > Node 3 HugePages_Total:  8
> >
> > The incremental 8 huge pages were restricted to node 2 by the
> > specified mempolicy.
> >
> > Similarly, we can use mempolicy to free persistent huge pages
> > from specified nodes:
> >
> > numactl -m 0,1 hugeadm --pool-pages-min=2048Kb:-8
> >
> > yields:
> >
> > Node 0 HugePages_Total:  4
> > Node 1 HugePages_Total:  4
> > Node 2 HugePages_Total: 16
> > Node 3 HugePages_Total:  8
> >
> > The 8 huge pages freed were balanced over nodes 0 and 1.
> >
> > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@xxxxxx>
> >
> 
> Much better changelog.
> 
> Reading through, the main problem I can see is that the reservation
> calculations are still not nodemask aware. This means that mmap() can return
> successfully and the process that called mmap() can then get unexpectedly
> killed because, while there were enough hugepages overall, there were not
> enough in the pools allowed by the nodemask. This is a stability problem for
> those that create hugepage pools on one set of nodes and run applications on
> a subset. Minimally, can this situation be warned about and a note about it
> added to the documentation?
> 
> Testing with it, I couldn't break it as such, but libhugetlbfs is showing up
> an anomaly with the counters tests. Some investigation showed that it was
> because when it shrinks the pool, one page gets accounted for as a surplus
> page, which was unexpected.
> 
> I only got as far as determining that the problem was in the patches that
> free pages in a round-robin fashion, but then ran out of time on the machine.
> I'll see if I can reproduce it using fake-numa on a normal x86-64 instead of
> a real NUMA machine, but maybe you have a fix for this problem already?
> 

Bah, you did. After I hit send, I remembered you sent one out and I didn't
pick it up properly. The counters test works as expected now.

I'll keep testing, but other than the stability problem when running on a
subset of nodes with hugepages, this looks good to me.
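
To make the reservation concern above concrete: once a nodemask is in play,
the reservation path would need something along the lines of the sketch
below before it promises pages to a mapping. This is not part of the patch;
enough_free_huge_pages_in() is a made-up helper, and as far as I can see the
existing hugetlb_acct_memory()/gather_surplus_pages() code only consults the
global counters:

    static bool enough_free_huge_pages_in(struct hstate *h,
						nodemask_t *nodes_allowed,
						long needed)
    {
	long free = 0;
	int nid;

	if (!nodes_allowed)
		nodes_allowed = &node_online_map;

	/* sum the per-node free counts over the allowed nodes only */
	for_each_node_mask(nid, *nodes_allowed)
		free += h->free_huge_pages_node[nid];

	return free >= needed;
    }
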
Reviewed-by: Mel Gorman <mel@xxxxxxxxx>

> >  include/linux/mempolicy.h |    3 +
> >  mm/hugetlb.c              |  101 +++++++++++++++++++++++++++++++---------------
> >  mm/mempolicy.c            |   61 +++++++++++++++++++++++++++
> >  3 files changed, 133 insertions(+), 32 deletions(-)
> > 
> > Index: linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c
> > ===================================================================
> > --- linux-2.6.31-rc1-mmotm-090625-1549.orig/mm/hugetlb.c	2009-07-07 09:58:17.000000000 -0400
> > +++ linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c	2009-07-07 13:25:41.000000000 -0400
> > @@ -621,29 +621,54 @@ static struct page *alloc_fresh_huge_pag
> >  }
> >  
> >  /*
> > + * common helper functions for hstate_next_node_to_{alloc|free}.
> > + * We may have allocated or freed a huge page based on a different
> > + * nodes_allowed previously, so h->next_node_to_{alloc|free} might
> > + * be outside of *nodes_allowed.  Ensure that we use the next
> > + * allowed node for alloc or free.
> > + */
> > +static int next_node_allowed(int nid, nodemask_t *nodes_allowed)
> > +{
> > +	nid = next_node(nid, *nodes_allowed);
> > +	if (nid == MAX_NUMNODES)
> > +		nid = first_node(*nodes_allowed);
> > +	VM_BUG_ON(nid >= MAX_NUMNODES);
> > +
> > +	return nid;
> > +}
> > +
> > +static int this_node_allowed(int nid, nodemask_t *nodes_allowed)
> > +{
> > +	if (!node_isset(nid, *nodes_allowed))
> > +		nid = next_node_allowed(nid, nodes_allowed);
> > +	return nid;
> > +}
> > +
> > +/*
> >   * Use a helper variable to find the next node and then
> >   * copy it back to next_nid_to_alloc afterwards:
> >   * otherwise there's a window in which a racer might
> >   * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node.
> >   * But we don't need to use a spin_lock here: it really
> >   * doesn't matter if occasionally a racer chooses the
> > - * same nid as we do.  Move nid forward in the mask even
> > - * if we just successfully allocated a hugepage so that
> > - * the next caller gets hugepages on the next node.
> > + * same nid as we do.  Move nid forward in the mask whether
> > + * or not we just successfully allocated a hugepage so that
> > + * the next allocation addresses the next node.
> >   */
> >  static int hstate_next_node_to_alloc(struct hstate *h,
> >  					nodemask_t *nodes_allowed)
> >  {
> > -	int next_nid;
> > +	int nid, next_nid;
> >  
> >  	if (!nodes_allowed)
> >  		nodes_allowed = &node_online_map;
> >  
> > -	next_nid = next_node(h->next_nid_to_alloc, *nodes_allowed);
> > -	if (next_nid == MAX_NUMNODES)
> > -		next_nid = first_node(*nodes_allowed);
> > +	nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed);
> > +
> > +	next_nid = next_node_allowed(nid, nodes_allowed);
> >  	h->next_nid_to_alloc = next_nid;
> > -	return next_nid;
> > +
> > +	return nid;
> >  }
> >  
> >  static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
> > @@ -653,15 +678,17 @@ static int alloc_fresh_huge_page(struct
> >  	int next_nid;
> >  	int ret = 0;
> >  
> > -	start_nid = h->next_nid_to_alloc;
> > +	start_nid = hstate_next_node_to_alloc(h, nodes_allowed);
> >  	next_nid = start_nid;
> >  
> >  	do {
> >  		page = alloc_fresh_huge_page_node(h, next_nid);
> > -		if (page)
> > +		if (page) {
> >  			ret = 1;
> > +			break;
> > +		}
> >  		next_nid = hstate_next_node_to_alloc(h, nodes_allowed);
> > -	} while (!page && next_nid != start_nid);
> > +	} while (next_nid != start_nid);
> >  
> >  	if (ret)
> >  		count_vm_event(HTLB_BUDDY_PGALLOC);
> > @@ -672,21 +699,23 @@ static int alloc_fresh_huge_page(struct
> >  }
> >  
> >  /*
> > - * helper for free_pool_huge_page() - find next node
> > - * from which to free a huge page
> > + * helper for free_pool_huge_page() - return the next node
> > + * from which to free a huge page.  Advance the next node id
> > + * whether or not we find a free huge page to free so that the
> > + * next attempt to free addresses the next node.
> >   */
> >  static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
> >  {
> > -	int next_nid;
> > +	int nid, next_nid;
> >  
> >  	if (!nodes_allowed)
> >  		nodes_allowed = &node_online_map;
> >  
> > -	next_nid = next_node(h->next_nid_to_free, *nodes_allowed);
> > -	if (next_nid == MAX_NUMNODES)
> > -		next_nid = first_node(*nodes_allowed);
> > +	nid = this_node_allowed(h->next_nid_to_free, nodes_allowed);
> > +	next_nid = next_node_allowed(nid, nodes_allowed);
> >  	h->next_nid_to_free = next_nid;
> > -	return next_nid;
> > +
> > +	return nid;
> >  }
> >  
> >  /*
> > @@ -702,7 +731,7 @@ static int free_pool_huge_page(struct hs
> >  	int next_nid;
> >  	int ret = 0;
> >  
> > -	start_nid = h->next_nid_to_free;
> > +	start_nid = hstate_next_node_to_free(h, nodes_allowed);
> >  	next_nid = start_nid;
> >  
> >  	do {
> > @@ -724,9 +753,10 @@ static int free_pool_huge_page(struct hs
> >  			}
> >  			update_and_free_page(h, page);
> >  			ret = 1;
> > +			break;
> >  		}
> >  		next_nid = hstate_next_node_to_free(h, nodes_allowed);
> > -	} while (!ret && next_nid != start_nid);
> > +	} while (next_nid != start_nid);
> >  
> >  	return ret;
> >  }
> > @@ -1037,10 +1067,9 @@ int __weak alloc_bootmem_huge_page(struc
> >  		void *addr;
> >  
> >  		addr = __alloc_bootmem_node_nopanic(
> > -				NODE_DATA(h->next_nid_to_alloc),
> > +				NODE_DATA(hstate_next_node_to_alloc(h, NULL)),
> >  				huge_page_size(h), huge_page_size(h), 0);
> >  
> > -		hstate_next_node_to_alloc(h, NULL);	/* always advance nid */
> >  		if (addr) {
> >  			/*
> >  			 * Use the beginning of the huge page to store the
> > @@ -1177,29 +1206,33 @@ static int adjust_pool_surplus(struct hs
> >  	VM_BUG_ON(delta != -1 && delta != 1);
> >  
> >  	if (delta < 0)
> > -		start_nid = h->next_nid_to_alloc;
> > +		start_nid = hstate_next_node_to_alloc(h, nodes_allowed);
> >  	else
> > -		start_nid = h->next_nid_to_free;
> > +		start_nid = hstate_next_node_to_free(h, nodes_allowed);
> >  	next_nid = start_nid;
> >  
> >  	do {
> >  		int nid = next_nid;
> >  		if (delta < 0) {
> > -			next_nid = hstate_next_node_to_alloc(h, nodes_allowed);
> >  			/*
> >  			 * To shrink on this node, there must be a surplus page
> >  			 */
> > -			if (!h->surplus_huge_pages_node[nid])
> > +			if (!h->surplus_huge_pages_node[nid]) {
> > +				next_nid = hstate_next_node_to_alloc(h,
> > +								nodes_allowed);
> >  				continue;
> > +			}
> >  		}
> >  		if (delta > 0) {
> > -			next_nid = hstate_next_node_to_free(h, nodes_allowed);
> >  			/*
> >  			 * Surplus cannot exceed the total number of pages
> >  			 */
> >  			if (h->surplus_huge_pages_node[nid] >=
> > -						h->nr_huge_pages_node[nid])
> > +						h->nr_huge_pages_node[nid]) {
> > +				next_nid = hstate_next_node_to_free(h,
> > +								nodes_allowed);
> >  				continue;
> > +			}
> >  		}
> >  
> >  		h->surplus_huge_pages += delta;
> > @@ -1215,10 +1248,13 @@ static int adjust_pool_surplus(struct hs
> >  static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
> >  {
> >  	unsigned long min_count, ret;
> > +	nodemask_t *nodes_allowed;
> >  
> >  	if (h->order >= MAX_ORDER)
> >  		return h->max_huge_pages;
> >  
> > +	nodes_allowed = huge_mpol_nodes_allowed();
> > +
> >  	/*
> >  	 * Increase the pool size
> >  	 * First take pages out of surplus state.  Then make up the
> > @@ -1232,7 +1268,7 @@ static unsigned long set_max_huge_pages(
> >  	 */
> >  	spin_lock(&hugetlb_lock);
> >  	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
> > -		if (!adjust_pool_surplus(h, NULL, -1))
> > +		if (!adjust_pool_surplus(h, nodes_allowed, -1))
> >  			break;
> >  	}
> >  
> > @@ -1243,7 +1279,7 @@ static unsigned long set_max_huge_pages(
> >  		 * and reducing the surplus.
> >  		 */
> >  		spin_unlock(&hugetlb_lock);
> > -		ret = alloc_fresh_huge_page(h, NULL);
> > +		ret = alloc_fresh_huge_page(h, nodes_allowed);
> >  		spin_lock(&hugetlb_lock);
> >  		if (!ret)
> >  			goto out;
> > @@ -1269,16 +1305,17 @@ static unsigned long set_max_huge_pages(
> >  	min_count = max(count, min_count);
> >  	try_to_free_low(h, min_count);
> >  	while (min_count < persistent_huge_pages(h)) {
> > -		if (!free_pool_huge_page(h, NULL, 0))
> > +		if (!free_pool_huge_page(h, nodes_allowed, 0))
> >  			break;
> >  	}
> >  	while (count < persistent_huge_pages(h)) {
> > -		if (!adjust_pool_surplus(h, NULL, 1))
> > +		if (!adjust_pool_surplus(h, nodes_allowed, 1))
> >  			break;
> >  	}
> >  out:
> >  	ret = persistent_huge_pages(h);
> >  	spin_unlock(&hugetlb_lock);
> > +	kfree(nodes_allowed);
> >  	return ret;
> >  }
> >  
> > Index: linux-2.6.31-rc1-mmotm-090625-1549/mm/mempolicy.c
> > ===================================================================
> > --- linux-2.6.31-rc1-mmotm-090625-1549.orig/mm/mempolicy.c	2009-07-07 09:46:48.000000000 -0400
> > +++ linux-2.6.31-rc1-mmotm-090625-1549/mm/mempolicy.c	2009-07-07 13:48:06.000000000 -0400
> > @@ -1544,6 +1544,67 @@ struct zonelist *huge_zonelist(struct vm
> >  	}
> >  	return zl;
> >  }
> > +
> > +/*
> > + * huge_mpol_nodes_allowed -- mempolicy extension for huge pages.
> > + *
> > + * Returns a [pointer to a] nodelist based on the current task's mempolicy
> > + * to constrain the allocation and freeing of persistent huge pages.
> > + * 'Preferred', 'local' and 'interleave' mempolicy will behave more like
> > + * 'bind' policy in this context.  An attempt to allocate a persistent huge
> > + * page will never "fallback" to another node inside the buddy system
> > + * allocator.
> > + *
> > + * If the task's mempolicy is "default" [NULL], just return NULL for
> > + * default behavior.  Otherwise, extract the policy nodemask for 'bind'
> > + * or 'interleave' policy or construct a nodemask for 'preferred' or
> > + * 'local' policy and return a pointer to a kmalloc()ed nodemask_t.
> > + *
> > + * N.B., it is the caller's responsibility to free a returned nodemask.
> > + */
> > +nodemask_t *huge_mpol_nodes_allowed(void)
> > +{
> > +	nodemask_t *nodes_allowed = NULL;
> > +	struct mempolicy *mempolicy;
> > +	int nid;
> > +
> > +	if (!current->mempolicy)
> > +		return NULL;
> > +
> > +	mpol_get(current->mempolicy);
> > +	nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
> > +	if (!nodes_allowed) {
> > +		printk(KERN_WARNING "%s unable to allocate nodes allowed mask "
> > +			"for huge page allocation.\nFalling back to default.\n",
> > +			current->comm);
> > +		goto out;
> > +	}
> > +	nodes_clear(*nodes_allowed);
> > +
> > +	mempolicy = current->mempolicy;
> > +	switch (mempolicy->mode) {
> > +	case MPOL_PREFERRED:
> > +		if (mempolicy->flags & MPOL_F_LOCAL)
> > +			nid = numa_node_id();
> > +		else
> > +			nid = mempolicy->v.preferred_node;
> > +		node_set(nid, *nodes_allowed);
> > +		break;
> > +
> > +	case MPOL_BIND:
> > +		/* Fall through */
> > +	case MPOL_INTERLEAVE:
> > +		*nodes_allowed = mempolicy->v.nodes;
> > +		break;
> > +
> > +	default:
> > +		BUG();
> > +	}
> > +
> > +out:
> > +	mpol_put(current->mempolicy);
> > +	return nodes_allowed;
> > +}
> >  #endif
> >  
> >  /* Allocate a page in interleaved policy.
> > Index: linux-2.6.31-rc1-mmotm-090625-1549/include/linux/mempolicy.h
> > ===================================================================
> > --- linux-2.6.31-rc1-mmotm-090625-1549.orig/include/linux/mempolicy.h	2009-07-06 13:05:23.000000000 -0400
> > +++ linux-2.6.31-rc1-mmotm-090625-1549/include/linux/mempolicy.h	2009-07-07 09:58:32.000000000 -0400
> > @@ -201,6 +201,7 @@ extern void mpol_fix_fork_child_flag(str
> >  extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
> >  				unsigned long addr, gfp_t gfp_flags,
> >  				struct mempolicy **mpol, nodemask_t **nodemask);
> > +extern nodemask_t *huge_mpol_nodes_allowed(void);
> >  extern unsigned slab_node(struct mempolicy *policy);
> >  
> >  extern enum zone_type policy_zone;
> > @@ -328,6 +329,8 @@ static inline struct zonelist *huge_zone
> >  	return node_zonelist(0, gfp_flags);
> >  }
> >  
> > +static inline nodemask_t *huge_mpol_nodes_allowed(void) { return NULL; }
> > +
> >  static inline int do_migrate_pages(struct mm_struct *mm,
> >  			const nodemask_t *from_nodes,
> >  			const nodemask_t *to_nodes, int flags)
> > 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab