[PATCH 2/3] hugetlb: derive huge pages nodes allowed from task mempolicy

Lee Schermerhorn <lee.schermerhorn@xxxxxx> · Wed, 08 Jul 2009 15:24:46 -0400

[PATCH 2/3] hugetlb:  derive huge pages nodes allowed from task mempolicy

Against: 25jun09 mmotm atop the "hugetlb: balance freeing..." series

V2:
+ cleaned up comments, removed some deemed unnecessary,
  add some suggested by review
+ removed check for !current in huge_mpol_nodes_allowed().
+ added 'current->comm' to warning message in huge_mpol_nodes_allowed().
+ added VM_BUG_ON() assertion in hugetlb.c next_node_allowed() to
  catch out of range node id.
+ add examples to patch description

This patch derives a "nodes_allowed" node mask from the numa
mempolicy of the task modifying the number of persistent huge
pages to control the allocation, freeing and adjusting of surplus
huge pages.  This mask is derived as follows:

* For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
  is produced.  This will cause the hugetlb subsystem to use
  node_online_map as the "nodes_allowed".  This preserves the
  behavior before this patch.
* For "preferred" mempolicy, including explicit local allocation,
  a nodemask with the single preferred node will be produced. 
  "local" policy will NOT track any internode migrations of the
  task adjusting nr_hugepages.
* For "bind" and "interleave" policy, the mempolicy's nodemask
  will be used.
* Other than to inform the construction of the nodes_allowed node
  mask, the actual mempolicy mode is ignored.  That is, all modes
  behave like interleave over the resulting nodes_allowed mask
  with no "fallback".

Because we may have allocated or freed a huge page with a 
different policy/nodes_allowed previously, we always need to
check that the next_node_to_{alloc|free} exists in the current
nodes_allowed mask.  To avoid duplication of code, this is done
in the hstate_next_node_to_{alloc|free}() functions.  So,
these functions have been modified to allow them to be called
to obtain the "start_nid".  Then, whereas prior to this patch
we unconditionally called hstate_next_node_to_{alloc|free}(),
whether or not we successfully allocated/freed a huge page on
the node, now we only call these functions on failure to alloc/free.

Notes:

1) This patch introduces a subtle change in behavior:  huge page
   allocation and freeing will be constrained by any mempolicy
   that the task adjusting the huge page pool inherits from its
   parent.  This policy could come from a distant ancestor.  The
   adminstrator adjusting the huge page pool without explicitly
   specifying a mempolicy via numactl might be surprised by this.
   Additionaly, any mempolicy specified by numactl will be
   constrained by the cpuset in which numactl is invoked.

2) Hugepages allocated at boot time use the node_online_map.
   An additional patch could implement a temporary boot time
   huge pages nodes_allowed command line parameter.

3) Using mempolicy to control persistent huge page allocation
   and freeing requires no change to hugeadm when invoking
   it via numactl, as shown in the examples below.  However,
   hugeadm could be enhanced to take the allowed nodes as an
   argument and set its task mempolicy itself.  This would allow
   it to detect and warn about any non-default mempolicy that it
   inherited from its parent, thus alleviating the issue described
   in Note 1 above.

See the updated documentation [next patch] for more information
about the implications of this patch.

Examples:

Starting with:

	Node 0 HugePages_Total:     0
	Node 1 HugePages_Total:     0
	Node 2 HugePages_Total:     0
	Node 3 HugePages_Total:     0

Default behavior [with or without this patch] balances persistent
hugepage allocation across nodes [with sufficient contiguous memory]:

	hugeadm --pool-pages-min=2048Kb:32

yields:

	Node 0 HugePages_Total:     8
	Node 1 HugePages_Total:     8
	Node 2 HugePages_Total:     8
	Node 3 HugePages_Total:     8

Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
'--membind' because it allows multiple nodes to be specified
and it's easy to type]--we can allocate huge pages on
individual nodes or sets of nodes.  So, starting from the 
condition above, with 8 huge pages per node:

	numactl -m 2 hugeadm --pool-pages-min=2048Kb:+8

yields:

	Node 0 HugePages_Total:     8
	Node 1 HugePages_Total:     8
	Node 2 HugePages_Total:    16
	Node 3 HugePages_Total:     8

The incremental 8 huge pages were restricted to node 2 by the
specified mempolicy.

Similarly, we can use mempolicy to free persistent huge pages
from specified nodes:

	numactl -m 0,1 hugeadm --pool-pages-min=2048Kb:-8

yields:

	Node 0 HugePages_Total:     4
	Node 1 HugePages_Total:     4
	Node 2 HugePages_Total:    16
	Node 3 HugePages_Total:     8

The 8 huge pages freed were balanced over nodes 0 and 1.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@xxxxxx>

 include/linux/mempolicy.h |    3 +
 mm/hugetlb.c              |  101 +++++++++++++++++++++++++++++++---------------
 mm/mempolicy.c            |   61 +++++++++++++++++++++++++++
 3 files changed, 133 insertions(+), 32 deletions(-)

Index: linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c
===================================================================

--- linux-2.6.31-rc1-mmotm-090625-1549.orig/mm/hugetlb.c	2009-07-07 09:58:17.000000000 -0400
+++ linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c	2009-07-07 13:25:41.000000000 -0400
@@ -621,29 +621,54 @@ static struct page *alloc_fresh_huge_pag
 }
 
 /*
+ * common helper functions for hstate_next_node_to_{alloc|free}.
+ * We may have allocated or freed a huge pages based on a different
+ * nodes_allowed, previously, so h->next_node_to_{alloc|free} might
+ * be outside of *nodes_allowed.  Ensure that we use the next
+ * allowed node for alloc or free.
+ */
+static int next_node_allowed(int nid, nodemask_t *nodes_allowed)
+{
+	nid = next_node(nid, *nodes_allowed);
+	if (nid == MAX_NUMNODES)
+		nid = first_node(*nodes_allowed);
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+
+	return nid;
+}
+
+static int this_node_allowed(int nid, nodemask_t *nodes_allowed)
+{
+	if (!node_isset(nid, *nodes_allowed))
+		nid = next_node_allowed(nid, nodes_allowed);
+	return nid;
+}
+
+/*
  * Use a helper variable to find the next node and then
  * copy it back to next_nid_to_alloc afterwards:
  * otherwise there's a window in which a racer might
  * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node.
  * But we don't need to use a spin_lock here: it really
  * doesn't matter if occasionally a racer chooses the
- * same nid as we do.  Move nid forward in the mask even
- * if we just successfully allocated a hugepage so that
- * the next caller gets hugepages on the next node.
+ * same nid as we do.  Move nid forward in the mask whether
+ * or not we just successfully allocated a hugepage so that
+ * the next allocation addresses the next node.
  */
 static int hstate_next_node_to_alloc(struct hstate *h,
 					nodemask_t *nodes_allowed)
 {
-	int next_nid;
+	int nid, next_nid;
 
 	if (!nodes_allowed)
 		nodes_allowed = &node_online_map;
 
-	next_nid = next_node(h->next_nid_to_alloc, *nodes_allowed);
-	if (next_nid == MAX_NUMNODES)
-		next_nid = first_node(*nodes_allowed);
+	nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed);
+
+	next_nid = next_node_allowed(nid, nodes_allowed);
 	h->next_nid_to_alloc = next_nid;
-	return next_nid;
+
+	return nid;
 }
 
 static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
@@ -653,15 +678,17 @@ static int alloc_fresh_huge_page(struct 
 	int next_nid;
 	int ret = 0;
 
-	start_nid = h->next_nid_to_alloc;
+	start_nid = hstate_next_node_to_alloc(h, nodes_allowed);
 	next_nid = start_nid;
 
 	do {
 		page = alloc_fresh_huge_page_node(h, next_nid);
-		if (page)
+		if (page) {
 			ret = 1;
+			break;
+		}
 		next_nid = hstate_next_node_to_alloc(h, nodes_allowed);
-	} while (!page && next_nid != start_nid);
+	} while (next_nid != start_nid);
 
 	if (ret)
 		count_vm_event(HTLB_BUDDY_PGALLOC);
@@ -672,21 +699,23 @@ static int alloc_fresh_huge_page(struct 
 }
 
 /*
- * helper for free_pool_huge_page() - find next node
- * from which to free a huge page
+ * helper for free_pool_huge_page() - return the next node
+ * from which to free a huge page.  Advance the next node id
+ * whether or not we find a free huge page to free so that the
+ * next attempt to free addresses the next node.
  */
 static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
 {
-	int next_nid;
+	int nid, next_nid;
 
 	if (!nodes_allowed)
 		nodes_allowed = &node_online_map;
 
-	next_nid = next_node(h->next_nid_to_free, *nodes_allowed);
-	if (next_nid == MAX_NUMNODES)
-		next_nid = first_node(*nodes_allowed);
+	nid = this_node_allowed(h->next_nid_to_free, nodes_allowed);
+	next_nid = next_node_allowed(nid, nodes_allowed);
 	h->next_nid_to_free = next_nid;
-	return next_nid;
+
+	return nid;
 }
 
 /*
@@ -702,7 +731,7 @@ static int free_pool_huge_page(struct hs
 	int next_nid;
 	int ret = 0;
 
-	start_nid = h->next_nid_to_free;
+	start_nid = hstate_next_node_to_free(h, nodes_allowed);
 	next_nid = start_nid;
 
 	do {
@@ -724,9 +753,10 @@ static int free_pool_huge_page(struct hs
 			}
 			update_and_free_page(h, page);
 			ret = 1;
+			break;
 		}
  		next_nid = hstate_next_node_to_free(h, nodes_allowed);
-	} while (!ret && next_nid != start_nid);
+	} while (next_nid != start_nid);
 
 	return ret;
 }
@@ -1037,10 +1067,9 @@ int __weak alloc_bootmem_huge_page(struc
 		void *addr;
 
 		addr = __alloc_bootmem_node_nopanic(
-				NODE_DATA(h->next_nid_to_alloc),
+				NODE_DATA(hstate_next_node_to_alloc(h, NULL)),
 				huge_page_size(h), huge_page_size(h), 0);
 
-		hstate_next_node_to_alloc(h, NULL); /* always advance nid */
 		if (addr) {
 			/*
 			 * Use the beginning of the huge page to store the
@@ -1177,29 +1206,33 @@ static int adjust_pool_surplus(struct hs
 	VM_BUG_ON(delta != -1 && delta != 1);
 
 	if (delta < 0)
-		start_nid = h->next_nid_to_alloc;
+		start_nid = hstate_next_node_to_alloc(h, nodes_allowed);
 	else
-		start_nid = h->next_nid_to_free;
+		start_nid = hstate_next_node_to_free(h, nodes_allowed);
 	next_nid = start_nid;
 
 	do {
 		int nid = next_nid;
 		if (delta < 0)  {
-			next_nid = hstate_next_node_to_alloc(h, nodes_allowed);
 			/*
 			 * To shrink on this node, there must be a surplus page
 			 */
-			if (!h->surplus_huge_pages_node[nid])
+			if (!h->surplus_huge_pages_node[nid]) {
+				next_nid = hstate_next_node_to_alloc(h,
+								nodes_allowed);
 				continue;
+			}
 		}
 		if (delta > 0) {
-			next_nid = hstate_next_node_to_free(h, nodes_allowed);
 			/*
 			 * Surplus cannot exceed the total number of pages
 			 */
 			if (h->surplus_huge_pages_node[nid] >=
-						h->nr_huge_pages_node[nid])
+						h->nr_huge_pages_node[nid]) {
+				next_nid = hstate_next_node_to_free(h,
+								nodes_allowed);
 				continue;
+			}
 		}
 
 		h->surplus_huge_pages += delta;
@@ -1215,10 +1248,13 @@ static int adjust_pool_surplus(struct hs
 static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
 {
 	unsigned long min_count, ret;
+	nodemask_t *nodes_allowed;
 
 	if (h->order >= MAX_ORDER)
 		return h->max_huge_pages;
 
+	nodes_allowed = huge_mpol_nodes_allowed();
+
 	/*
 	 * Increase the pool size
 	 * First take pages out of surplus state.  Then make up the
@@ -1232,7 +1268,7 @@ static unsigned long set_max_huge_pages(
 	 */
 	spin_lock(&hugetlb_lock);
 	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
-		if (!adjust_pool_surplus(h, NULL, -1))
+		if (!adjust_pool_surplus(h, nodes_allowed, -1))
 			break;
 	}
 
@@ -1243,7 +1279,7 @@ static unsigned long set_max_huge_pages(
 		 * and reducing the surplus.
 		 */
 		spin_unlock(&hugetlb_lock);
-		ret = alloc_fresh_huge_page(h, NULL);
+		ret = alloc_fresh_huge_page(h, nodes_allowed);
 		spin_lock(&hugetlb_lock);
 		if (!ret)
 			goto out;
@@ -1269,16 +1305,17 @@ static unsigned long set_max_huge_pages(
 	min_count = max(count, min_count);
 	try_to_free_low(h, min_count);
 	while (min_count < persistent_huge_pages(h)) {
-		if (!free_pool_huge_page(h, NULL, 0))
+		if (!free_pool_huge_page(h, nodes_allowed, 0))
 			break;
 	}
 	while (count < persistent_huge_pages(h)) {
-		if (!adjust_pool_surplus(h, NULL, 1))
+		if (!adjust_pool_surplus(h, nodes_allowed, 1))
 			break;
 	}
 out:
 	ret = persistent_huge_pages(h);
 	spin_unlock(&hugetlb_lock);
+	kfree(nodes_allowed);
 	return ret;
 }
 
Index: linux-2.6.31-rc1-mmotm-090625-1549/mm/mempolicy.c
===================================================================
--- linux-2.6.31-rc1-mmotm-090625-1549.orig/mm/mempolicy.c	2009-07-07 09:46:48.000000000 -0400
+++ linux-2.6.31-rc1-mmotm-090625-1549/mm/mempolicy.c	2009-07-07 13:48:06.000000000 -0400
@@ -1544,6 +1544,67 @@ struct zonelist *huge_zonelist(struct vm
 	}
 	return zl;
 }
+
+/*
+ * huge_mpol_nodes_allowed -- mempolicy extension for huge pages.
+ *
+ * Returns a [pointer to a] nodelist based on the current task's mempolicy
+ * to constraing the allocation and freeing of persistent huge pages
+ * 'Preferred', 'local' and 'interleave' mempolicy will behave more like
+ * 'bind' policy in this context.  An attempt to allocate a persistent huge
+ * page will never "fallback" to another node inside the buddy system
+ * allocator.
+ *
+ * If the task's mempolicy is "default" [NULL], just return NULL for
+ * default behavior.  Otherwise, extract the policy nodemask for 'bind'
+ * or 'interleave' policy or construct a nodemask for 'preferred' or
+ * 'local' policy and return a pointer to a kmalloc()ed nodemask_t.
+ *
+ * N.B., it is the caller's responsibility to free a returned nodemask.
+ */
+nodemask_t *huge_mpol_nodes_allowed(void)
+{
+	nodemask_t *nodes_allowed = NULL;
+	struct mempolicy *mempolicy;
+	int nid;
+
+	if (!current->mempolicy)
+		return NULL;
+
+	mpol_get(current->mempolicy);
+	nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
+	if (!nodes_allowed) {
+		printk(KERN_WARNING "%s unable to allocate nodes allowed mask "
+			"for huge page allocation.\nFalling back to default.\n",
+			current->comm);
+		goto out;
+	}
+	nodes_clear(*nodes_allowed);
+
+	mempolicy = current->mempolicy;
+	switch(mempolicy->mode) {
+	case MPOL_PREFERRED:
+		if (mempolicy->flags & MPOL_F_LOCAL)
+			nid = numa_node_id();
+		else
+			nid = mempolicy->v.preferred_node;
+		node_set(nid, *nodes_allowed);
+		break;
+
+	case MPOL_BIND:
+		/* Fall through */
+	case MPOL_INTERLEAVE:
+		*nodes_allowed =  mempolicy->v.nodes;
+		break;
+
+	default:
+		BUG();
+	}
+
+out:
+	mpol_put(current->mempolicy);
+	return nodes_allowed;
+}
 #endif
 
 /* Allocate a page in interleaved policy.
Index: linux-2.6.31-rc1-mmotm-090625-1549/include/linux/mempolicy.h
===================================================================
--- linux-2.6.31-rc1-mmotm-090625-1549.orig/include/linux/mempolicy.h	2009-07-06 13:05:23.000000000 -0400
+++ linux-2.6.31-rc1-mmotm-090625-1549/include/linux/mempolicy.h	2009-07-07 09:58:32.000000000 -0400
@@ -201,6 +201,7 @@ extern void mpol_fix_fork_child_flag(str
 extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask);
+extern nodemask_t *huge_mpol_nodes_allowed(void);
 extern unsigned slab_node(struct mempolicy *policy);
 
 extern enum zone_type policy_zone;
@@ -328,6 +329,8 @@ static inline struct zonelist *huge_zone
 	return node_zonelist(0, gfp_flags);
 }
 
+static inline nodemask_t *huge_mpol_nodes_allowed(void) { return NULL; }
+
 static inline int do_migrate_pages(struct mm_struct *mm,
 			const nodemask_t *from_nodes,
 			const nodemask_t *to_nodes, int flags)
--
To unsubscribe from this list: send the line "unsubscribe linux-numa" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html