+ gfp_thisnode-for-the-slab-allocator-v2.patch added to -mm tree

The patch titled

     GFP_THISNODE for the slab allocator

has been added to the -mm tree.  Its filename is

     gfp_thisnode-for-the-slab-allocator-v2.patch

See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
out what to do about this.

------------------------------------------------------
Subject: GFP_THISNODE for the slab allocator
From: Christoph Lameter <clameter@xxxxxxx>

This patch ensures that, in the NUMA case, the slab node lists only contain
slabs that belong to that specific node.  All slab allocations use
GFP_THISNODE when calling into the page allocator.  If an allocation fails,
we fall back within the slab allocator according to the zonelists
appropriate for the current context.
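
For example, a node-specific allocation like the one sketched below
(illustrative only; alloc_on_node is not part of this patch) is now served
strictly from the indicated node first, with any fallback handled inside
the slab allocator:

	/*
	 * Illustrative helper, not part of the patch: allocate a buffer
	 * on a specific NUMA node. With this patch the slab serves the
	 * request from node nid's own lists and pages first; only when
	 * nid is exhausted does fallback_alloc() hand back cpuset-allowed
	 * memory from another node.
	 */
	static void *alloc_on_node(size_t size, int nid)
	{
		return kmalloc_node(size, GFP_KERNEL, nid);
	}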

This replicates the behavior of alloc_pages and alloc_pages_node in the
slab layer.

Currently, allocations requested from the page allocator may be redirected
via cpusets to other nodes.  This results in remote pages on the nodelists,
which in turn causes interrupt latency issues during cache draining.
Moreover, the slab hands out memory as local when it is really remote.

Fallback for slab memory allocations now occurs within the slab allocator
rather than in the page allocator.  This is necessary so that the existing
pools of objects on the nodes we fall back to can be used before adding
more pages to a slab.
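
Concretely, the order of attempts in __cache_alloc_node() after this patch
is roughly the following (a condensed outline of the hunks below, not
literal code):

	/*
	 * 1. Take an object from the target node's own slab lists.
	 * 2. If those are empty, cache_grow() on that node; the new
	 *    pages are pinned to the node by GFP_THISNODE.
	 * 3. If growing failed and the caller did not pass
	 *    __GFP_THISNODE, call fallback_alloc(), which retries
	 *    steps 1-2 on each cpuset-allowed node in zonelist order.
	 */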

The fallback function ensures that the nodes we fall back to obey the
cpuset restrictions of the current context.  Unlike before, we no longer
allocate objects from outside the current cpuset context.

Note that implementing locality constraints within the slab allocator
requires importing logic from the page allocator.  This is a mishmash that
is not ideal.  Other allocators (the uncached allocator, vmalloc, huge
pages) face similar problems and carry similar minimal reimplementations
of the page allocator's basic fallback logic.  There is another way of
implementing a slab that avoids per-node lists (see the modular slab), but
this won't work within the existing slab.

V1->V2:
- Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA
- Exploit GFP_THISNODE being 0 in the non-NUMA case to avoid another
  #ifdef (see the definitions sketched below)
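
For reference, the two definitions this relies on come from companion
patches in this series (add-numa_build-definition-in-kernelh-to-avoid-ifdef.patch,
define-easier-to-handle-gfp_thisnode.patch and
disable-gfp_thisnode-in-the-non-numa-case.patch) and look roughly like
this:

	#ifdef CONFIG_NUMA
	#define NUMA_BUILD	1
	#define GFP_THISNODE	(__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
	#else
	#define NUMA_BUILD	0
	#define GFP_THISNODE	((__force gfp_t)0)	/* ORing it in is a no-op */
	#endif

Since NUMA_BUILD is a compile-time constant, a test such as
"if (NUMA_BUILD && !objp)" is eliminated entirely by the compiler on
!CONFIG_NUMA kernels, without any #ifdef in the code itself.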

Signed-off-by: Christoph Lameter <clameter@xxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxx>
---

 mm/slab.c |   67 +++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 59 insertions(+), 8 deletions(-)

diff -puN mm/slab.c~gfp_thisnode-for-the-slab-allocator-v2 mm/slab.c
--- a/mm/slab.c~gfp_thisnode-for-the-slab-allocator-v2
+++ a/mm/slab.c
@@ -1564,7 +1564,13 @@ static void *kmem_getpages(struct kmem_c
 	 */
 	flags |= __GFP_COMP;
 #endif
-	flags |= cachep->gfpflags;
+
+	/*
+	 * Under NUMA we want memory on the indicated node. We will handle
+	 * the needed fallback ourselves since we want to serve from our
+	 * per node object lists first for other nodes.
+	 */
+	flags |= cachep->gfpflags | GFP_THISNODE;
 
 	page = alloc_pages_node(nodeid, flags, cachep->gfporder);
 	if (!page)
@@ -3053,16 +3059,23 @@ static __always_inline void *__cache_all
 
 	local_irq_save(save_flags);
 
-#ifdef CONFIG_NUMA
-	if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY))) {
+	if (unlikely(NUMA_BUILD &&
+			(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))) {
 		objp = alternate_node_alloc(cachep, flags);
 		if (objp != NULL)
 			goto out;
 	}
-#endif
 
 	objp = ____cache_alloc(cachep, flags);
 out:
+
+	/*
+	 * We may just have run out of memory on the local node.
+	 * __cache_alloc_node knows how to locate memory on other nodes.
+	 */
+	if (NUMA_BUILD && !objp)
+		objp = __cache_alloc_node(cachep, flags, numa_node_id());
+
 	local_irq_restore(save_flags);
 	objp = cache_alloc_debugcheck_after(cachep, flags, objp,
 					    caller);
@@ -3081,7 +3094,7 @@ static void *alternate_node_alloc(struct
 {
 	int nid_alloc, nid_here;
 
-	if (in_interrupt())
+	if (in_interrupt() || (flags & __GFP_THISNODE))
 		return NULL;
 	nid_alloc = nid_here = numa_node_id();
 	if (cpuset_do_slab_mem_spread() && (cachep->flags & SLAB_MEM_SPREAD))
@@ -3094,6 +3107,28 @@ static void *alternate_node_alloc(struct
 }
 
 /*
+ * Fallback function if there was no memory available and no objects on a
+ * certain node and we are allowed to fall back. We mimic the behavior of
+ * the page allocator. We fall back according to a zonelist determined by
+ * the policy layer while obeying cpuset constraints.
+ */
+void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
+{
+	struct zonelist *zonelist = &NODE_DATA(slab_node(current->mempolicy))
+					->node_zonelists[gfp_zone(flags)];
+	struct zone **z;
+	void *obj = NULL;
+
+	for (z = zonelist->zones; *z && !obj; z++)
+		if (zone_idx(*z) <= ZONE_NORMAL &&
+				cpuset_zone_allowed(*z, flags))
+			obj = __cache_alloc_node(cache,
+					flags | __GFP_THISNODE,
+					zone_to_nid(*z));
+	return obj;
+}
+
+/*
  * An interface to enable slab creation on nodeid
  */
 static void *__cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
@@ -3146,14 +3181,30 @@ retry:
 must_grow:
 	spin_unlock(&l3->list_lock);
 	x = cache_grow(cachep, flags, nodeid);
+	if (x)
+		goto retry;
 
-	if (!x)
-		return NULL;
+	if (!(flags & __GFP_THISNODE))
+		/* Unable to grow the cache. Fall back to other nodes. */
+		return fallback_alloc(cachep, flags);
+
+	return NULL;
 
-	goto retry;
 done:
 	return obj;
 }
+#else
+static inline void *__cache_alloc_node(struct kmem_cache *cachep,
+		 gfp_t flags, int nodeid)
+{
+	return NULL;
+}
+
+static inline void *alternate_node_alloc(struct kmem_cache *cachep,
+		gfp_t flags)
+{
+	return NULL;
+}
 #endif
 
 /*
_

Patches currently in -mm which might be from clameter@xxxxxxx are

fix-longstanding-load-balancing-bug-in-the-scheduler.patch
cleanup-radix_tree_derefreplace_slot-calling-conventions-warning-fixes.patch
reduce-max_nr_zones-remove-two-strange-uses-of-max_nr_zones.patch
reduce-max_nr_zones-fix-max_nr_zones-array-initializations.patch
reduce-max_nr_zones-make-display-of-highmem-counters-conditional-on-config_highmem.patch
reduce-max_nr_zones-make-display-of-highmem-counters-conditional-on-config_highmem-tidy.patch
reduce-max_nr_zones-move-highmem-counters-into-highmemc-h.patch
reduce-max_nr_zones-move-highmem-counters-into-highmemc-h-fix.patch
reduce-max_nr_zones-page-allocator-zone_highmem-cleanup.patch
reduce-max_nr_zones-use-enum-to-define-zones-reformat-and-comment.patch
reduce-max_nr_zones-use-enum-to-define-zones-reformat-and-comment-cleanup.patch
reduce-max_nr_zones-make-zone_dma32-optional.patch
reduce-max_nr_zones-make-zone_highmem-optional.patch
reduce-max_nr_zones-make-zone_highmem-optional-fix.patch
reduce-max_nr_zones-make-zone_highmem-optional-fix-fix.patch
reduce-max_nr_zones-remove-display-of-counters-for-unconfigured-zones.patch
reduce-max_nr_zones-fix-i386-srat-check-for-max_nr_zones.patch
mempolicies-fix-policy_zone-check.patch
apply-type-enum-zone_type.patch
apply-type-enum-zone_type-fix.patch
linearly-index-zone-node_zonelists.patch
slab-respect-architecture-and-caller-mandated-alignment.patch
slab-optimize-kmalloc_node-the-same-way-as-kmalloc.patch
slab-optimize-kmalloc_node-the-same-way-as-kmalloc-fix.patch
slab-extract-__kmem_cache_destroy-from-kmem_cache_destroy.patch
slab-do-not-panic-when-alloc_kmemlist-fails-and-slab-is-up.patch
add-__gfp_thisnode-to-avoid-fallback-to-other-nodes-and-ignore.patch
add-__gfp_thisnode-to-avoid-fallback-to-other-nodes-and-ignore-fix.patch
sys_move_pages-do-not-fall-back-to-other-nodes.patch
guarantee-that-the-uncached-allocator-gets-pages-on-the-correct.patch
cleanup-add-zone-pointer-to-get_page_from_freelist.patch
profiling-require-buffer-allocation-on-the-correct-node.patch
define-easier-to-handle-gfp_thisnode.patch
optimize-free_one_page.patch
do-not-check-unpopulated-zones-for-draining-and-counter.patch
extract-the-allocpercpu-functions-from-the-slab-allocator.patch
replace-min_unmapped_ratio-by-min_unmapped_pages-in-struct-zone.patch
zvc-support-nr_slab_reclaimable--nr_slab_unreclaimable.patch
zone_reclaim-dynamic-slab-reclaim.patch
zone_reclaim-dynamic-slab-reclaim-tidy.patch
zone-reclaim-with-slab-avoid-unecessary-off-node-allocations.patch
hugepages-use-page_to_nid-rather-than-traversing-zone-pointers.patch
numa-add-zone_to_nid-function.patch
numa-add-zone_to_nid-function-update.patch
slab-fix-kmalloc_node-applying-memory-policies-if-nodeid-==-numa_node_id.patch
add-numa_build-definition-in-kernelh-to-avoid-ifdef.patch
disable-gfp_thisnode-in-the-non-numa-case.patch
gfp_thisnode-for-the-slab-allocator-v2.patch
x86-implement-always-locked-bit-ops-for-memory-shared-with-an-smp-hypervisor.patch
zvc-support-nr_slab_reclaimable--nr_slab_unreclaimable-swap_prefetch.patch
reduce-max_nr_zones-swap_prefetch-remove-incorrect-use-of-zone_highmem.patch
numa-add-zone_to_nid-function-swap_prefetch.patch
readahead-state-based-method-aging-accounting-apply-type-enum-zone_type-readahead.patch
