On 11/22/2017 03:33 PM, Johannes Weiner wrote:
> From: Vlastimil Babka <vbabka@xxxxxxx>
> 
> The goal of direct compaction is to quickly make a high-order page
> available for the pending allocation. The free page scanner can add
> significant latency when searching for migration targets, although for
> compaction to succeed, the only important limit on the target free
> pages is that they must not come from the same order-aligned block as
> the migrated pages.
> 
> This patch therefore makes direct async compaction allocate freepages
> directly from freelists. Pages that do come from the same block (which
> we cannot simply exclude from the freelist allocation) are put on a
> separate list and released only after migration to allow them to merge.
> 
> In addition to reduced stall, another advantage is that we split larger
> free pages for migration targets only when smaller pages are depleted,
> while the free scanner can split pages up to (order - 1) as it
> encounters them. However, this approach likely sacrifices some of the
> long-term anti-fragmentation features of a thorough compaction, so we
> limit the direct allocation approach to direct async compaction.
> 
> For observational purposes, the patch introduces two new counters to
> /proc/vmstat. compact_free_direct_alloc counts how many pages were
> allocated directly without scanning, and compact_free_direct_miss
> counts the subset of these allocations that were from the wrong range
> and had to be held on the separate list.
> 
> Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> ---
> 
> Hi. I'm resending this because we've been struggling with the cost of
> compaction in our fleet, and this patch helps substantially.
> 
> On 128G+ machines, we have seen isolate_freepages_block() eat up 40%
> of the CPU cycles and scan up to a billion PFNs per minute. Not in a
> spike, but continuously, to service higher-order allocations from the
> network stack, fork (non-vmap stacks), THP, etc. during regular
> operation.
> 
> I've been running this patch on a handful of less-affected but still
> pretty bad machines for a week, and the results look pretty great:
> 
> http://cmpxchg.org/compactdirectalloc/compactdirectalloc.png

Thanks a lot, that's very encouraging!

> Note the two different scales - otherwise the compact_free_direct
> lines wouldn't be visible. The free scanner peaks close to 10M pages
> checked per minute, whereas the direct allocations peak at under 180
> per minute, direct misses at 50.
> 
> The work doesn't increase over this period, which is a good sign that
> long-term we're not trending toward worse fragmentation.
> 
> There was an outstanding concern from Joonsoo regarding this patch -
> https://marc.info/?l=linux-mm&m=146035962702122&w=2 - although that
> didn't seem to affect us much in practice.

That concern would be easy to fix, but I was also concerned that if
there are multiple direct compactions in parallel, they might keep too
many free pages isolated away. Recently I resumed work on this and came
up with a different approach, where I put the pages immediately back on
the tail of the free lists. There might be some downside in more
"direct misses". Also, I no longer plan to restrict this to async
compaction, because if it's a better way, we should use it everywhere.

So here's how it looks now (only briefly tested). We could compare and
pick the better approach, or go with the older one for now and
potentially change it later.
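To make the block-exclusion rule concrete, here is a minimal userspace
sketch (not kernel code, just the arithmetic both versions rely on): a
free page is rejected as a migration target only when it sits in the
pageblock the migrate scanner is currently working through. The
pageblock_order value of 9 is an assumption for illustration; the real
value is architecture- and config-dependent.

#include <stdio.h>
#include <stdbool.h>

#define PAGEBLOCK_ORDER 9	/* assumption for illustration only */

/*
 * Pageblock index the migration scanner is currently working in;
 * migrate_pfn points just past the last scanned page, hence the -1.
 */
static unsigned long excluded_block(unsigned long migrate_pfn)
{
	return (migrate_pfn - 1) >> PAGEBLOCK_ORDER;
}

/* Would this free page be skipped as a migration target? */
static bool same_block(unsigned long free_pfn, unsigned long migrate_pfn)
{
	return (free_pfn >> PAGEBLOCK_ORDER) == excluded_block(migrate_pfn);
}

int main(void)
{
	unsigned long migrate_pfn = 0x12345;	/* arbitrary example PFN */
	unsigned long pfns[] = { 0x12200, 0x123ff, 0x12400, 0x11fff };

	for (unsigned long i = 0; i < sizeof(pfns) / sizeof(pfns[0]); i++)
		printf("free pfn 0x%lx: %s\n", pfns[i],
		       same_block(pfns[i], migrate_pfn) ?
		       "skipped (same pageblock)" : "usable target");
	return 0;
}

With migrate_pfn 0x12345, only PFNs in block 0x91 (0x12200-0x123ff) are
skipped; everything else can be taken straight off the free lists.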
----8<----
>From d092c708893823f041004c927b755c6d212a1710 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@xxxxxxx>
Date: Wed, 4 Oct 2017 13:23:56 +0200
Subject: [PATCH] good bye free scanner

---
 include/linux/vm_event_item.h |  1 +
 mm/compaction.c               | 10 ++++--
 mm/internal.h                 |  2 ++
 mm/page_alloc.c               | 71 +++++++++++++++++++++++++++++++++++++++++++
 mm/vmstat.c                   |  2 ++
 5 files changed, 84 insertions(+), 2 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index d77bc35278b0..528d8f946907 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -54,6 +54,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #endif
 #ifdef CONFIG_COMPACTION
 		COMPACTMIGRATE_SCANNED, COMPACTFREE_SCANNED,
+		COMPACTFREE_LIST_ALLOC, COMPACTFREE_LIST_SKIP,
 		COMPACTISOLATED,
 		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
 		KCOMPACTD_WAKE,
diff --git a/mm/compaction.c b/mm/compaction.c
index b557aac09e92..585740d1ebe5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1169,14 +1169,20 @@ static struct page *compaction_alloc(struct page *migratepage,
 {
 	struct compact_control *cc = (struct compact_control *)data;
 	struct page *freepage;
+	int queued;
 
 	/*
 	 * Isolate free pages if necessary, and if we are not aborting due to
 	 * contention.
 	 */
 	if (list_empty(&cc->freepages)) {
-		if (!cc->contended)
-			isolate_freepages(cc);
+		if (!cc->contended) {
+			queued = alloc_pages_compact(cc->zone, &cc->freepages,
+					cc->nr_migratepages,
+					(cc->migrate_pfn - 1) >> pageblock_order);
+			cc->nr_freepages += queued;
+			map_pages(&cc->freepages);
+		}
 
 		if (list_empty(&cc->freepages))
 			return NULL;
diff --git a/mm/internal.h b/mm/internal.h
index 3e5dc95dc259..ec2be107e0b4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -161,6 +161,8 @@ static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
 }
 
 extern int __isolate_free_page(struct page *page, unsigned int order);
+extern int alloc_pages_compact(struct zone *zone, struct list_head *list,
+				int pages, unsigned long pageblock_exclude);
 extern void __free_pages_bootmem(struct page *page, unsigned long pfn,
 					unsigned int order);
 extern void prep_compound_page(struct page *page, unsigned int order);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f43039945148..8a0ee03dafb5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2380,6 +2380,77 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	return alloced;
 }
 
+static
+int __rmqueue_compact(struct zone *zone, struct list_head *list, int pages,
+				unsigned long pageblock_exclude)
+{
+	unsigned int order;
+	struct page *page, *next;
+	int mtype;
+	int fallback;
+	struct list_head * free_list;
+	LIST_HEAD(skip_list);
+	int queued_pages = 0;
+
+	for (order = 0; order < MAX_ORDER; ++order) {
+		for (mtype = MIGRATE_MOVABLE, fallback = 0;
+		     mtype != MIGRATE_TYPES;
+		     mtype = fallbacks[MIGRATE_MOVABLE][fallback++]) {
+
+			free_list = &zone->free_area[order].free_list[mtype];
+			list_for_each_entry_safe(page, next, free_list, lru) {
+				if (page_to_pfn(page) >> pageblock_order
+							== pageblock_exclude) {
+					list_move(&page->lru, &skip_list);
+					count_vm_event(COMPACTFREE_LIST_SKIP);
+					continue;
+				}
+
+
+				list_move(&page->lru, list);
+				zone->free_area[order].nr_free--;
+				rmv_page_order(page);
+				set_page_private(page, order);
+
+				__mod_zone_freepage_state(zone, -(1UL << order),
+					get_pageblock_migratetype(page));
+
+				queued_pages += 1 << order;
+				if (queued_pages >= pages)
+					break;
+			}
+			/*
+			 * Put skipped pages at the end of free list so we are
+			 * less likely to encounter them again.
+			 */
+			list_splice_tail_init(&skip_list, free_list);
+		}
+	}
+	count_vm_events(COMPACTFREE_LIST_ALLOC, queued_pages);
+	count_vm_events(COMPACTISOLATED, queued_pages);
+	return queued_pages;
+}
+
+int alloc_pages_compact(struct zone *zone, struct list_head *list, int pages,
+				unsigned long pageblock_exclude)
+{
+	unsigned long flags;
+	unsigned long watermark;
+	int queued_pages;
+
+	watermark = low_wmark_pages(zone) + pages;
+	if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
+		return 0;
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	queued_pages = __rmqueue_compact(zone, list, pages, pageblock_exclude);
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	return queued_pages;
+}
+
 #ifdef CONFIG_NUMA
 /*
  * Called from the vmstat counter updater to drain pagesets of this
diff --git a/mm/vmstat.c b/mm/vmstat.c
index e0593434fd58..765758d52539 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1222,6 +1222,8 @@ const char * const vmstat_text[] = {
 #ifdef CONFIG_COMPACTION
 	"compact_migrate_scanned",
 	"compact_free_scanned",
+	"compact_free_list_alloc",
+	"compact_free_list_skip",
 	"compact_isolated",
 	"compact_stall",
 	"compact_fail",
-- 
2.15.0
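For watching either version in production, here is a minimal userspace
sketch (an illustrative helper, not part of the patch) that picks the
relevant counters out of /proc/vmstat. The compact_free_list_* names
only exist once this patch is applied; the earlier patch exposes
compact_free_direct_alloc and compact_free_direct_miss instead.

#include <stdio.h>
#include <string.h>

int main(void)
{
	static const char * const keys[] = {
		"compact_migrate_scanned",
		"compact_free_scanned",
		"compact_free_list_alloc",	/* added by this patch */
		"compact_free_list_skip",	/* added by this patch */
		"compact_isolated",
	};
	char name[64];
	unsigned long long val;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 1;
	/* /proc/vmstat is "name value" per line; print the ones we care about */
	while (fscanf(f, "%63s %llu", name, &val) == 2)
		for (int i = 0; i < 5; i++)
			if (!strcmp(name, keys[i]))
				printf("%-28s %llu\n", keys[i], val);
	fclose(f);
	return 0;
}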