The patch titled
     Subject: mm/free_pcppages_bulk: prefetch buddy while not holding lock
has been added to the -mm tree.  Its filename is
     mm-free_pcppages_bulk-prefetch-buddy-while-not-holding-lock.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-free_pcppages_bulk-prefetch-buddy-while-not-holding-lock.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-free_pcppages_bulk-prefetch-buddy-while-not-holding-lock.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Aaron Lu <aaron.lu@xxxxxxxxx>
Subject: mm/free_pcppages_bulk: prefetch buddy while not holding lock

When a page is freed back to the global pool, its buddy will be checked
to see if it's possible to do a merge.  This requires accessing the
buddy's page structure, and that access could take a long time if it's
cache cold.

This patch adds a prefetch to the to-be-freed page's buddy outside of
zone->lock, in the hope that accessing the buddy's page structure later
under zone->lock will be faster.  Since we *always* do buddy merging and
check an order-0 page's buddy to try to merge it when it goes into the
main allocator, the cacheline will always come in, i.e. the prefetched
data will never be unused.

In the meantime, there are two concerns:

1. the prefetch could potentially evict existing cachelines, especially
   the L1D cache, since it is not huge;

2. there is some additional instruction overhead, namely calculating the
   buddy pfn twice.

For 1, it's hard to say: this microbenchmark shows a good result, but
the actual benefit of this patch will be workload/CPU dependent.  For 2,
since the calculation is an XOR on two local variables, it's expected
that in many cases the cycles spent will be offset by the reduced memory
latency later.  This is especially true for NUMA machines, where
multiple CPUs are contending on zone->lock, and the most time-consuming
part under zone->lock is waiting for the 'struct page' cachelines of the
to-be-freed pages and their buddies.

Test with will-it-scale/page_fault1 full load:

kernel      Broadwell(2S)   Skylake(2S)    Broadwell(4S)   Skylake(4S)
v4.16-rc2+  9034215         7971818        13667135        15677465
patch2/3    9536374 +5.6%   8314710 +4.3%  14070408 +3.0%  16675866 +6.4%
this patch  10338868 +8.4%  8544477 +2.8%  14839808 +5.5%  17155464 +2.9%

Note: this patch's performance improvement percent is against patch2/3.
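For reference, the buddy calculation mentioned above really is just an
XOR plus a pointer adjustment.  Below is a minimal sketch of the order-0
case, assuming the stock __find_buddy_pfn() semantics
(pfn ^ (1 << order)); the helper name is illustrative only, not part of
the patch:

	/*
	 * Illustrative only: compute the order-0 buddy of @page and
	 * prefetch its 'struct page' so the cacheline is (hopefully)
	 * warm by the time merging happens under zone->lock.
	 */
	static inline void prefetch_buddy_order0(struct page *page)
	{
		unsigned long pfn = page_to_pfn(page);
		unsigned long buddy_pfn = pfn ^ 1; /* __find_buddy_pfn(pfn, 0) */

		/* the buddy's page struct sits at a fixed offset in the memmap */
		prefetch(page + (buddy_pfn - pfn));
	}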
(Changelog stolen from Dave Hansen and Mel Gorman's comments at
http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8ccd@xxxxxxxxx)

Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@xxxxxxxxx
Signed-off-by: Aaron Lu <aaron.lu@xxxxxxxxx>
Suggested-by: Ying Huang <ying.huang@xxxxxxxxx>
Reviewed-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Andi Kleen <ak@xxxxxxxxxxxxxxx>
Cc: Dave Hansen <dave.hansen@xxxxxxxxx>
Cc: David Rientjes <rientjes@xxxxxxxxxx>
Cc: Kemi Wang <kemi.wang@xxxxxxxxx>
Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxx>
Cc: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
Cc: Vlastimil Babka <vbabka@xxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/page_alloc.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff -puN mm/page_alloc.c~mm-free_pcppages_bulk-prefetch-buddy-while-not-holding-lock mm/page_alloc.c
--- a/mm/page_alloc.c~mm-free_pcppages_bulk-prefetch-buddy-while-not-holding-lock
+++ a/mm/page_alloc.c
@@ -1105,6 +1105,9 @@ static void free_pcppages_bulk(struct zo
 			batch_free = count;
 
 		do {
+			unsigned long pfn, buddy_pfn;
+			struct page *buddy;
+
 			page = list_last_entry(list, struct page, lru);
 			/* must delete to avoid corrupting pcp list */
 			list_del(&page->lru);
@@ -1114,6 +1117,18 @@ static void free_pcppages_bulk(struct zo
 				continue;
 
 			list_add_tail(&page->lru, &head);
+
+			/*
+			 * We are going to put the page back to the global
+			 * pool, prefetch its buddy to speed up later access
+			 * under zone->lock.  It is believed the overhead of
+			 * calculating buddy_pfn here can be offset by reduced
+			 * memory latency later.
+			 */
+			pfn = page_to_pfn(page);
+			buddy_pfn = __find_buddy_pfn(pfn, 0);
+			buddy = page + (buddy_pfn - pfn);
+			prefetch(buddy);
 		} while (--count && --batch_free && !list_empty(list));
_

Patches currently in -mm which might be from aaron.lu@xxxxxxxxx are

mm-free_pcppages_bulk-update-pcp-count-inside.patch
mm-free_pcppages_bulk-do-not-hold-lock-when-picking-pages-to-free.patch
mm-free_pcppages_bulk-prefetch-buddy-while-not-holding-lock.patch