When a page is freed back to the global pool, its buddy will be
checked to see if it's possible to do a merge. This requires accessing
the buddy's page structure and that access could take a long time if
it's cache cold.

This patch adds a prefetch of the to-be-freed page's buddy outside of
zone->lock, in the hope that accessing the buddy's page structure
later under zone->lock will be faster. Since we *always* do buddy
merging and check an order-0 page's buddy to try to merge it when it
goes into the main allocator, the cacheline will always come in, i.e.
the prefetched data will never be unused.

In the meantime, there is the concern that a prefetch could evict
existing cachelines. This can be true for the L1D cache, since it is
not huge. However, the prefetch instruction used is prefetchnta,
which will only store the data in L2 for "Pentium 4 and Intel Xeon
processors" according to Intel's Instruction Set Reference, so it is
not likely to cause cache pollution. Other architectures may have
this cache pollution problem though.

There is also some additional instruction overhead, namely
calculating the buddy pfn twice. Since the calculation is an XOR on
two local variables, it is expected that in many cases the cycles
spent will be offset by the reduced memory latency later. This is
especially true for NUMA machines, where multiple CPUs are contending
on zone->lock and the most time-consuming part under zone->lock is
waiting on the 'struct page' cachelines of the to-be-freed pages and
their buddies.

Test with will-it-scale/page_fault1 full load:

kernel      Broadwell(2S)   Skylake(2S)    Broadwell(4S)   Skylake(4S)
v4.15-rc4    9037332         8000124       13642741        15728686
patch1/2     9608786  +6.3%  8368915 +4.6% 14042169  +2.9% 17433559 +10.8%
this patch  10462292  +8.9%  8602889 +2.8% 14802073  +5.4% 17624575  +1.1%

Note: this patch's performance improvement percentages are measured
against patch1/2. Please also note that the actual benefit of this
patch will be workload/CPU dependent.

[changelog stolen from Dave Hansen and Mel Gorman's comments]
https://lkml.org/lkml/2018/1/24/551

Suggested-by: Ying Huang <ying.huang@xxxxxxxxx>
Signed-off-by: Aaron Lu <aaron.lu@xxxxxxxxx>
---
v2: update changelog according to Dave Hansen and Mel Gorman's
comments; add more comments in code to explain why the prefetch is
done, as requested by Mel Gorman.

 mm/page_alloc.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c9e5ded39b16..6566a4b5b124 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1138,6 +1138,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			batch_free = count;
 
 		do {
+			unsigned long pfn, buddy_pfn;
+			struct page *buddy;
+
 			page = list_last_entry(list, struct page, lru);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);
@@ -1146,6 +1149,18 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 				continue;
 
 			list_add_tail(&page->lru, &head);
+
+			/*
+			 * We are going to put the page back to the global
+			 * pool, prefetch its buddy to speed up later access
+			 * under zone->lock. It is believed the overhead of
+			 * calculating buddy_pfn here can be offset by reduced
+			 * memory latency later.
+			 */
+			pfn = page_to_pfn(page);
+			buddy_pfn = __find_buddy_pfn(pfn, 0);
+			buddy = page + (buddy_pfn - pfn);
+			prefetch(buddy);
 		} while (--count && --batch_free && !list_empty(list));
 	}
 
-- 
2.14.3
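
P.S. For readers outside mm/: the "XOR on two local variables" claim
above refers to __find_buddy_pfn(), and on x86 the generic prefetch()
is, to my understanding, emitted as prefetchnta when SSE is available.
A minimal sketch of what the pfn helper computes (the real definition
lives in mm/internal.h; not part of this patch):

static inline unsigned long
__find_buddy_pfn(unsigned long page_pfn, unsigned int order)
{
	/*
	 * Buddies differ only in bit 'order' of their pfn; for the
	 * order-0 case used by this patch, that is a single XOR
	 * flipping bit 0, hence cheap enough to compute twice.
	 */
	return page_pfn ^ (1 << order);
}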