On 2/15/22 15:51, Mel Gorman wrote:
> When a PCP is mostly used for frees then high-order pages can exist on PCP
> lists for some time. This is problematic when the allocation pattern is all
> allocations from one CPU and all frees from another resulting in colder
> pages being used. When bulk freeing pages, limit the number of high-order
> pages that are stored on the PCP lists.
>
> Netperf running on localhost exhibits this pattern and while it does
> not matter for some machines, it does matter for others with smaller
> caches where cache misses cause problems due to reduced page reuse.
> Pages freed directly to the buddy list may be reused quickly while still
> cache hot whereas storing on the PCP lists may be cold by the time
> free_pcppages_bulk() is called.
>
> Using perf kmem:mm_page_alloc, the 5 most used page frames were
>
> 5.17-rc3
>    13041 pfn=0x111a30
>    13081 pfn=0x5814d0
>    13097 pfn=0x108258
>    13121 pfn=0x689598
>    13128 pfn=0x5814d8
>
> 5.17-revert-highpcp
>   192009 pfn=0x54c140
>   195426 pfn=0x1081d0
>   200908 pfn=0x61c808
>   243515 pfn=0xa9dc20
>   402523 pfn=0x222bb8
>
> 5.17-full-series
>   142693 pfn=0x346208
>   162227 pfn=0x13bf08
>   166413 pfn=0x2711e0
>   166950 pfn=0x2702f8
>
> The spread is wider as there is still time before pages freed to one
> PCP get released with a tradeoff between fast reuse and reduced zone
> lock acquisition.
>
> From the machine used to gather the traces, the headline performance
> was equivalent.
>
> netperf-tcp
>                       5.17.0-rc3            5.17.0-rc3            5.17.0-rc3
>                          vanilla mm-reverthighpcp-v1r1 mm-highpcplimit-v1r12
> Hmean     64      839.93 (  0.00%)      840.77 (  0.10%)      835.34 * -0.55%*
> Hmean    128     1614.22 (  0.00%)     1622.07 *  0.49%*     1604.18 * -0.62%*
> Hmean    256     2952.00 (  0.00%)     2953.19 (  0.04%)     2959.46 (  0.25%)
> Hmean   1024    10291.67 (  0.00%)    10239.17 ( -0.51%)    10287.05 ( -0.04%)
> Hmean   2048    17335.08 (  0.00%)    17399.97 (  0.37%)    17125.73 * -1.21%*
> Hmean   3312    22628.15 (  0.00%)    22471.97 ( -0.69%)    22414.24 * -0.95%*
> Hmean   4096    25009.50 (  0.00%)    24752.83 * -1.03%*    24620.03 * -1.56%*
> Hmean   8192    32745.01 (  0.00%)    31682.63 * -3.24%*    32475.31 ( -0.82%)
> Hmean  16384    39759.59 (  0.00%)    36805.78 * -7.43%*    39291.42 ( -1.18%)
>
> From a 1-socket Skylake machine with a small CPU cache that suffers
> more if cache misses are too high
>
> netperf-tcp
>                       5.17.0-rc3            5.17.0-rc3            5.17.0-rc3
>                          vanilla   mm-reverthighpcp-v1    mm-highpcplimit-v1
> Min       64      935.38 (  0.00%)      939.40 (  0.43%)      940.11 (  0.51%)
> Min      128     1831.69 (  0.00%)     1856.15 (  1.34%)     1849.30 (  0.96%)
> Min      256     3560.61 (  0.00%)     3659.25 (  2.77%)     3654.12 (  2.63%)
> Min     1024    13165.24 (  0.00%)    13444.74 (  2.12%)    13281.71 (  0.88%)
> Min     2048    22706.44 (  0.00%)    23219.67 (  2.26%)    23027.31 (  1.41%)
> Min     3312    30960.26 (  0.00%)    31985.01 (  3.31%)    31484.40 (  1.69%)
> Min     4096    35149.03 (  0.00%)    35997.44 (  2.41%)    35891.92 (  2.11%)
> Min     8192    48064.73 (  0.00%)    49574.05 (  3.14%)    48928.89 (  1.80%)
> Min    16384    58017.25 (  0.00%)    60352.93 (  4.03%)    60691.14 (  4.61%)
> Hmean     64      938.95 (  0.00%)      941.50 *  0.27%*      940.47 (  0.16%)
> Hmean    128     1843.10 (  0.00%)     1857.58 *  0.79%*     1855.83 *  0.69%*
> Hmean    256     3573.07 (  0.00%)     3667.45 *  2.64%*     3662.08 *  2.49%*
> Hmean   1024    13206.52 (  0.00%)    13487.80 *  2.13%*    13351.11 *  1.09%*
> Hmean   2048    22870.23 (  0.00%)    23337.96 *  2.05%*    23149.68 *  1.22%*
> Hmean   3312    31001.99 (  0.00%)    32206.50 *  3.89%*    31849.40 *  2.73%*
> Hmean   4096    35364.59 (  0.00%)    36490.96 *  3.19%*    36112.91 *  2.12%*
> Hmean   8192    48497.71 (  0.00%)    49954.05 *  3.00%*    49384.50 *  1.83%*
> Hmean  16384    58410.86 (  0.00%)    60839.80 *  4.16%*    61362.12 *  5.05%*
>
> Note that this was
> a machine that did not benefit from caching high-order
> pages and performance is almost restored with the series applied. It's not
> fully restored as cache misses are still higher. This is a trade-off
> between optimising for a workload that does all allocs on one CPU and frees
> on another or more general workloads that need high-order pages for SLUB
> and benefit from avoiding zone->lock for every SLUB refill/drain.
>
> Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>

Reviewed-by: Vlastimil Babka <vbabka@xxxxxxx>

> ---
>  mm/page_alloc.c | 26 +++++++++++++++++++++-----
>  1 file changed, 21 insertions(+), 5 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6881175b27df..cfb3cbad152c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3314,10 +3314,15 @@ static bool free_unref_page_prepare(struct page *page, unsigned long pfn,
>          return true;
>  }
>
> -static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch)
> +static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch,
> +                       bool free_high)
>  {
>          int min_nr_free, max_nr_free;
>
> +        /* Free everything if batch freeing high-order pages. */
> +        if (unlikely(free_high))
> +                return pcp->count;
> +
>          /* Check for PCP disabled or boot pageset */
>          if (unlikely(high < batch))
>                  return 1;
> @@ -3338,11 +3343,12 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch)
>          return batch;
>  }
>
> -static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone)
> +static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
> +                       bool free_high)
>  {
>          int high = READ_ONCE(pcp->high);
>
> -        if (unlikely(!high))
> +        if (unlikely(!high || free_high))
>                  return 0;
>
>          if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
> @@ -3362,17 +3368,27 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn,
>          struct per_cpu_pages *pcp;
>          int high;
>          int pindex;
> +        bool free_high;
>
>          __count_vm_event(PGFREE);
>          pcp = this_cpu_ptr(zone->per_cpu_pageset);
>          pindex = order_to_pindex(migratetype, order);
>          list_add(&page->lru, &pcp->lists[pindex]);
>          pcp->count += 1 << order;
> -        high = nr_pcp_high(pcp, zone);
> +
> +        /*
> +         * As high-order pages other than THP's stored on PCP can contribute
> +         * to fragmentation, limit the number stored when PCP is heavily
> +         * freeing without allocation. The remainder after bulk freeing
> +         * stops will be drained from vmstat refresh context.
> +         */
> +        free_high = (pcp->free_factor && order && order <= PAGE_ALLOC_COSTLY_ORDER);
> +
> +        high = nr_pcp_high(pcp, zone, free_high);
>          if (pcp->count >= high) {
>                  int batch = READ_ONCE(pcp->batch);
>
> -                free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch), pcp, pindex);
> +                free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch, free_high), pcp, pindex);
>          }
>  }
>
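For anyone who wants to poke at the decision flow outside the kernel tree, here is a
minimal userspace sketch of what the patch adds. The struct, the sketch_* helper names
and the example numbers are simplified assumptions rather than the real per_cpu_pages
machinery, and nr_pcp_free()'s partial-drain scaling is reduced to the branches visible
in the hunks above:

/*
 * Minimal userspace sketch of the drain decision added by the patch.
 * The struct and values are simplified assumptions, not the real
 * per_cpu_pages; the partial-drain maths of nr_pcp_free() is reduced
 * to the branches visible in the diff.
 */
#include <stdbool.h>
#include <stdio.h>

#define PAGE_ALLOC_COSTLY_ORDER 3

struct pcp_sketch {
        int count;              /* pages currently on the PCP lists */
        int high;               /* normal threshold before draining */
        int batch;              /* normal drain quantum */
        int free_factor;        /* non-zero when frees dominate allocs */
};

/* Mirrors nr_pcp_high(): free_high drops the threshold to zero. */
static int sketch_nr_pcp_high(struct pcp_sketch *pcp, bool free_high)
{
        if (pcp->high == 0 || free_high)
                return 0;
        return pcp->high;
}

/* Mirrors nr_pcp_free(): free_high drains the entire PCP count. */
static int sketch_nr_pcp_free(struct pcp_sketch *pcp, bool free_high)
{
        if (free_high)
                return pcp->count;
        if (pcp->high < pcp->batch)     /* PCP disabled or boot pageset */
                return 1;
        return pcp->batch;
}

int main(void)
{
        struct pcp_sketch pcp = { .count = 96, .high = 128, .batch = 32,
                                  .free_factor = 1 };
        int order = 2;  /* high-order but not costly, e.g. a SLUB slab */

        /* The trigger computed in free_unref_page_commit(). */
        bool free_high = pcp.free_factor && order &&
                         order <= PAGE_ALLOC_COSTLY_ORDER;

        if (pcp.count >= sketch_nr_pcp_high(&pcp, free_high))
                printf("drain %d of %d pages back to the buddy lists\n",
                       sketch_nr_pcp_free(&pcp, free_high), pcp.count);
        else
                printf("keep all %d pages on the PCP lists\n", pcp.count);

        return 0;
}

With free_high set, the high threshold collapses to zero and the whole PCP count is
handed to free_pcppages_bulk(), so a CPU that mostly frees stops accumulating
high-order pages instead of trimming them one batch at a time.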