Re: [patch for-5.3 0/4] revert immediate fallback to remote hugepages

David Rientjes <rientjes@xxxxxxxxxx> · Thu, 24 Oct 2019 11:59:43 -0700 (PDT)

On Wed, 23 Oct 2019, Vlastimil Babka wrote:

> From 8bd960e4e8e7e99fe13baf0d00b61910b3ae8d23 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <vbabka@xxxxxxx>
> Date: Tue, 1 Oct 2019 14:20:58 +0200
> Subject: [PATCH] mm, thp: tweak reclaim/compaction effort of local-only and
>  all-node allocations
> 
> THP page faults now attempt a __GFP_THISNODE allocation first, which should
> only compact existing free memory, followed by another attempt that can
> allocate from any node using reclaim/compaction effort specified by global
> defrag setting and madvise.
> 
> This patch makes the following changes to the scheme:
> 
> - before the patch, the first allocation relies on a check for pageblock order
>   and __GFP_IO to prevent excessive reclaim. However, this also affects the
>   second attempt, which is not limited to a single node. Instead, reuse
>   the existing check for costly order __GFP_NORETRY allocations, and make sure
>   the first THP attempt uses __GFP_NORETRY. As a side-effect, all costly order
>   __GFP_NORETRY allocations will bail out if compaction needs reclaim, while
>   previously they only bailed out when compaction was deferred due to previous
>   failures. This should still be acceptable within the __GFP_NORETRY semantics.
> 
> - before the patch, the second allocation attempt (on all nodes) was passing
>   __GFP_NORETRY. This is redundant as the check for pageblock order (discussed
>   above) was stronger. It's also contrary to madvise(MADV_HUGEPAGE) which means
>   some effort to allocate THP is requested. After this patch, the second
>   attempt passes neither __GFP_THISNODE nor __GFP_NORETRY.
> 
> To sum up, THP page faults now make the following attempts, in order:
> 
> 1. local node only THP allocation with no reclaim, just compaction.
> 2. THP allocation from any node with effort determined by global defrag setting
>    and VMA madvise
> 3. fallback to base pages on any node
> 
> Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
> ---
>  mm/mempolicy.c  | 16 +++++++++-------
>  mm/page_alloc.c | 24 +++++-------------------
>  2 files changed, 14 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 4ae967bcf954..2c48146f3ee2 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2129,18 +2129,20 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		nmask = policy_nodemask(gfp, pol);
>  		if (!nmask || node_isset(hpage_node, *nmask)) {
>  			mpol_cond_put(pol);
> +			/*
> +			 * First, try to allocate THP only on local node, but
> +			 * don't reclaim unnecessarily, just compact.
> +			 */
>  			page = __alloc_pages_node(hpage_node,
> -						gfp | __GFP_THISNODE, order);
> +				gfp | __GFP_THISNODE | __GFP_NORETRY, order);
>  
>  			/*
> -			 * If hugepage allocations are configured to always
> -			 * synchronous compact or the vma has been madvised
> -			 * to prefer hugepage backing, retry allowing remote
> -			 * memory as well.
> +			 * If that fails, allow both compaction and reclaim,
> +			 * but on all nodes.
>  			 */
> -			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
> +			if (!page)
>  				page = __alloc_pages_node(hpage_node,
> -						gfp | __GFP_NORETRY, order);
> +								gfp, order);
>  
>  			goto out;
>  		}

Hi Vlastimil,

For the default case, where thp enabled is not set to "always" and the VMA
is not madvised with MADV_HUGEPAGE, how does this prefer returning node-local
pages over remote hugepages?  The idea is to optimize for access latency
when the VMA has not been explicitly madvised.
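
To make the question concrete, here is a minimal userspace sketch of the
fallback order as I read the mempolicy.c hunk.  It is a toy model, not kernel
code: the enum, the struct and both function names are hypothetical, and
"local_thp_ok" collapses everything the first attempt does (free page or
successful compaction) into a single flag.

```c
#include <stdbool.h>

/* Hypothetical outcome labels for this toy model; not kernel identifiers. */
enum fault_result { LOCAL_THP, REMOTE_THP, BASE_PAGES };

/* Hypothetical snapshot of memory state at fault time. */
struct node_state {
	bool local_thp_ok;     /* local node can satisfy a THP without reclaim */
	bool remote_thp_free;  /* some remote node has a free hugepage */
};

/*
 * Pre-patch order: the second, any-node attempt only ran when the gfp mask
 * allowed direct reclaim (defrag set to always compact, or a madvised VMA).
 */
static enum fault_result prepatch_fault(struct node_state s, bool direct_reclaim)
{
	if (s.local_thp_ok)
		return LOCAL_THP;
	if (direct_reclaim && s.remote_thp_free)
		return REMOTE_THP;
	return BASE_PAGES;	/* base pages, local node preferred */
}

/*
 * Patched order as I read the diff: the second attempt is now guarded only
 * by "if (!page)", so it runs for every fault and may use any node.
 */
static enum fault_result patched_fault(struct node_state s)
{
	if (s.local_thp_ok)
		return LOCAL_THP;
	if (s.remote_thp_free)
		return REMOTE_THP;
	return BASE_PAGES;
}
```

Under these assumptions, a fault with no local hugepage, a free remote
hugepage and no direct reclaim ends in BASE_PAGES in the first function and
REMOTE_THP in the second, which is the case I am asking about.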

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ecc3dbad606b..36d7d852f7b1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4473,8 +4473,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		if (page)
>  			goto got_pg;
>  
> -		 if (order >= pageblock_order && (gfp_mask & __GFP_IO) &&
> -		     !(gfp_mask & __GFP_RETRY_MAYFAIL)) {
> +		/*
> +		 * Checks for costly allocations with __GFP_NORETRY, which
> +		 * includes some THP page fault allocations
> +		 */
> +		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
>  			/*
>  			 * If allocating entire pageblock(s) and compaction
>  			 * failed because all zones are below low watermarks
> @@ -4495,23 +4498,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  			if (compact_result == COMPACT_SKIPPED ||
>  			    compact_result == COMPACT_DEFERRED)
>  				goto nopage;
> -		}
> -
> -		/*
> -		 * Checks for costly allocations with __GFP_NORETRY, which
> -		 * includes THP page fault allocations
> -		 */
> -		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
> -			/*
> -			 * If compaction is deferred for high-order allocations,
> -			 * it is because sync compaction recently failed. If
> -			 * this is the case and the caller requested a THP
> -			 * allocation, we do not want to heavily disrupt the
> -			 * system, so we fail the allocation instead of entering
> -			 * direct reclaim.
> -			 */
> -			if (compact_result == COMPACT_DEFERRED)
> -				goto nopage;
>  
>  			/*
>  			 * Looks like reclaim/compaction is worth trying, but
> -- 
> 2.23.0
> 
>
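
For the page_alloc.c hunk, the early bail-out condition before and after the
patch can be sketched as two predicates.  This only models the lines shown in
the diff, not the rest of __alloc_pages_slowpath(); the struct, the enum and
the function names are hypothetical stand-ins for the gfp bits and compaction
results involved.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the compaction results named in the diff. */
enum compact_model { MODEL_SKIPPED, MODEL_DEFERRED, MODEL_OTHER };

/* Hypothetical flattened view of the request properties the checks test. */
struct alloc_req {
	bool costly_order;        /* order > PAGE_ALLOC_COSTLY_ORDER */
	bool pageblock_or_larger; /* order >= pageblock_order */
	bool gfp_io;              /* __GFP_IO set */
	bool gfp_noretry;         /* __GFP_NORETRY set */
	bool gfp_retry_mayfail;   /* __GFP_RETRY_MAYFAIL set */
};

/* Before the patch: two separate checks, each with its own goto nopage. */
static bool bails_before(struct alloc_req r, enum compact_model c)
{
	if (r.pageblock_or_larger && r.gfp_io && !r.gfp_retry_mayfail &&
	    (c == MODEL_SKIPPED || c == MODEL_DEFERRED))
		return true;
	if (r.costly_order && r.gfp_noretry && c == MODEL_DEFERRED)
		return true;
	return false;
}

/* After the patch: a single check keyed on costly order and __GFP_NORETRY. */
static bool bails_after(struct alloc_req r, enum compact_model c)
{
	return r.costly_order && r.gfp_noretry &&
	       (c == MODEL_SKIPPED || c == MODEL_DEFERRED);
}
```

In this model a pageblock-order request without __GFP_NORETRY stops bailing
out on skipped compaction, while any costly __GFP_NORETRY request now does,
which matches the side effect described in the changelog.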