On Tue, 1 Oct 2019, Vlastimil Babka wrote:

> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 4ae967bcf954..2c48146f3ee2 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2129,18 +2129,20 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		nmask = policy_nodemask(gfp, pol);
>  		if (!nmask || node_isset(hpage_node, *nmask)) {
>  			mpol_cond_put(pol);
> +			/*
> +			 * First, try to allocate THP only on local node, but
> +			 * don't reclaim unnecessarily, just compact.
> +			 */
>  			page = __alloc_pages_node(hpage_node,
> -				gfp | __GFP_THISNODE, order);
> +				gfp | __GFP_THISNODE | __GFP_NORETRY, order);

The page allocator has heuristics to determine when compaction should be
retried, when reclaim should be retried, and when the allocation itself
should be retried for high-order allocations (see PAGE_ALLOC_COSTLY_ORDER).
PAGE_ALLOC_COSTLY_ORDER exists solely to avoid poor allocator behavior when
reclaim itself is unlikely to make that amount of contiguous memory
available, or would be too disruptive in doing so.

Rather than papering over the poor feedback loop between compaction and
reclaim that exists in the page allocator, it's probably better to improve
that and determine when an allocation should fail or when it's worthwhile
to retry.  That's a separate topic from NUMA locality of thp.

In other words, we should likely address how compaction and reclaim are
done for all high-order allocations in the page allocator itself, rather
than only here for hugepage allocations and relying on specific gfp bits
to be set.  Ask: if the allocation here should not retry regardless of why
compaction failed, why should any high-order allocation (user or kernel)
retry if compaction failed, and at what order should we just fail?  If
hugetlb wants to stress this to the fullest extent possible, it already
appropriately uses __GFP_RETRY_MAYFAIL.

The code here is saying we care more about NUMA locality than hugepages
simply because that's where the access latency is better, and that policy
is specific to hugepages; allocation behavior for high-order pages in
general needs to live in the page allocator.

> 
>  			/*
> -			 * If hugepage allocations are configured to always
> -			 * synchronous compact or the vma has been madvised
> -			 * to prefer hugepage backing, retry allowing remote
> -			 * memory as well.
> +			 * If that fails, allow both compaction and reclaim,
> +			 * but on all nodes.
>  			 */
> -			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
> +			if (!page)
>  				page = __alloc_pages_node(hpage_node,
> -					gfp | __GFP_NORETRY, order);
> +					gfp, order);
> 
>  			goto out;
>  		}

We certainly don't want this for non-MADV_HUGEPAGE regions when the thp
enabled bit is not set to "always".  We'd much rather fall back to local
native pages because of their improved access latency.  This is allowing
all hugepage allocations to be remote even without MADV_HUGEPAGE, which is
not even what Andrea needs: qemu uses MADV_HUGEPAGE.
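
For reference, the retry heuristics I'm referring to boil down to
something like the sketch below.  This is a standalone userspace model,
not the actual mm/page_alloc.c code: the flag values and the
should_retry_alloc() helper are made up for illustration.  The decision
it models (costly orders give up unless the caller passed
__GFP_RETRY_MAYFAIL, and __GFP_NORETRY bails out after the first attempt)
is the behavior the slowpath already implements for every high-order
caller.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative values only; the real definitions live in the kernel. */
#define PAGE_ALLOC_COSTLY_ORDER	3
#define __GFP_NORETRY		(1u << 0)
#define __GFP_RETRY_MAYFAIL	(1u << 1)

/*
 * Rough model of the slowpath's retry decision for high-order
 * allocations: __GFP_NORETRY gives up after the first attempt, and
 * costly orders are not retried unless the caller explicitly opted
 * in with __GFP_RETRY_MAYFAIL.
 */
static bool should_retry_alloc(unsigned int order, unsigned int gfp_mask)
{
	bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;

	if (gfp_mask & __GFP_NORETRY)
		return false;

	if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
		return false;

	return true;
}

int main(void)
{
	/* A THP-sized allocation (order 9) with default bits: no retry. */
	printf("order 9, default:       %d\n", should_retry_alloc(9, 0));
	/* A hugetlb-style caller opting in to heavier effort: retry. */
	printf("order 9, retry_mayfail: %d\n",
	       should_retry_alloc(9, __GFP_RETRY_MAYFAIL));
	return 0;
}

If that decision is wrong for thp, it's wrong for every other costly-order
caller too, which is why I think the fix belongs in the slowpath rather
than in alloc_pages_vma().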