On Tue, 1 Oct 2019, Vlastimil Babka wrote:

> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 4ae967bcf954..2c48146f3ee2 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2129,18 +2129,20 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		nmask = policy_nodemask(gfp, pol);
>  		if (!nmask || node_isset(hpage_node, *nmask)) {
>  			mpol_cond_put(pol);
> +			/*
> +			 * First, try to allocate THP only on local node, but
> +			 * don't reclaim unnecessarily, just compact.
> +			 */
>  			page = __alloc_pages_node(hpage_node,
> -				gfp | __GFP_THISNODE, order);
> +				gfp | __GFP_THISNODE | __GFP_NORETRY, order);

The page allocator has heuristics to determine when compaction should be
retried, when reclaim should be retried, and when the allocation itself
should be retried for high-order allocations (see PAGE_ALLOC_COSTLY_ORDER).
PAGE_ALLOC_COSTLY_ORDER exists solely to avoid poor allocator behavior when
reclaim itself is unlikely to make that amount of contiguous memory
available, or would be too disruptive in doing so.

Rather than papering over the poor feedback loop between compaction and
reclaim that exists in the page allocator, it's probably better to improve
that and determine when an allocation should fail or when it's worthwhile
to retry.  That's a separate topic from NUMA locality of thp.

In other words, we should likely address how compaction and reclaim are
done for all high-order allocations in the page allocator itself, rather
than only here for hugepage allocations and relying on specific gfp bits
to be set.  Ask: if the allocation here should not retry regardless of why
compaction failed, why should any high-order allocation (user or kernel)
retry if compaction failed, and at what order should we just fail?  If
hugetlb wants to stress this to the fullest extent possible, it already
appropriately uses __GFP_RETRY_MAYFAIL.

The code here is saying we care more about NUMA locality than hugepages
simply because that's where the access latency is better, and that policy
is specific to hugepages; allocation behavior for high-order pages in
general needs to live in the page allocator.

> 
>  			/*
> -			 * If hugepage allocations are configured to always
> -			 * synchronous compact or the vma has been madvised
> -			 * to prefer hugepage backing, retry allowing remote
> -			 * memory as well.
> +			 * If that fails, allow both compaction and reclaim,
> +			 * but on all nodes.
>  			 */
> -			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
> +			if (!page)
>  				page = __alloc_pages_node(hpage_node,
> -					gfp | __GFP_NORETRY, order);
> +					gfp, order);
> 
>  			goto out;
>  		}

We certainly don't want this for non-MADV_HUGEPAGE regions when the thp
enabled bit is not set to "always".  We'd much rather fall back to local
native pages because of their improved access latency.  This is allowing
all hugepage allocations to be remote even without MADV_HUGEPAGE, which is
not even what Andrea needs: qemu uses MADV_HUGEPAGE.
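
For reference, the retry heuristics I'm referring to boil down to
something like the sketch below.  This is a standalone userspace model,
not the actual mm/page_alloc.c code: the flag values and the
should_retry_alloc() helper are made up for illustration.  The decision
it models (costly orders give up unless the caller passed
__GFP_RETRY_MAYFAIL, and __GFP_NORETRY bails out after the first attempt)
is the behavior the slowpath already implements for every high-order
caller.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative values only; the real definitions live in the kernel. */
#define PAGE_ALLOC_COSTLY_ORDER	3
#define __GFP_NORETRY		(1u << 0)
#define __GFP_RETRY_MAYFAIL	(1u << 1)

/*
 * Rough model of the slowpath's retry decision for high-order
 * allocations: __GFP_NORETRY gives up after the first attempt, and
 * costly orders are not retried unless the caller explicitly opted
 * in with __GFP_RETRY_MAYFAIL.
 */
static bool should_retry_alloc(unsigned int order, unsigned int gfp_mask)
{
	bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;

	if (gfp_mask & __GFP_NORETRY)
		return false;

	if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
		return false;

	return true;
}

int main(void)
{
	/* A THP-sized allocation (order 9) with default bits: no retry. */
	printf("order 9, default:       %d\n", should_retry_alloc(9, 0));
	/* A hugetlb-style caller opting in to heavier effort: retry. */
	printf("order 9, retry_mayfail: %d\n",
	       should_retry_alloc(9, __GFP_RETRY_MAYFAIL));
	return 0;
}

If that decision is wrong for thp, it's wrong for every other costly-order
caller too, which is why I think the fix belongs in the slowpath rather
than in alloc_pages_vma().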