On 10/1/19 10:31 PM, David Rientjes wrote:
> On Tue, 1 Oct 2019, Vlastimil Babka wrote:
>
>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> index 4ae967bcf954..2c48146f3ee2 100644
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -2129,18 +2129,20 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>>  		nmask = policy_nodemask(gfp, pol);
>>  		if (!nmask || node_isset(hpage_node, *nmask)) {
>>  			mpol_cond_put(pol);
>> +			/*
>> +			 * First, try to allocate THP only on local node, but
>> +			 * don't reclaim unnecessarily, just compact.
>> +			 */
>>  			page = __alloc_pages_node(hpage_node,
>> -				gfp | __GFP_THISNODE, order);
>> +				gfp | __GFP_THISNODE | __GFP_NORETRY, order);
>
> The page allocator has heuristics to determine when compaction should be
> retried, reclaim should be retried, and the allocation itself should retry
> for high-order allocations (see PAGE_ALLOC_COSTLY_ORDER).

Yes, and flags like __GFP_NORETRY and __GFP_RETRY_MAYFAIL affect these
heuristics according to the caller's willingness to wait vs. the
availability of some kind of fallback.

> PAGE_ALLOC_COSTLY_ORDER exists solely to avoid poor allocator behavior
> when reclaim itself is unlikely -- or disruptive enough -- in making that
> amount of contiguous memory available.

The __GFP_NORETRY check for costly orders is one of the implementation
details of that avoidance of poor allocator behavior.

> Rather than papering over the poor feedback loop between compaction and
> reclaim that exists in the page allocator, it's probably better to improve
> that and determine when an allocation should fail or it's worthwhile to
> retry.  That's a separate topic from NUMA locality of thp.

I don't disagree, but we are doing that 'papering over' already, and I
simply believe it can be done better, until we can drop it. Right now at
least the hugetlb allocations are regressing, as Michal showed.
> In other words, we should likely address how compaction and reclaim is
> done for all high order-allocations in the page allocator itself rather
> than only here for hugepage allocations and relying on specific gfp bits
> to be set.

Well, with your patches, decisions are made based on pageblock order
(affecting also hugetlbfs) and a specific __GFP_IO bit. I don't see how
using __GFP_NORETRY together with costly order is worse - both are
concepts already used to guide reclaim/compaction decisions - and it
won't affect hugetlbfs. Also, with some THP defrag modes, __GFP_NORETRY
is already being used.

> Ask: if the allocation here should not retry regardless of why
> compaction failed, why should any high-order allocation (user or kernel)
> retry if compaction failed and at what order we should just fail?

If that's referring to the quoted code above, the allocation here
should not retry because it would lead to reclaiming just the local
node - effectively a node-reclaim mode - leading to the problems Andrea
and Mel reported. __GFP_THISNODE makes this case rather specific, so we
shouldn't conclude anything for other kinds of allocations, as you're
asking.

> If
> hugetlb wants to stress this to the fullest extent possible, it already
> appropriately uses __GFP_RETRY_MAYFAIL.

Which doesn't work right now, and should work again after this patch.

> The code here is saying we care more about NUMA locality than hugepages

That's why there's __GFP_THISNODE, yes.

> simply because that's where the access latency is better and is specific
> to hugepages; allocation behavior for high-order pages needs to live in
> the page allocator.

I'd rather add __GFP_NORETRY to __GFP_THISNODE here to influence the
allocation behavior than reintroduce any __GFP_THISNODE checks in the
page allocator. There are not that many __GFP_THISNODE users, but there
are plenty of __GFP_NORETRY users, so it seems better to adjust
behavior based on the latter flag.
IIRC, in the recent past you argued for putting __GFP_NORETRY back into
all THP allocations, including madvised ones, so why is there a problem
now?

>>
>>  			/*
>> -			 * If hugepage allocations are configured to always
>> -			 * synchronous compact or the vma has been madvised
>> -			 * to prefer hugepage backing, retry allowing remote
>> -			 * memory as well.
>> +			 * If that fails, allow both compaction and reclaim,
>> +			 * but on all nodes.
>>  			 */
>> -			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
>> +			if (!page)
>>  				page = __alloc_pages_node(hpage_node,
>> -						gfp | __GFP_NORETRY, order);

Also, this __GFP_NORETRY was added by your patch, so I really don't see
why you think it's wrong in the first allocation attempt.

>> +						gfp, order);
>>
>>  			goto out;
>>  		}
>
> We certainly don't want this for non-MADV_HUGEPAGE regions when thp
> enabled bit is not set to "always".  We'd much rather fallback to local
> native pages because of its improved access latency.  This is allowing all
> hugepage allocations to be remote even without MADV_HUGEPAGE which is not
> even what Andrea needs: qemu uses MADV_HUGEPAGE.

OTOH, it's better to use a remote THP than a remote base page when the
local node is below its watermark, but that would probably require
larger surgery on the allocator.

What you're saying is: keep the gfp & __GFP_DIRECT_RECLAIM condition. I
still suggest dropping the __GFP_NORETRY from the fallback attempt, as
that goes against MADV_HUGEPAGE.