On 1/8/19 12:16 PM, Mel Gorman wrote: > On Tue, Dec 11, 2018 at 03:29:39PM +0100, Vlastimil Babka wrote: >> Commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and >> madvised allocations") intended to make THP faults in MADV_HUGEPAGE areas more >> successful for processes that indicate that they are willing to pay a higher >> initial setup cost for long-term THP benefits. In the current page allocator >> implementation this means that the allocations will try to use reclaim and the >> more costly sync compaction mode, in case the initial direct async compaction >> fails. >> > > First off, I'm going to have trouble reasoning about this because there > is also my own series that reduces compaction failure rates. If that > series get approved, then it's far more likely that when compaction > fails that we really mean it and retries from the page allocator context > may be pointless unless the caller indicates it should really retry for > a long time. Understood. > The corner case I have in mind is a compaction failure on a very small > percentage of remaining pageblocks that are currently under writeback. > > It also means that the merit of this series needs to account for whether > it's before or after the compaction series as the impact will be > different. FWIW, I had the same problem with evaluating the compaction > series on the context of __GFP_THISNODE vs !__GFP_THISNODE Right. In that case I think for mainline, making compaction better has priority over trying to compensate for it. The question is if somebody wants to fix stable/older distro kernels. Now that it wasn't possible to remove the __GFP_THISNODE for THP's, I thought this might be an alternative acceptable to anyone, provided that it works. Backporting your compaction series would be much more difficult I guess. Of course distro kernels can also divert from mainline and go with the __GFP_THISNODE removal privately. >> However, THP faults also include __GFP_THISNODE, which, combined with direct >> reclaim, can result in a node-reclaim-like local node thrashing behavior, as >> reported by Andrea [1]. >> >> While this patch is not a full fix, the first step is to restore __GFP_NORETRY >> for madvised THP faults. The expected downside are potentially worse THP >> fault success rates for the madvised areas, which will have to then rely more >> on khugepaged. For khugepaged, __GFP_NORETRY is not restored, as its activity >> should be limited enough by sleeping to cause noticeable thrashing. >> >> Note that alloc_new_node_page() and new_page() is probably another candidate as >> they handle the migrate_pages(2), resp. mbind(2) syscall, which can thus allow >> unprivileged node-reclaim-like behavior. >> >> The patch also updates the comments in alloc_hugepage_direct_gfpmask() because >> elsewhere compaction during page fault is called direct compaction, and >> 'synchronous' refers to the migration mode, which is not used for THP faults. >> >> [1] https://lkml.kernel.org/m/20180820032204.9591-1-aarcange@xxxxxxxxxx >> >> Reported-by: Andrea Arcangeli <aarcange@xxxxxxxxxx> >> Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx> >> Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx> >> Cc: David Rientjes <rientjes@xxxxxxxxxx> >> Cc: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> >> Cc: Michal Hocko <mhocko@xxxxxxxxxx> >> --- >> mm/huge_memory.c | 13 ++++++------- >> 1 file changed, 6 insertions(+), 7 deletions(-) >> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> index 5da55b38b1b7..c442b12b060c 100644 >> --- a/mm/huge_memory.c >> +++ b/mm/huge_memory.c >> @@ -633,24 +633,23 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma) >> { >> const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE); >> >> - /* Always do synchronous compaction */ >> + /* Always try direct compaction */ >> if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags)) >> - return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY); >> + return GFP_TRANSHUGE | __GFP_NORETRY; >> > > While I get that you want to reduce thrashing, the configuration item > really indicates the system (not just the caller, but everyone) is willing > to take a hit to get a THP. Yeah some hit in exchange for THP's, but probably not an overreclaim due to __GFP_THISNODE implications. >> /* Kick kcompactd and fail quickly */ >> if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags)) >> return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM; >> >> - /* Synchronous compaction if madvised, otherwise kick kcompactd */ >> + /* Direct compaction if madvised, otherwise kick kcompactd */ >> if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags)) >> return GFP_TRANSHUGE_LIGHT | >> - (vma_madvised ? __GFP_DIRECT_RECLAIM : >> + (vma_madvised ? (__GFP_DIRECT_RECLAIM | __GFP_NORETRY): >> __GFP_KSWAPD_RECLAIM); >> > > Similar note. > >> - /* Only do synchronous compaction if madvised */ >> + /* Only do direct compaction if madvised */ >> if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags)) >> - return GFP_TRANSHUGE_LIGHT | >> - (vma_madvised ? __GFP_DIRECT_RECLAIM : 0); >> + return vma_madvised ? (GFP_TRANSHUGE | __GFP_NORETRY) : GFP_TRANSHUGE_LIGHT; >> > > Similar note. > > That said, if this series went in before the compaction series then I > would think it's ok and I would Ack it. However, *after* the compaction > series, I think it would make more sense to rip out or severely reduce the > complex should_compact_retry logic and move towards "if compaction returns, > the page allocator should take its word for it except in extreme cases". OK. > Compaction previously was a bit too casual about partially scanning > backblocks and the setting/clearing of pageblock bits. The retry logic > sortof dealt with it by making reset/clear cycles very frequent but > after the series, they are relatively rare[1]. > > [1] Says he bravely before proven otherwise >