On 9/5/19 11:00 AM, Michal Hocko wrote:
> [Ccing Mike for checking on the hugetlb side of this change]
>
> On Wed 04-09-19 12:54:22, David Rientjes wrote:
>> Memory compaction has a couple significant drawbacks as the allocation
>> order increases, specifically:
>>
>>  - isolate_freepages() is responsible for finding free pages to use as
>>    migration targets and is implemented as a linear scan of memory
>>    starting at the end of a zone,

Note that's no longer entirely true, see fast_isolate_freepages().

>>  - failing order-0 watermark checks in memory compaction does not account
>>    for how far below the watermarks the zone actually is: to enable
>>    migration, there must be *some* free memory available.  Per the above,
>>    watermarks are not always sufficient if isolate_freepages() cannot
>>    find the free memory but it could require hundreds of MBs of reclaim to
>>    even reach this threshold (read: potentially very expensive reclaim with
>>    no indication compaction can be successful), and

I doubt it's hundreds of MBs for a 2MB hugepage.

>>  - if compaction at this order has failed recently so that it does not even
>>    run as a result of deferred compaction, looping through reclaim can often
>>    be pointless.

Agreed.

>> For hugepage allocations, these are quite substantial drawbacks because
>> these are very high order allocations (order-9 on x86) and falling back to
>> doing reclaim can potentially be *very* expensive without any indication
>> that compaction would even be successful.

You seem to lump together hugetlbfs and THP here by saying "hugepage",
but these are very different things - hugetlbfs reservations are
expected to be potentially expensive.

>> Reclaim itself is unlikely to free entire pageblocks and certainly no
>> reliance should be put on it to do so in isolation (recall lumpy reclaim).
>> This means we should avoid reclaim and simply fail hugepage allocation if
>> compaction is deferred.

It is however possible that reclaim frees enough to make even a
previously deferred compaction succeed.

>> It is also not helpful to thrash a zone by doing excessive reclaim if
>> compaction may not be able to access that memory.  If order-0 watermarks
>> fail and the allocation order is sufficiently large, it is likely better
>> to fail the allocation rather than thrashing the zone.
>>
>> Signed-off-by: David Rientjes <rientjes@xxxxxxxxxx>
>> ---
>>  mm/page_alloc.c | 22 ++++++++++++++++++++++
>>  1 file changed, 22 insertions(+)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4458,6 +4458,28 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>  	if (page)
>>  		goto got_pg;
>>
>> +	if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
>> +		/*
>> +		 * If allocating entire pageblock(s) and compaction
>> +		 * failed because all zones are below low watermarks
>> +		 * or is prohibited because it recently failed at this
>> +		 * order, fail immediately.
>> +		 *
>> +		 * Reclaim is
>> +		 *  - potentially very expensive because zones are far
>> +		 *    below their low watermarks or this is part of very
>> +		 *    bursty high order allocations,
>> +		 *  - not guaranteed to help because isolate_freepages()
>> +		 *    may not iterate over freed pages as part of its
>> +		 *    linear scan, and
>> +		 *  - unlikely to make entire pageblocks free on its
>> +		 *    own.
>> +		 */
>> +		if (compact_result == COMPACT_SKIPPED ||
>> +		    compact_result == COMPACT_DEFERRED)
>> +			goto nopage;

As I said, I expect this will make hugetlbfs reservations fail
prematurely - Mike can probably confirm or disprove that.

It also addresses the consequences, not the primary problem, IMHO. I
believe the primary problem is that we reclaim something even if
there's enough free memory for compaction. That won't change with your
patch, as compact_result won't be COMPACT_SKIPPED in that case.

We then continue through to __alloc_pages_direct_reclaim() and
shrink_zones(), which calls compaction_ready(). That will only return
true, and thus skip reclaim of the zone, if there are high_watermark
(!!!) + compact_gap() free pages. But as long as one zone isn't
compaction_ready(), we enter shrink_node(), which will reclaim
something and then call should_continue_reclaim(), where we might
finally notice that compaction_suitable() returns COMPACT_CONTINUE,
and abort reclaim.

Thus I think the right solution might be to really avoid reclaim for
zones where compaction is not skipped, while your patch avoids reclaim
when compaction is skipped. The per-node reclaim vs per-zone compaction
might complicate those decisions a lot, though.
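To make that call chain concrete, the gatekeeping check is roughly the
following - a simplified sketch from my reading of compaction_ready()
in mm/vmscan.c, slightly abridged, so don't take it as the verbatim
source:

static inline bool compaction_ready(struct zone *zone,
				    struct scan_control *sc)
{
	unsigned long watermark;
	enum compact_result suitable;

	suitable = compaction_suitable(zone, sc->order, 0, sc->reclaim_idx);
	if (suitable == COMPACT_SUCCESS)
		/* The allocation should already succeed, don't reclaim. */
		return true;
	if (suitable == COMPACT_SKIPPED)
		/* Compaction cannot proceed yet, do reclaim. */
		return false;

	/*
	 * Compaction could proceed, but reclaim of this zone is only
	 * skipped once free pages reach the *high* watermark plus
	 * compact_gap() - the "(!!!)" threshold mentioned above.
	 */
	watermark = high_wmark_pages(zone) + compact_gap(sc->order);
	return zone_watermark_ok_safe(zone, 0, watermark, sc->reclaim_idx);
}

So reclaim of a zone only stops once it is above high_wmark +
compact_gap(), well past the point where compaction_suitable() stops
returning COMPACT_SKIPPED.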
>> +	}
>> +
>>  	/*
>>  	 * Checks for costly allocations with __GFP_NORETRY, which
>>  	 * includes THP page fault allocations
>