On Fri, 10 Sep 2021 17:11:05 -0700 Mike Kravetz wrote:
>On 9/10/21 1:20 AM, Michal Hocko wrote:
>> On Thu 09-09-21 15:45:45, Vlastimil Babka wrote:
>>>
>>> I would say it's simply should_continue_reclaim() behaving similarly to
>>> should_compact_retry(). We'll get compaction_suitable() returning
>>> COMPACT_SKIPPED because the reclaimed pages have been immediately stolen,
>>> and compaction indicates there's not enough base pages to begin with to form
>>> a high-order pages. Since the stolen pages will appear on inactive lru, it
>>> seems to be worth continuing reclaim to make enough free base pages for
>>> compaction to no longer be skipped, because "inactive_lru_pages >
>>> pages_for_compaction" is true.
>>>
>>> So, both should_continue_reclaim() and should_compact_retry() are unable to
>>> recognize that reclaimed pages are being stolen and limit the retries in
>>> that case. The scenario seems to be uncommon, otherwise we'd be getting more
>>> reports of that.
>>
>> Thanks this sounds like a viable explanation. For some reason I have
>> thought that a single reclaim round can take that much time but reading
>> again it seems like a cumulative thing.
>>
>> I do agree that we should consider the above scenario when bailing out.
>> It is simply unreasonable to reclaim so much memory without any forward
>> progress.
>
>A very simple patch to bail early eliminated the LONG stalls and
>decreased the time of a bulk allocation request.
>
>Recall, the use case is converting 1G huge pages to the corresponding
>amount of 2MB huge pages. I have been told that 50GB is not an uncommon
>amount of pages to 'convert'. So, test case is free 50GB hugetlb pages
>followed by allocate 25600 2MB hugetlb pages. I have not done enough
>runs to get anything statistically valid, but here are some ballpark
>numbers.
>
>Unmodified baseline:
>-------------------
>Unexpected number of 2MB pages after demote
>   Expected 25600, have 19944
>
>real    0m42.092s
>user    0m0.008s
>sys     0m41.467s
>
>Retries capped by patch below:
>------------------------------
>Unexpected number of 2MB pages after demote
>   Expected 25600, have 19395
>
>real    0m12.951s
>user    0m0.010s
>sys     0m12.840s
>
>
>The time of the operation is certainly better, but do note that we
>allocated 549 fewer pages. So, bailing early might cause some failures
>if we continue trying. It is all a tradeoff.

A hard one to strike, really, if every party's considerations are to be
met.

>One of the reasons for
>considering demote functionality is that the conversion would be done in
>the hugetlb code without getting the page allocator, recalim,
>compaction, etc involved.

Particularly so given that, without demote, that involvement cannot be
avoided.

Given the role stolen pages play in the stall, one experiment worth
running first is to throttle the concurrent bulk allocators, before
taking a bigger hammer to the stall itself. A rough, untested sketch:
/* Allow at most two concurrent 2MB hugetlb bulk allocators (arbitrary cap). */
static int hugetlb_2M_page_allocators;
static DECLARE_WAIT_QUEUE_HEAD(hugetlb_2M_page_allocators_wq);
static DEFINE_MUTEX(hugetlb_2M_page_allocators_mutex);

/* Try to take one of the allocator slots; return true on success. */
static bool inc_hugetlb_2M_page_allocators(void)
{
	bool rc = false;

	mutex_lock(&hugetlb_2M_page_allocators_mutex);
	if (hugetlb_2M_page_allocators < 2) {
		rc = true;
		++hugetlb_2M_page_allocators;
	}
	mutex_unlock(&hugetlb_2M_page_allocators_mutex);
	return rc;
}

/* Release the slot and wake up any waiting allocator. */
static void dec_hugetlb_2M_page_allocators(void)
{
	mutex_lock(&hugetlb_2M_page_allocators_mutex);
	--hugetlb_2M_page_allocators;
	wake_up(&hugetlb_2M_page_allocators_wq);
	mutex_unlock(&hugetlb_2M_page_allocators_mutex);
}

/*
 * Throttled replacement for the direct __alloc_pages_nodemask() call in
 * the 2MB hugetlb allocation path; gfp_mask/order/nid/nmask are whatever
 * that path already computes.
 */
static struct page *throttle_hugetlb_2M_page_allocators(gfp_t gfp_mask,
		unsigned int order, int nid, nodemask_t *nmask)
{
	struct page *page;

	/* Wait for a free slot before hitting the page allocator. */
	wait_event(hugetlb_2M_page_allocators_wq,
		   inc_hugetlb_2M_page_allocators());
	page = __alloc_pages_nodemask(gfp_mask, order, nid, nmask);
	dec_hugetlb_2M_page_allocators();
	return page;
}

>
>> A more robust way to address this problem (which is not really new)
>> would be to privatize reclaimed pages in the direct reclaim context and
>> reuse them in the compaction so that it doesn't have to just
>> pro-actively reclaim to keep watermarks good enough. A normal low order
>> requests would benefit from that as well because the reclaim couldn't be
>> stolen from them as well.
>
>That does sound interesting. Perhaps a longer term project.
>
>> Another approach would be to detect the situation and treat reclaim
>> success which immediatelly leads to compaction deferral due to
>> watermarks as a failure.
>
>I'll look into detecting and responding to this.

FWIW, a rough, untested sketch of what that detection could look like is
appended after the quoted patch below.

>
>Simple limit retries patch:
>
>diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>index eeb3a9c..16f3055 100644
>--- a/mm/page_alloc.c
>+++ b/mm/page_alloc.c
>@@ -4896,6 +4896,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> 	int no_progress_loops;
> 	unsigned int cpuset_mems_cookie;
> 	int reserve_flags;
>+	unsigned long tries = 0, max_tries = 0;
> 
> 	/*
> 	 * We also sanity check to catch abuse of atomic reserves being used by
>@@ -4904,6 +4905,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> 	if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
> 				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
> 		gfp_mask &= ~__GFP_ATOMIC;
>+	if (order > PAGE_ALLOC_COSTLY_ORDER)
>+		max_tries = 1Ul << order;
> 
> retry_cpuset:
> 	compaction_retries = 0;
>@@ -4996,6 +4999,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> 	}
> 
> retry:
>+	tries++;
> 	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
> 	if (alloc_flags & ALLOC_KSWAPD)
> 		wake_all_kswapds(order, gfp_mask, ac);
>@@ -5064,8 +5068,18 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> 	if (did_some_progress > 0 &&
> 			should_compact_retry(ac, order, alloc_flags,
> 				compact_result, &compact_priority,
>-				&compaction_retries))
>+				&compaction_retries)) {
>+		/*
>+		 * In one pathological case, pages can be stolen immediately
>+		 * after reclaimed. It looks like we are making progress, and
>+		 * compaction_retries is not incremented. This could cause
>+		 * an indefinite number of retries. Cap the number of retries
>+		 * for costly orders.
>+		 */
>+		if (max_tries && tries > max_tries)
>+			goto nopage;
> 		goto retry;
>+	}
> 
> 
> 	/* Deal with possible cpuset update races before we start OOM killing */
>diff --git a/mm/vmscan.c b/mm/vmscan.c
>index eeae2f6..519e16e 100644
>--- a/mm/vmscan.c
>+++ b/mm/vmscan.c
>@@ -2894,10 +2894,14 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> 	struct lruvec *target_lruvec;
> 	bool reclaimable = false;
> 	unsigned long file;
>+	unsigned long tries = 0, max_tries = 0;
> 
> 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
>+	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
>+		max_tries = 1UL << sc->order;
> 
> again:
>+	tries++;
> 	memset(&sc->nr, 0, sizeof(sc->nr));
> 
> 	nr_reclaimed = sc->nr_reclaimed;
>@@ -3066,9 +3070,20 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> 	wait_iff_congested(BLK_RW_ASYNC, HZ/10);
> 
> 	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
>-				    sc))
>+				    sc)) {
>+		/*
>+		 * In one pathological case, pages can be stolen immediately
>+		 * after being reclaimed. This can cause an indefinite number
>+		 * of retries. Limit the number of retries for costly orders.
>+		 */
>+		if (!current_is_kswapd() &&
>+		    max_tries && tries > max_tries &&
>+		    sc->nr_reclaimed > sc->nr_to_reclaim)
>+			goto done;
> 		goto again;
>+	}
> 
>+done:
> 	/*
> 	 * Kswapd gives up on balancing particular nodes after too
> 	 * many failures to reclaim anything from them and goes to
> 
> 
>-- 
>Mike Kravetz
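
Coming back to Michal's idea of treating reclaim "success" that still
leaves compaction skipped due to watermarks as a failure: below is a
rough, untested sketch of one way that detection could be wired into
__alloc_pages_slowpath(). The helper reclaim_progress_stolen() is made
up for illustration only; did_some_progress, compact_result,
no_progress_loops and should_reclaim_retry() are the locals and checks
already present in that function.

/*
 * Illustrative only, not an existing kernel helper: a direct reclaim
 * round that reported progress, yet still left compaction skipped for
 * lack of free base pages, most likely had its reclaimed pages stolen.
 * Treat it as if no progress had been made.
 */
static inline bool reclaim_progress_stolen(unsigned long did_some_progress,
					   enum compact_result compact_result)
{
	return did_some_progress > 0 && compact_result == COMPACT_SKIPPED;
}

	/*
	 * In __alloc_pages_slowpath(), after both direct reclaim and direct
	 * compaction have run, something along these lines would let the
	 * existing no_progress_loops/MAX_RECLAIM_RETRIES machinery terminate
	 * the loop instead of adding a separate tries counter:
	 */
	if (reclaim_progress_stolen(did_some_progress, compact_result))
		did_some_progress = 0;

	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
				 did_some_progress > 0, &no_progress_loops))
		goto retry;

Whether clearing did_some_progress is too aggressive (it also disarms
the subsequent should_compact_retry() check) is exactly the kind of
tradeoff mentioned above; a gentler variant could only bump
no_progress_loops instead.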