On Thu, May 23, 2024 at 01:53:37PM +0100, Karim Manaouil wrote:
> On Thu, May 23, 2024 at 02:14:06PM +0900, Byungchul Park wrote:
> > I hit a problem where kswapd stopped in the following scenario:
> >
> >    CONFIG_NUMA_BALANCING enabled
> >    sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
> >    numa node0 (500GB local DRAM, 128 CPUs)
> >    numa node1 (100GB CXL memory, no CPUs)
> >    swap off
> >
> >    1) Run any workload using a lot of anon pages e.g. mmap(200GB).
> >    2) Keep adding another workload using a lot of anon pages.
> >    3) The DRAM becomes filled with only anon pages through promotion.
> >    4) Demotion barely works due to severe memory pressure.
> >    5) kswapd for node0 stops because of the unreclaimable anon pages.
>
> It's not very clear to me, but if I understand correctly, if you have

I don't have free memory on CXL.

> free memory on CXL, kswapd0 should not stop as long as demotion is

kswapd0 stops because demotion barely works.

> successfully migrating the pages from DRAM to CXL and returns that as
> nr_reclaimed in shrink_folio_list()?
>
> If that's the case, kswapd0 is making progress and shouldn't give up.

It's not the case.

> If CXL memory is also filled and migration fails, then it doesn't make
> sense to me to wake up kswapd0 as it obviously won't help with anything,

It's true *only* while it won't help with anything. However, kswapd
should work again once the system gets back to normal e.g. by
terminating the anon hoggers. I addressed this issue.

> because, you guessed it, you have no memory in the first place!!
>
> >    6) Manually kill the memory hoggers.

This is the point.

	Byungchul

> >    7) kswapd is still stopped even though the system got back to normal.
> >
> > From then on, the system has to run without the background reclaim
> > service of kswapd until a direct reclaim happens to revive it. Even
> > worse, the tiering mechanism can no longer work because it relies on
> > kswapd, which has stopped.
> >
> > However, after 6), the DRAM will be filled with pages that might or
> > might not be reclaimable, depending on how they are going to be used.
> > Since they are potentially reclaimable anyway, it's worth trying
> > reclaim again by allowing kswapd to work if needed.
> >
> > Signed-off-by: Byungchul Park <byungchul@xxxxxx>
> > ---
> >  include/linux/mmzone.h | 4 ++++
> >  mm/page_alloc.c        | 12 ++++++++++++
> >  mm/vmscan.c            | 21 ++++++++++++++++-----
> >  3 files changed, 32 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index c11b7cde81ef..7c0ba90ea7b4 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -1331,6 +1331,10 @@ typedef struct pglist_data {
> >  	enum zone_type kswapd_highest_zoneidx;
> >
> >  	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
> > +	int nr_may_reclaimable;		/* Number of pages that have been
> > +					   allocated since the node was
> > +					   considered hopeless due to too
> > +					   many kswapd_failures. */
> >
> >  #ifdef CONFIG_COMPACTION
> >  	int kcompactd_max_order;
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 14d39f34d336..1dd2daede014 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1538,8 +1538,20 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
> >  static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
> >  							unsigned int alloc_flags)
> >  {
> > +	pg_data_t *pgdat = page_pgdat(page);
> > +
> >  	post_alloc_hook(page, order, gfp_flags);
> >
> > +	/*
> > +	 * New pages might or might not be reclaimable depending on how
> > +	 * they are going to be used. However, since they are
> > +	 * potentially reclaimable, it's worth trying reclaim again by
> > +	 * allowing kswapd to work even after too many
> > +	 * ->kswapd_failures, once ->nr_may_reclaimable is big enough.
> > +	 */
> > +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> > +		pgdat->nr_may_reclaimable += 1 << order;
> > +
> >  	if (order && (gfp_flags & __GFP_COMP))
> >  		prep_compound_page(page, order);
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 3ef654addd44..5b39090c4ef1 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4943,6 +4943,7 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
> >  done:
> >  	/* kswapd should never fail */
> >  	pgdat->kswapd_failures = 0;
> > +	pgdat->nr_may_reclaimable = 0;
> >  }
> >
> >  /******************************************************************************
> > @@ -5991,9 +5992,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >  	 * sleep. On reclaim progress, reset the failure counter. A
> >  	 * successful direct reclaim run will revive a dormant kswapd.
> >  	 */
> > -	if (reclaimable)
> > +	if (reclaimable) {
> >  		pgdat->kswapd_failures = 0;
> > -	else if (sc->cache_trim_mode)
> > +		pgdat->nr_may_reclaimable = 0;
> > +	} else if (sc->cache_trim_mode)
> >  		sc->cache_trim_mode_failed = 1;
> >  }
> >
> > @@ -6636,6 +6638,11 @@ static void clear_pgdat_congested(pg_data_t *pgdat)
> >  	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
> >  }
> >
> > +static bool may_reclaimable(pg_data_t *pgdat, int order)
> > +{
> > +	return pgdat->nr_may_reclaimable >= 1 << order;
> > +}
> > +
> >  /*
> >   * Prepare kswapd for sleeping. This verifies that there are no processes
> >   * waiting in throttle_direct_reclaim() and that watermarks have been met.
> > @@ -6662,7 +6669,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order,
> >  		wake_up_all(&pgdat->pfmemalloc_wait);
> >
> >  	/* Hopeless node, leave it to direct reclaim */
> > -	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> > +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES &&
> > +	    !may_reclaimable(pgdat, order))
> >  		return true;
> >
> >  	if (pgdat_balanced(pgdat, order, highest_zoneidx)) {
> > @@ -6940,8 +6948,10 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
> >  		goto restart;
> >  	}
> >
> > -	if (!sc.nr_reclaimed)
> > +	if (!sc.nr_reclaimed) {
> >  		pgdat->kswapd_failures++;
> > +		pgdat->nr_may_reclaimable = 0;
> > +	}
> >
> >  out:
> >  	clear_reclaim_active(pgdat, highest_zoneidx);
> > @@ -7204,7 +7214,8 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
> >  		return;
> >
> >  	/* Hopeless node, leave it to direct reclaim if possible */
> > -	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ||
> > +	if ((pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES &&
> > +	     !may_reclaimable(pgdat, order)) ||
> >  	    (pgdat_balanced(pgdat, order, highest_zoneidx) &&
> >  	     !pgdat_watermark_boosted(pgdat, highest_zoneidx))) {
> >  		/*
> > --
> > 2.17.1
> >
> >