On Thu, May 23, 2024 at 01:53:37PM +0100, Karim Manaouil wrote:
> On Thu, May 23, 2024 at 02:14:06PM +0900, Byungchul Park wrote:
> > I hit a problem where kswapd stopped in the following scenario:
> >
> >    CONFIG_NUMA_BALANCING enabled
> >    sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
> >    numa node0 (500GB local DRAM, 128 CPUs)
> >    numa node1 (100GB CXL memory, no CPUs)
> >    swap off
> >
> >    1) Run any workload using a lot of anon pages e.g. mmap(200GB).
> >    2) Keep adding another workload using a lot of anon pages.
> >    3) The DRAM becomes filled with only anon pages through promotion.
> >    4) Demotion barely works due to severe memory pressure.
> >    5) kswapd for node0 stops because of the unreclaimable anon pages.
>
> It's not very clear to me, but if I understand correctly, if you have

I don't have free memory on CXL.

> free memory on CXL, kswapd0 should not stop as long as demotion is

kswapd0 stops because demotion barely works.

> successfully migrating the pages from DRAM to CXL and returns that as
> nr_reclaimed in shrink_folio_list()?
>
> If that's the case, kswapd0 is making progress and shouldn't give up.

It's not the case.

> If CXL memory is also filled and migration fails, then it doesn't make
> sense to me to wake up kswapd0 as it obviously won't help with anything,

It's true *only* while it won't help with anything. However, kswapd
should work again once the system gets back to normal e.g. by
terminating the anon hoggers. I addressed this issue.

> because, you guessed it, you have no memory in the first place!!
>
> >    6) Manually kill the memory hoggers.

This is the point.

	Byungchul

> >    7) kswapd is still stopped even though the system got back to normal.
> >
> > From then on, the system has to run without the background reclaim
> > service of kswapd until a direct reclaim happens to revive it. Even
> > worse, the tiering mechanism can no longer work because it relies on
> > kswapd, which has stopped.
> >
> > However, after 6), the DRAM will be filled with pages that might or
> > might not be reclaimable, depending on how they are going to be used.
> > Since they are potentially reclaimable anyway, it's worth trying
> > reclaim again by allowing kswapd to work if needed.
> >
> > Signed-off-by: Byungchul Park <byungchul@xxxxxx>
> > ---
> >  include/linux/mmzone.h | 4 ++++
> >  mm/page_alloc.c        | 12 ++++++++++++
> >  mm/vmscan.c            | 21 ++++++++++++++++-----
> >  3 files changed, 32 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index c11b7cde81ef..7c0ba90ea7b4 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -1331,6 +1331,10 @@ typedef struct pglist_data {
> >  	enum zone_type kswapd_highest_zoneidx;
> >
> >  	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
> > +	int nr_may_reclaimable;		/* Number of pages that have been
> > +					   allocated since the node was
> > +					   considered hopeless due to too
> > +					   many kswapd_failures. */
> >
> >  #ifdef CONFIG_COMPACTION
> >  	int kcompactd_max_order;
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 14d39f34d336..1dd2daede014 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1538,8 +1538,20 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
> >  static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
> >  							unsigned int alloc_flags)
> >  {
> > +	pg_data_t *pgdat = page_pgdat(page);
> > +
> >  	post_alloc_hook(page, order, gfp_flags);
> >
> > +	/*
> > +	 * New pages might or might not be reclaimable depending on how
> > +	 * they are going to be used. However, since they are
> > +	 * potentially reclaimable, it's worth trying reclaim again by
> > +	 * allowing kswapd to work even after too many
> > +	 * ->kswapd_failures, once ->nr_may_reclaimable is big enough.
> > +	 */
> > +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> > +		pgdat->nr_may_reclaimable += 1 << order;
> > +
> >  	if (order && (gfp_flags & __GFP_COMP))
> >  		prep_compound_page(page, order);
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 3ef654addd44..5b39090c4ef1 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4943,6 +4943,7 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
> >  done:
> >  	/* kswapd should never fail */
> >  	pgdat->kswapd_failures = 0;
> > +	pgdat->nr_may_reclaimable = 0;
> >  }
> >
> >  /******************************************************************************
> > @@ -5991,9 +5992,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >  	 * sleep. On reclaim progress, reset the failure counter. A
> >  	 * successful direct reclaim run will revive a dormant kswapd.
> >  	 */
> > -	if (reclaimable)
> > +	if (reclaimable) {
> >  		pgdat->kswapd_failures = 0;
> > -	else if (sc->cache_trim_mode)
> > +		pgdat->nr_may_reclaimable = 0;
> > +	} else if (sc->cache_trim_mode)
> >  		sc->cache_trim_mode_failed = 1;
> >  }
> >
> > @@ -6636,6 +6638,11 @@ static void clear_pgdat_congested(pg_data_t *pgdat)
> >  	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
> >  }
> >
> > +static bool may_reclaimable(pg_data_t *pgdat, int order)
> > +{
> > +	return pgdat->nr_may_reclaimable >= 1 << order;
> > +}
> > +
> >  /*
> >   * Prepare kswapd for sleeping. This verifies that there are no processes
> >   * waiting in throttle_direct_reclaim() and that watermarks have been met.
> > @@ -6662,7 +6669,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order,
> >  		wake_up_all(&pgdat->pfmemalloc_wait);
> >
> >  	/* Hopeless node, leave it to direct reclaim */
> > -	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> > +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES &&
> > +	    !may_reclaimable(pgdat, order))
> >  		return true;
> >
> >  	if (pgdat_balanced(pgdat, order, highest_zoneidx)) {
> > @@ -6940,8 +6948,10 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
> >  		goto restart;
> >  	}
> >
> > -	if (!sc.nr_reclaimed)
> > +	if (!sc.nr_reclaimed) {
> >  		pgdat->kswapd_failures++;
> > +		pgdat->nr_may_reclaimable = 0;
> > +	}
> >
> >  out:
> >  	clear_reclaim_active(pgdat, highest_zoneidx);
> > @@ -7204,7 +7214,8 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
> >  		return;
> >
> >  	/* Hopeless node, leave it to direct reclaim if possible */
> > -	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ||
> > +	if ((pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES &&
> > +	     !may_reclaimable(pgdat, order)) ||
> >  	    (pgdat_balanced(pgdat, order, highest_zoneidx) &&
> >  	     !pgdat_watermark_boosted(pgdat, highest_zoneidx))) {
> >  		/*
> > --
> > 2.17.1
> >
> >