Hi, David

> > 		else
> > 			drain_pages(cpu);
> > +		cond_resched();
> > 	}
> > 
> > 	mutex_unlock(&pcpu_drain_mutex);
> 
> This is another example of a soft lockup that we haven't observed and we
> have systems with many more cores than 64.

It seems the cause of this issue is not the number of CPUs as such, but
rather the ratio of memory capacity to the number of CPUs, or the total
memory capacity itself.

For example, my machine has 64 CPUs and 256 GB of memory on a single NUMA
node. In the current kernel, for a single zone, the amount of memory that
can be held in the PCP (per-CPU pages) lists across all CPUs is roughly
one-eighth of the memory in that zone.

So, in the worst case on my machine, the PCP lists of the NORMAL zone can
hold about 32 GB in total (one-eighth of the roughly 256 GB in that zone),
which with 64 CPUs comes to about 512 MB per CPU. With a 4 KB page size,
that is about 128K (100K+) pages per CPU sitting in the PCP lists (a small
back-of-the-envelope sketch of this arithmetic is appended at the end of
this mail).

Although the PCP auto-tuning algorithm starts to shrink the PCP capacity
when memory is tight (for example, when the zone falls below the high
watermark or is under reclaim), it relies on allocation/free activity on
that CPU, or on a piece of delayed work, to trigger the shrinking, and
neither of those is very predictable.

> 
> Is this happening because of contention on pcp->lock or zone->lock? I
> would assume the latter, but best to confirm.

You are right: we are running memory stress tests, and zone->lock is
indeed the hotspot.

> I think this is just papering over a scalability problem with zone->lock.
> How many NUMA nodes and zones does this 223GB system have?
> 
> If this is a problem with zone->lock, this problem should likely be
> addressed more holistically.

You are right that zone->lock can become a hotspot on larger machines, but
solving that fundamentally does not look easy. The PCP feature already
takes the approach of aggregating work into batches to reduce the pressure
on zone->lock. Another idea is to break the critical sections into smaller
pieces, but I am not sure whether that is feasible.

Best Regards
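
P.S. For reference, a small userspace sketch of the arithmetic above. The
256 GB zone size, the 64-CPU count and the 1/8 ratio are the approximate
figures from this mail, not exact kernel constants; the real pcp->high
limits are tuned dynamically against the zone watermarks.

#include <stdio.h>

int main(void)
{
	/* Figures taken from the description above (approximations only). */
	unsigned long long zone_bytes = 256ULL << 30;	/* ~256 GB NORMAL zone */
	unsigned long long page_size  = 4096;		/* 4 KB pages */
	unsigned long long nr_cpus    = 64;

	unsigned long long pcp_total = zone_bytes / 8;	    /* ~1/8 of the zone, ~32 GB */
	unsigned long long per_cpu   = pcp_total / nr_cpus; /* ~512 MB per CPU */
	unsigned long long pages     = per_cpu / page_size; /* ~131072 (~128K) pages */

	printf("aggregate PCP capacity: %llu MB\n", pcp_total >> 20);
	printf("worst-case per-CPU PCP: %llu MB, %llu pages\n",
	       per_cpu >> 20, pages);
	return 0;
}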