Hi, David

> > 		else
> > 			drain_pages(cpu);
> > +		cond_resched();
> > 	}
> > 
> > 	mutex_unlock(&pcpu_drain_mutex);
> 
> This is another example of a soft lockup that we haven't observed and we
> have systems with many more cores than 64.

It seems the cause of this issue is not the number of CPUs as such, but
rather the ratio of memory capacity to the number of CPUs, or the total
memory capacity itself.

For example, my machine has 64 CPUs and 256 GB of memory on a single NUMA
node. In the current kernel, for a single zone, the amount of memory that
can be held in the PCP (per-CPU pages) lists across all CPUs is roughly
one-eighth of the memory in that zone.

So, in the worst case on my machine, the PCP lists of the NORMAL zone can
hold about 32 GB in total (one-eighth of the roughly 256 GB in that zone),
which with 64 CPUs comes to about 512 MB per CPU. With a 4 KB page size,
that is about 128K (100K+) pages per CPU sitting in the PCP lists (a small
back-of-the-envelope sketch of this arithmetic is appended at the end of
this mail).

Although the PCP auto-tuning algorithm starts to shrink the PCP capacity
when memory is tight (for example, when the zone falls below the high
watermark or is under reclaim), it relies on allocation/free activity on
that CPU, or on a piece of delayed work, to trigger the shrinking, and
neither of those is very predictable.

> 
> Is this happening because of contention on pcp->lock or zone->lock? I
> would assume the latter, but best to confirm.

You are right: we are running memory stress tests, and zone->lock is
indeed the hotspot.

> I think this is just papering over a scalability problem with zone->lock.
> How many NUMA nodes and zones does this 223GB system have?
> 
> If this is a problem with zone->lock, this problem should likely be
> addressed more holistically.

You are right that zone->lock can become a hotspot on larger machines, but
solving that fundamentally does not look easy. The PCP feature already
takes the approach of aggregating work into batches to reduce the pressure
on zone->lock. Another idea is to break the critical sections into smaller
pieces, but I am not sure whether that is feasible.

Best Regards
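
P.S. For reference, a small userspace sketch of the arithmetic above. The
256 GB zone size, the 64-CPU count and the 1/8 ratio are the approximate
figures from this mail, not exact kernel constants; the real pcp->high
limits are tuned dynamically against the zone watermarks.

#include <stdio.h>

int main(void)
{
	/* Figures taken from the description above (approximations only). */
	unsigned long long zone_bytes = 256ULL << 30;	/* ~256 GB NORMAL zone */
	unsigned long long page_size  = 4096;		/* 4 KB pages */
	unsigned long long nr_cpus    = 64;

	unsigned long long pcp_total = zone_bytes / 8;	    /* ~1/8 of the zone, ~32 GB */
	unsigned long long per_cpu   = pcp_total / nr_cpus; /* ~512 MB per CPU */
	unsigned long long pages     = per_cpu / page_size; /* ~131072 (~128K) pages */

	printf("aggregate PCP capacity: %llu MB\n", pcp_total >> 20);
	printf("worst-case per-CPU PCP: %llu MB, %llu pages\n",
	       per_cpu >> 20, pages);
	return 0;
}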