On 5/21/21 3:28 AM, Mel Gorman wrote:
> Note that in this patch the pcp->high values are adjusted after memory
> hotplug events, min_free_kbytes adjustments and watermark scale factor
> adjustments but not CPU hotplug events.

Not that it was a long wait to figure it out, but I'd probably say:

	"CPU hotplug events are handled later in the series"

instead of just saying they're not handled.

> Before grep -E "high:|batch" /proc/zoneinfo | tail -2
>   high:  378
>   batch: 63
>
> After grep -E "high:|batch" /proc/zoneinfo | tail -2
>   high:  649
>   batch: 63

You noted the relationship between pcp->high and zone lock contention.
Larger ->high values mean less contention.  It's probably also worth
noting the trend of having more logical CPUs per NUMA node.  I have the
feeling when this was put in place it wasn't uncommon to have somewhere
between 1 and 8 CPUs in a node pounding on a zone.  Today, having ~60
is common.  I've occasionally resorted to recommending that folks
enable hardware features like Sub-NUMA-Clustering [1] since it
increases the number of zones and decreases the number of CPUs pounding
on each zone lock.

1. https://software.intel.com/content/www/us/en/develop/articles/intel-xeon-processor-scalable-family-technical-overview.html

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a48f305f0381..bf5cdc466e6c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2163,14 +2163,6 @@ void __init page_alloc_init_late(void)
>  	/* Block until all are initialised */
>  	wait_for_completion(&pgdat_init_all_done_comp);
>
> -	/*
> -	 * The number of managed pages has changed due to the initialisation
> -	 * so the pcpu batch and high limits needs to be updated or the limits
> -	 * will be artificially small.
> -	 */
> -	for_each_populated_zone(zone)
> -		zone_pcp_update(zone);
> -
>  	/*
>  	 * We initialized the rest of the deferred pages.  Permanently disable
>  	 * on-demand struct page initialization.
> @@ -6594,13 +6586,12 @@ static int zone_batchsize(struct zone *zone)
>  	int batch;
>
>  	/*
> -	 * The per-cpu-pages pools are set to around 1000th of the
> -	 * size of the zone.
> +	 * The number of pages to batch allocate is either 0.1%

Probably worth making that "~0.1%" just in case someone goes looking
for the /1000 and can't find it.

> +	 * of the zone or 1MB, whichever is smaller. The batch
> +	 * size is striking a balance between allocation latency
> +	 * and zone lock contention.
>  	 */
> -	batch = zone_managed_pages(zone) / 1024;
> -	/* But no more than a meg. */
> -	if (batch * PAGE_SIZE > 1024 * 1024)
> -		batch = (1024 * 1024) / PAGE_SIZE;
> +	batch = min(zone_managed_pages(zone) >> 10, (1024 * 1024) / PAGE_SIZE);
>  	batch /= 4;		/* We effectively *= 4 below */
>  	if (batch < 1)
>  		batch = 1;
> @@ -6637,6 +6628,27 @@ static int zone_batchsize(struct zone *zone)
>  #endif
>  }
>
> +static int zone_highsize(struct zone *zone)
> +{
> +#ifdef CONFIG_MMU
> +	int high;
> +	int nr_local_cpus;
> +
> +	/*
> +	 * The high value of the pcp is based on the zone low watermark
> +	 * when reclaim is potentially active spread across the online
> +	 * CPUs local to a zone. Note that early in boot that CPUs may
> +	 * not be online yet.
> +	 */

FWIW, I like the way the changelog talked about this a bit better, with
the goal of avoiding background reclaim even in the face of a bunch of
full pcp's.

> +	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
> +	high = low_wmark_pages(zone) / nr_local_cpus;

I'm a little concerned that this might get out of hand on really big
nodes with no CPUs.
For persistent memory (which we *do* toss into the page allocator for
volatile use), we can have multi-terabyte zones with no CPUs in the
node.

Also, while the CPUs which are on the node are the ones *most* likely
to be hitting the ->high limit, we do *keep* a pcp for each possible
CPU.  So, the amount of memory which can actually be sequestered is
num_online_cpus()*high.  Right?

*That* might really get out of hand if we have nr_local_cpus=1.

We might want some overall cap on 'high', or even to scale it
differently for the zone-local cpus' pcps versus remote.
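To put some rough numbers behind that worry, here's a userspace
back-of-the-envelope sketch.  It only mimics the
high = low_wmark_pages(zone) / nr_local_cpus arithmetic from the patch;
the 4TB zone size, the ~0.1% low-watermark guess (roughly what the
default watermark_scale_factor gives, ignoring min_free_kbytes) and the
256 online CPUs are all numbers I made up, not measurements:

/*
 * Back-of-the-envelope for the pcp->high concern above.  All numbers
 * are invented for illustration: a 4TB CPU-less (pmem) zone, a low
 * watermark of ~0.1% of the zone, and 256 online CPUs elsewhere in
 * the system.  Build with: cc -o pcp-high pcp-high.c
 */
#include <stdio.h>

int main(void)
{
	unsigned long long page_size = 4096;
	unsigned long long zone_bytes = 4ULL << 40;	/* 4TB zone, no local CPUs */
	unsigned long long managed_pages = zone_bytes / page_size;

	/* crude stand-in for low_wmark_pages(zone): ~0.1% of the zone */
	unsigned long long low_wmark_pages = managed_pages / 1000;

	/* cpumask_weight() of a CPU-less node gets clamped to 1 */
	unsigned long long nr_local_cpus = 1;
	unsigned long long high = low_wmark_pages / nr_local_cpus;

	/* every online CPU has a pcp that can fill up to 'high' */
	unsigned long long nr_online_cpus = 256;
	unsigned long long sequestered = nr_online_cpus * high;

	printf("pcp->high:          %llu pages (%llu MB per CPU)\n",
	       high, high * page_size >> 20);
	printf("worst case in pcps: %llu pages (%llu GB)\n",
	       sequestered, sequestered * page_size >> 30);
	return 0;
}

With those (made-up) numbers a single pcp could soak up ~4GB, and a few
hundred online CPUs could in theory sit on around a terabyte of pages
without anything ever tripping ->high, which is why a cap, or scaling
local versus remote pcps differently, seems worth considering.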