On 5/7/20 6:29 PM, Sandipan Das wrote:
> Initially, the per-cpu pagesets of each zone are set to the
> boot pagesets. The real pagesets are allocated later but
> before that happens, page allocations do occur and the numa
> stats for the boot pagesets get incremented since they are
> common to all zones at that point.
>
> The real pagesets, however, are allocated for the populated
> zones only. Unpopulated zones, like those associated with
> memory-less nodes, continue using the boot pageset and end
> up skewing the numa stats of the corresponding node.
>
> E.g.
>
> $ numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 cpus: 4 5 6 7
> node 1 size: 8131 MB
> node 1 free: 6980 MB
> node distances:
> node   0   1
>   0:  10  40
>   1:  40  10
>
> $ numastat
>                            node0           node1
> numa_hit                     108           56495
> numa_miss                      0               0
> numa_foreign                   0               0
> interleave_hit                 0            4537
> local_node                   108           31547
> other_node                     0           24948
>
> Hence, the boot pageset stats need to be cleared after
> the real pagesets are allocated.
>
> From this point onwards, the stats of the boot pagesets do
> not change as page allocations requested for a memory-less
> node will either fail (if __GFP_THISNODE is used) or get
> fulfilled by a preferred zone of a different node based on
> the fallback zonelist.
>
> Signed-off-by: Sandipan Das <sandipan@xxxxxxxxxxxxx>

Acked-by: Vlastimil Babka <vbabka@xxxxxxx>

With suggestion below.

> ---
>
> The previous version and discussion around it can be found at
> https://lore.kernel.org/linux-mm/20200504070304.127361-1-sandipan@xxxxxxxxxxxxx/
>
> Changes in v2:
>
> - Reset the stats of the boot pagesets instead of explicitly
>   returning zero as suggested by Vlastimil.
>
> - Changed the subject to reflect the above.
> ---
>  mm/page_alloc.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 69827d4fa052..1543e32f7e4e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6256,6 +6256,25 @@ void __init setup_per_cpu_pageset(void)
>  	for_each_populated_zone(zone)
>  		setup_zone_pageset(zone);
>
> +#ifdef CONFIG_NUMA
> +	if (static_branch_likely(&vm_numa_stat_key)) {

I would just remove this test and do it unconditionally, as the branch
can only be disabled later in boot by a sysctl.

> +		struct per_cpu_pageset *pcp;
> +		int cpu;
> +
> +		/*
> +		 * Unpopulated zones continue using the boot pagesets.
> +		 * The numa stats for these pagesets need to be reset.
> +		 * Otherwise, they will end up skewing the stats of
> +		 * the nodes these zones are associated with.
> +		 */
> +		for_each_possible_cpu(cpu) {
> +			pcp = &per_cpu(boot_pageset, cpu);
> +			memset(pcp->vm_numa_stat_diff, 0,
> +			       sizeof(pcp->vm_numa_stat_diff));
> +		}
> +	}
> +#endif
> +
>  	for_each_online_pgdat(pgdat)
>  		pgdat->per_cpu_nodestats =
>  			alloc_percpu(struct per_cpu_nodestat);
>