On 07/05/20 2:35 pm, Sandipan Das wrote:
>
> On 07/05/20 12:39 pm, Michal Hocko wrote:
>> On Wed 06-05-20 17:50:28, Vlastimil Babka wrote:
>>> [...]
>>>
>>> How about something like this:
>>>
>>> diff --git a/Documentation/admin-guide/numastat.rst b/Documentation/admin-guide/numastat.rst
>>> index aaf1667489f8..08ec2c2bdce3 100644
>>> --- a/Documentation/admin-guide/numastat.rst
>>> +++ b/Documentation/admin-guide/numastat.rst
>>> @@ -6,6 +6,21 @@ Numa policy hit/miss statistics
>>>
>>>  All units are pages. Hugepages have separate counters.
>>>
>>> +The numa_hit, numa_miss and numa_foreign counters reflect how well processes
>>> +are able to allocate memory from nodes they prefer. If they succeed, numa_hit
>>> +is incremented on the preferred node, otherwise numa_foreign is incremented on
>>> +the preferred node and numa_miss on the node where allocation succeeded.
>>> +
>>> +Usually preferred node is the one local to the CPU where the process executes,
>>> +but restrictions such as mempolicies can change that, so there are also two
>>> +counters based on CPU local node. local_node is similar to numa_hit and is
>>> +incremented on allocation from a node by CPU on the same node. other_node is
>>> +similar to numa_miss and is incremented on the node where allocation succeeds
>>> +from a CPU from a different node. Note there is no counter analogical to
>>> +numa_foreign.
>>> +
>>> +In more detail:
>>> +
>>>  =============== ============================================================
>>>  numa_hit        A process wanted to allocate memory from this node,
>>>                  and succeeded.
>>> @@ -14,11 +29,13 @@ numa_miss A process wanted to allocate memory from another node,
>>>                  but ended up with memory from this node.
>>>
>>>  numa_foreign    A process wanted to allocate on this node,
>>> -                but ended up with memory from another one.
>>> +                but ended up with memory from another node.
>>>
>>> -local_node      A process ran on this node and got memory from it.
>>> +local_node      A process ran on this node's CPU,
>>> +                and got memory from this node.
>>>
>>> -other_node      A process ran on this node and got memory from another node.
>>> +other_node      A process ran on a different node's CPU
>>> +                and got memory from this node.
>>>
>>>  interleave_hit  Interleaving wanted to allocate from this node
>>>                  and succeeded.
>>> @@ -28,3 +45,11 @@ For easier reading you can use the numastat utility from the numactl package
>>>  (http://oss.sgi.com/projects/libnuma/). Note that it only works
>>>  well right now on machines with a small number of CPUs.
>>>
>>> +Note that on systems with memoryless nodes (where a node has CPUs but no
>>> +memory) the numa_hit, numa_miss and numa_foreign statistics can be skewed
>>> +heavily. In the current kernel implementation, if a process prefers a
>>> +memoryless node (i.e. because it is running on one of its local CPU), the
>>> +implementation actually treats one of the nearest nodes with memory as the
>>> +preferred node. As a result, such allocation will not increase the numa_foreign
>>> +counter on the memoryless node, and will skew the numa_hit, numa_miss and
>>> +numa_foreign statistics of the nearest node.
>>
>> This is certainly an improvement. Thanks! The question whether we can
>> identify where bogus numbers came from would be interesting as well.
>> Maybe those are not worth fixing but it would be great to understand
>> them at least. I have to say that the explanation via boot_pageset is
>> not really clear to me.
>>
>
> The documentation update will definitely help. Thanks for that.
> I did collect some stack traces on a ppc64 guest for calls to zone_statistics()
> in case of zones that are using the boot_pageset and most of them originate
> from kmem_cache_init() with eventual calls to allocate_slab().
>
> [ 0.000000] [c00000000282b690] [c000000000402d98] zone_statistics+0x138/0x1d0
> [ 0.000000] [c00000000282b740] [c000000000401190] rmqueue_pcplist+0xf0/0x120
> [ 0.000000] [c00000000282b7d0] [c00000000040b178] get_page_from_freelist+0x2f8/0x2100
> [ 0.000000] [c00000000282bb30] [c000000000401ae0] __alloc_pages_nodemask+0x1a0/0x2d0
> [ 0.000000] [c00000000282bbc0] [c00000000044b040] alloc_slab_page+0x70/0x580
> [ 0.000000] [c00000000282bc20] [c00000000044b5f8] allocate_slab+0xa8/0x610
> ...
>
> In the remaining cases, the sources are ftrace_init() and early_trace_init().
>

Forgot to add that this happens during the period between zone_pcp_init()
and setup_zone_pageset().

- Sandipan
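
For reference, the counter rules described in the quoted documentation patch
can be modelled in a few lines of userspace C. This is only an illustrative
sketch of the documented semantics, not the kernel's zone_statistics();
MAX_NODES, struct node_stats, stats[] and account_alloc() are hypothetical
names made up for this example, and the memoryless-node and boot_pageset
cases discussed above are not modelled.

```c
#include <assert.h>

#define MAX_NODES 4 /* hypothetical node count for this sketch */

/* Per-node counters, named after the fields documented in numastat.rst. */
struct node_stats {
	unsigned long numa_hit;
	unsigned long numa_miss;
	unsigned long numa_foreign;
	unsigned long local_node;
	unsigned long other_node;
};

static struct node_stats stats[MAX_NODES];

/*
 * Account one page allocation (hypothetical helper, not a kernel API).
 *   preferred: node the allocation wanted memory from
 *   actual:    node the page really came from
 *   cpu_node:  node of the CPU the allocating process runs on
 */
static void account_alloc(int preferred, int actual, int cpu_node)
{
	assert(preferred >= 0 && preferred < MAX_NODES);
	assert(actual >= 0 && actual < MAX_NODES);
	assert(cpu_node >= 0 && cpu_node < MAX_NODES);

	if (actual == preferred) {
		/* Got memory from the node we wanted. */
		stats[actual].numa_hit++;
	} else {
		/* Miss is charged where the page came from, ... */
		stats[actual].numa_miss++;
		/* ... foreign is charged to the node we wanted. */
		stats[preferred].numa_foreign++;
	}

	/* local/other only compare the CPU's node with the serving node. */
	if (actual == cpu_node)
		stats[actual].local_node++;
	else
		stats[actual].other_node++;
}
```

With this model, account_alloc(1, 2, 1) bumps numa_foreign on node 1 and
numa_miss plus other_node on node 2, while account_alloc(1, 1, 0) (for
example a mempolicy preferring a remote node) bumps numa_hit and other_node
on node 1. That is the asymmetry the patch describes: there is no
per-CPU-node counterpart to numa_foreign.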