Re: [PATCH] mm: vmstat: Use zeroed stats for unpopulated zones

On 07/05/20 2:35 pm, Sandipan Das wrote:
> 
> On 07/05/20 12:39 pm, Michal Hocko wrote:
>> On Wed 06-05-20 17:50:28, Vlastimil Babka wrote:
>>> [...]
>>>
>>> How about something like this:
>>>
>>> diff --git a/Documentation/admin-guide/numastat.rst b/Documentation/admin-guide/numastat.rst
>>> index aaf1667489f8..08ec2c2bdce3 100644
>>> --- a/Documentation/admin-guide/numastat.rst
>>> +++ b/Documentation/admin-guide/numastat.rst
>>> @@ -6,6 +6,21 @@ Numa policy hit/miss statistics
>>>  
>>>  All units are pages. Hugepages have separate counters.
>>>  
>>> +The numa_hit, numa_miss and numa_foreign counters reflect how well processes
>>> +are able to allocate memory from nodes they prefer. If they succeed, numa_hit
>>> +is incremented on the preferred node, otherwise numa_foreign is incremented on
>>> +the preferred node and numa_miss on the node where allocation succeeded.
>>> +
>>> +Usually the preferred node is the one local to the CPU where the process executes,
>>> +but restrictions such as mempolicies can change that, so there are also two
>>> +counters based on CPU local node. local_node is similar to numa_hit and is
>>> +incremented on allocation from a node by CPU on the same node. other_node is
>>> +similar to numa_miss and is incremented on the node where allocation succeeds
>>> +from a CPU from a different node. Note there is no counter analogous to
>>> +numa_foreign.
>>> +
>>> +In more detail:
>>> +
>>>  =============== ============================================================
>>>  numa_hit	A process wanted to allocate memory from this node,
>>>  		and succeeded.
>>> @@ -14,11 +29,13 @@ numa_miss	A process wanted to allocate memory from another node,
>>>  		but ended up with memory from this node.
>>>  
>>>  numa_foreign	A process wanted to allocate on this node,
>>> -		but ended up with memory from another one.
>>> +		but ended up with memory from another node.
>>>  
>>> -local_node	A process ran on this node and got memory from it.
>>> +local_node	A process ran on this node's CPU,
>>> +		and got memory from this node.
>>>  
>>> -other_node	A process ran on this node and got memory from another node.
>>> +other_node	A process ran on a different node's CPU
>>> +		and got memory from this node.
>>>  
>>>  interleave_hit 	Interleaving wanted to allocate from this node
>>>  		and succeeded.
>>> @@ -28,3 +45,11 @@ For easier reading you can use the numastat utility from the numactl package
>>>  (http://oss.sgi.com/projects/libnuma/). Note that it only works
>>>  well right now on machines with a small number of CPUs.
>>>  
>>> +Note that on systems with memoryless nodes (where a node has CPUs but no
>>> +memory) the numa_hit, numa_miss and numa_foreign statistics can be skewed
>>> +heavily. In the current kernel implementation, if a process prefers a
>>> +memoryless node (e.g. because it is running on one of its local CPUs), the
>>> +implementation actually treats one of the nearest nodes with memory as the
>>> +preferred node. As a result, such allocation will not increase the numa_foreign
>>> +counter on the memoryless node, and will skew the numa_hit, numa_miss and
>>> +numa_foreign statistics of the nearest node.
>>
>> This is certainly an improvement. Thanks! The question whether we can
>> identify where bogus numbers came from would be interesting as well.
>> Maybe those are not worth fixing but it would be great to understand
>> them at least. I have to say that the explanation via boot_pageset is
>> not really clear to me.
>>
> 
> The documentation update will definitely help. Thanks for that.
> I did collect some stack traces on a ppc64 guest for calls to zone_statistics()
> in case of zones that are using the boot_pageset and most of them originate
> from kmem_cache_init() with eventual calls to allocate_slab().
> 
> [    0.000000] [c00000000282b690] [c000000000402d98] zone_statistics+0x138/0x1d0
> [    0.000000] [c00000000282b740] [c000000000401190] rmqueue_pcplist+0xf0/0x120
> [    0.000000] [c00000000282b7d0] [c00000000040b178] get_page_from_freelist+0x2f8/0x2100
> [    0.000000] [c00000000282bb30] [c000000000401ae0] __alloc_pages_nodemask+0x1a0/0x2d0
> [    0.000000] [c00000000282bbc0] [c00000000044b040] alloc_slab_page+0x70/0x580
> [    0.000000] [c00000000282bc20] [c00000000044b5f8] allocate_slab+0xa8/0x610
> ...
> 
> In the remaining cases, the sources are ftrace_init() and early_trace_init().
> 

I forgot to add that these calls happen during the period between
zone_pcp_init() and setup_zone_pageset().

- Sandipan
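
For readers following the counter semantics in the quoted documentation
patch, here is a rough userspace model of them. It is only a sketch of the
rules as the patch text describes them, not the kernel code: the names
new_stats(), account() and effective_preferred(), and the has_memory and
nearest tables, are made up for illustration.

```python
from collections import defaultdict

COUNTERS = ("numa_hit", "numa_miss", "numa_foreign", "local_node", "other_node")


def new_stats():
    """Per-node counters, all starting at zero (hypothetical helper)."""
    return defaultdict(lambda: dict.fromkeys(COUNTERS, 0))


def effective_preferred(preferred, has_memory, nearest):
    """Memoryless-node case from the patch text: the nearest node with
    memory is treated as the preferred node (assumed lookup tables)."""
    return preferred if has_memory[preferred] else nearest[preferred]


def account(stats, preferred, allocated, cpu_node):
    """Update the counters for one page allocation.

    preferred -- node the allocation was aimed at
    allocated -- node the page actually came from
    cpu_node  -- node of the CPU that performed the allocation
    """
    if allocated == preferred:
        stats[allocated]["numa_hit"] += 1
    else:
        stats[allocated]["numa_miss"] += 1
        stats[preferred]["numa_foreign"] += 1
    # local_node/other_node compare against the CPU's node,
    # not against the preferred node.
    if allocated == cpu_node:
        stats[allocated]["local_node"] += 1
    else:
        stats[allocated]["other_node"] += 1


# Node 0 has CPUs but no memory; node 1 is its nearest node with memory.
has_memory = {0: False, 1: True}
nearest = {0: 1}

stats = new_stats()
preferred = effective_preferred(0, has_memory, nearest)
account(stats, preferred=preferred, allocated=1, cpu_node=0)
print(dict(stats[0]), dict(stats[1]))
```

In this model, a process on the memoryless node 0 ends up bumping numa_hit
and other_node on node 1, while numa_foreign on node 0 stays at zero, which
is the skew the last paragraph of the patch warns about.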



