On 2017/11/30 16:53, Michal Hocko wrote:
> On Thu 30-11-17 13:56:13, kemi wrote:
>>
>>
>> On 2017/11/29 20:17, Michal Hocko wrote:
>>> On Tue 28-11-17 14:00:23, Kemi Wang wrote:
>>>> The existing implementation of NUMA counters is per logical CPU,
>>>> stored in zone->vm_numa_stat[] and separated by zone, plus a global
>>>> NUMA counter array vm_numa_stat[]. However, unlike the other vmstat
>>>> counters, NUMA stats don't affect the system's decisions and are
>>>> only read from /proc and /sys; that is a slow-path operation which
>>>> can likely tolerate higher overhead. Additionally, nodes usually
>>>> have only a single zone, except for node 0, and there isn't really
>>>> any use case where you need these hit counts separated by zone.
>>>>
>>>> Therefore, we can migrate the implementation of NUMA stats from
>>>> per-zone to per-node and get rid of the global NUMA counters. It is
>>>> good enough to keep everything in a per-CPU pointer of type u64 and
>>>> sum the values up when needed, as suggested by Andi Kleen. That
>>>> helps with code cleanup and enhancement (e.g. it saves more than
>>>> 130 lines of code).
>>>
>>> I agree. Having these stats per zone is a bit of an overcomplication.
>>> The only consumer is /proc/zoneinfo, and I would argue this doesn't
>>> justify the additional complexity. Who really needs to know per-zone
>>> broken-out numbers?
>>>
>>> Anyway, I haven't checked your implementation too deeply, but why
>>> don't you simply define a static percpu array for each NUMA node?
>>
>> To be honest, there are two other ways I can think of, listed below,
>> but I don't think they are simpler than my current implementation.
>> Maybe you have a better idea.
>>
>> static u64 __percpu vm_stat_numa[num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS];
>> But it's not correct.
>>
>> Or we could add a u64 percpu array of size NR_VM_NUMA_STAT_ITEMS to
>> struct pglist_data.
>>
>> My current implementation is quite straightforward: it combines all
>> of the local counters together, so a single percpu array of size
>> num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS is enough (see the
>> sketch below).
>
> Well, this is certainly a matter of taste. But let's have a look at
> what we have currently. We have per-zone, per-node and NUMA stats.
> That looks like one too many to me. Why don't we simply move the whole
> NUMA stat thingy into the per-node stats? The code would simplify even
> more. We are going to lose the per-zone data in /proc/zoneinfo, but we
> are losing that without your patch anyway. So I've just scratched the
> following on top of your patch, and the cumulative diff looks even
> better:
>
>  drivers/base/node.c    |  22 ++---
>  include/linux/mmzone.h |  22 ++---
>  include/linux/vmstat.h |  38 +--------
>  mm/mempolicy.c         |   2 +-
>  mm/page_alloc.c        |  20 ++---
>  mm/vmstat.c            | 221 +------------------------------------------------
>  6 files changed, 30 insertions(+), 295 deletions(-)
>
> I haven't tested it at all yet. This is just to show the idea.
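Before getting to your patch below: to make the alternative I described
above concrete, here is a minimal sketch of the "one dynamically
allocated percpu array" idea. The names and helpers are simplified for
illustration and are not the exact code of my patch. The static
definition quoted above cannot work because num_possible_nodes() is not
a compile-time constant, so the array has to be allocated at boot:

static u64 __percpu *vm_numa_stat;

static int __init numa_stat_init(void)
{
	/* num_possible_nodes() is only known at runtime, hence a
	 * dynamic percpu allocation instead of a static array. */
	vm_numa_stat = __alloc_percpu(num_possible_nodes() *
				      NR_VM_NUMA_STAT_ITEMS * sizeof(u64),
				      __alignof__(u64));
	return vm_numa_stat ? 0 : -ENOMEM;
}

/* Hot path: bump the CPU-local slot for (node, item).  Callers are
 * expected to have preemption disabled, like the other vmstat fast
 * paths. */
static inline void numa_stat_inc(int node, enum numa_stat_item item)
{
	u64 *stats = this_cpu_ptr(vm_numa_stat);

	stats[node * NR_VM_NUMA_STAT_ITEMS + item]++;
}

/* Slow path (/proc and /sys readers): sum the CPU-local slots on
 * demand; no global counter is maintained at all. */
static u64 numa_stat_read(int node, enum numa_stat_item item)
{
	u64 sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += per_cpu_ptr(vm_numa_stat, cpu)
			[node * NR_VM_NUMA_STAT_ITEMS + item];
	return sum;
}

The hot path never touches shared state; the only cost is that readers
iterate over all possible CPUs, which is fine for a /proc slow path.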
> ---
> commit 92f8f58d1b6cb5c54a5a197a42e02126a5f7ea1a
> Author: Michal Hocko <mhocko@xxxxxxxx>
> Date:   Thu Nov 30 09:49:45 2017 +0100
>
>     - move NUMA stats to node stats
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 0be5fbdadaac..315156310c99 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -190,17 +190,9 @@ static ssize_t node_read_vmstat(struct device *dev,
>  		n += sprintf(buf+n, "%s %lu\n", vmstat_text[i],
>  			     sum_zone_node_page_state(nid, i));
>
> -#ifdef CONFIG_NUMA
> -	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
> -		n += sprintf(buf+n, "%s %lu\n",
> -			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
> -			     node_numa_state_snapshot(nid, i));
> -#endif
> -
>  	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>  		n += sprintf(buf+n, "%s %lu\n",
> -			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
> -			     NR_VM_NUMA_STAT_ITEMS],
> +			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>  			     node_page_state(pgdat, i));
>
>  	return n;
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index b2d264f8c0c6..2c9c8b13c44b 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -115,20 +115,6 @@ struct zone_padding {
>  #define ZONE_PADDING(name)
>  #endif
>
> -#ifdef CONFIG_NUMA
> -enum numa_stat_item {
> -	NUMA_HIT,		/* allocated in intended node */
> -	NUMA_MISS,		/* allocated in non intended node */
> -	NUMA_FOREIGN,		/* was intended here, hit elsewhere */
> -	NUMA_INTERLEAVE_HIT,	/* interleaver preferred this zone */
> -	NUMA_LOCAL,		/* allocation from local node */
> -	NUMA_OTHER,		/* allocation from other node */
> -	NR_VM_NUMA_STAT_ITEMS
> -};
> -#else
> -#define NR_VM_NUMA_STAT_ITEMS 0
> -#endif
> -
>  enum zone_stat_item {
>  	/* First 128 byte cacheline (assuming 64 bit words) */
>  	NR_FREE_PAGES,
> @@ -180,6 +166,12 @@ enum node_stat_item {
>  	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
>  	NR_DIRTIED,		/* page dirtyings since bootup */
>  	NR_WRITTEN,		/* page writings since bootup */
> +	NUMA_HIT,		/* allocated in intended node */
> +	NUMA_MISS,		/* allocated in non intended node */
> +	NUMA_FOREIGN,		/* was intended here, hit elsewhere */
> +	NUMA_INTERLEAVE_HIT,	/* interleaver preferred this zone */
> +	NUMA_LOCAL,		/* allocation from local node */
> +	NUMA_OTHER,		/* allocation from other node */
>  	NR_VM_NODE_STAT_ITEMS
>  };
>
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index c07850f413de..cc1edd95e949 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -187,19 +187,15 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
>  #endif
>  	return x;
>  }
> -
>  #ifdef CONFIG_NUMA
> -extern void __inc_numa_state(struct zone *zone, enum numa_stat_item item);
> +extern unsigned long node_page_state(struct pglist_data *pgdat,
> +				     enum node_stat_item item);
>  extern unsigned long sum_zone_node_page_state(int node,
>  				 enum zone_stat_item item);
> -extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
> -extern unsigned long node_page_state(struct pglist_data *pgdat,
> -						enum node_stat_item item);
>  #else
>  #define sum_zone_node_page_state(node, item) global_zone_page_state(item)
>  #define node_page_state(node, item) global_node_page_state(item)
>  #endif /* CONFIG_NUMA */
> -
>  #define add_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, __d)
>  #define sub_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, -(__d))
>  #define add_node_page_state(__p, __i, __d) mod_node_page_state(__p, __i, __d)
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index f604b22ebb65..84e72f2b5748 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1939,7 +1939,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
>  		return page;
>  	if (page && page_to_nid(page) == nid) {
>  		preempt_disable();
> -		__inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT);
> +		inc_node_page_state(page, NUMA_INTERLEAVE_HIT);
>  		preempt_enable();
>  	}
>  	return page;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 044daba8c11a..c8e34157f7b8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2785,25 +2785,25 @@ int __isolate_free_page(struct page *page, unsigned int order)
>   *
>   * Must be called with interrupts disabled.
>   */
> -static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
> +static inline void zone_statistics(int preferred_nid, int page_nid)
>  {
>  #ifdef CONFIG_NUMA
> -	enum numa_stat_item local_stat = NUMA_LOCAL;
> +	enum node_stat_item local_stat = NUMA_LOCAL;
>
>  	/* skip numa counters update if numa stats is disabled */
>  	if (!static_branch_likely(&vm_numa_stat_key))
>  		return;
>
> -	if (z->node != numa_node_id())
> +	if (page_nid != numa_node_id())
>  		local_stat = NUMA_OTHER;
>
> -	if (z->node == preferred_zone->node)
> -		__inc_numa_state(z, NUMA_HIT);
> +	if (page_nid == preferred_nid)
> +		inc_node_state(NODE_DATA(page_nid), NUMA_HIT);
>  	else {
> -		__inc_numa_state(z, NUMA_MISS);
> -		__inc_numa_state(preferred_zone, NUMA_FOREIGN);
> +		inc_node_state(NODE_DATA(page_nid), NUMA_MISS);
> +		inc_node_state(NODE_DATA(preferred_nid), NUMA_FOREIGN);
>  	}

Your patch saves more code than mine because it reuses the node stats
framework for the NUMA stats. But it reintroduces a performance
regression, because inc_node_state() is limited by the per-CPU
threshold size (125 at most, see calculate_normal_threshold() in
vmstat.c). See the patch "1d90ca8 mm: update NUMA counter threshold
size" for details. This issue was originally reported by Jesper
Dangaard Brouer.
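To illustrate the regression, here is a simplified, single-threaded
userspace model of the two schemes; it is purely illustrative, not
kernel code. Design A mimics the generic node stats that
inc_node_state() uses: a small per-CPU s8 delta folded into a shared
global counter whenever it crosses the threshold, and because the delta
lives in an s8 the threshold can be at most 125. Design B mimics the
dedicated NUMA counters: a full per-CPU u64 that is never folded and is
only summed by readers.

/*
 * Simplified model of the two counter schemes; illustrative only.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define NCPUS		4
#define THRESHOLD	125	/* the s8-imposed cap mentioned above */

/* Design A: small per-CPU delta, folded into a shared counter. */
static long	global_count;		/* shared; contended in the kernel */
static int8_t	percpu_delta[NCPUS];
static long	fold_events;		/* hot-path trips to shared state */

static void inc_design_a(int cpu)
{
	if (++percpu_delta[cpu] > THRESHOLD) {
		global_count += percpu_delta[cpu];	/* shared-cacheline write */
		percpu_delta[cpu] = 0;
		fold_events++;
	}
}

/* Design B: per-CPU u64, summed only by the (rare) reader. */
static uint64_t percpu_count[NCPUS];

static void inc_design_b(int cpu)
{
	percpu_count[cpu]++;	/* purely CPU-local, never folded */
}

static uint64_t read_design_b(void)
{
	uint64_t sum = 0;

	for (int cpu = 0; cpu < NCPUS; cpu++)
		sum += percpu_count[cpu];
	return sum;
}

int main(void)
{
	for (int i = 0; i < 4 * 1000 * 1000; i++) {
		inc_design_a(i % NCPUS);
		inc_design_b(i % NCPUS);
	}
	printf("design A: %ld shared-counter updates for %" PRIu64 " events\n",
	       fold_events, read_design_b());
	return 0;
}

With design A, every CPU dirties the shared counter roughly once per
126 events on an allocation-heavy workload, which is the contention
that the threshold enlargement in 1d90ca8 was meant to avoid; with
design B, the hot path never leaves the local CPU.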