Re: [PATCH 0/2] Separate NUMA statistics from zone statistics

Jesper Dangaard Brouer <brouer@xxxxxxxxxx> · Tue, 15 Aug 2017 12:36:36 +0200

On Tue, 15 Aug 2017 16:45:34 +0800
Kemi Wang <kemi.wang@xxxxxxxxx> wrote:

> Each page allocation updates a set of per-zone statistics with a call to
> zone_statistics(). As discussed in 2017 MM submit, these are a substantial
                                             ^^^^^^ should be "summit"
> source of overhead in the page allocator and are very rarely consumed. This
> significant overhead in cache bouncing caused by zone counters (NUMA
> associated counters) update in parallel in multi-threaded page allocation
> (pointed out by Dave Hansen).

Hi Kemi

Thanks a lot for following up on this work. A link to the MM summit slides:
 http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf

> To mitigate this overhead, this patchset separates NUMA statistics from
> zone statistics framework, and update NUMA counter threshold to a fixed
> size of 32765, as a small threshold greatly increases the update frequency
> of the global counter from local per cpu counter (suggested by Ying Huang).
> The rationality is that these statistics counters don't need to be read
> often, unlike other VM counters, so it's not a problem to use a large
> threshold and make readers more expensive.
> 
> With this patchset, we see 26.6% drop of CPU cycles(537-->394, see below)
> for per single page allocation and reclaim on Jesper's page_bench03
> benchmark. Meanwhile, this patchset keeps the same style of virtual memory
> statistics with little end-user-visible effects (see the first patch for
> details), except that the number of NUMA items in each cpu
> (vm_numa_stat_diff[]) is added to zone->vm_numa_stat[] when a user *reads*
> the value of NUMA counter to eliminate deviation.
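
For readers following along, here is a rough userspace model of the
thresholding described above. It is only a sketch, not the patch itself:
the class and function names are made up for illustration, the "fold when
the local delta exceeds the threshold" comparison is an assumption, and a
plain Python loop stands in for real per-cpu updates. It shows why a larger
threshold means far fewer writes to the shared counter, while a reader still
gets an exact value by folding the per-cpu deltas in at read time.

```python
class NumaCounter:
    """Toy model of one NUMA counter with per-CPU local deltas (hypothetical names)."""

    def __init__(self, nr_cpus, threshold):
        self.threshold = threshold
        self.global_count = 0           # shared counter: the contended cache line
        self.cpu_diff = [0] * nr_cpus   # per-CPU deltas: cheap, uncontended updates
        self.global_updates = 0         # number of writes to the shared counter

    def inc(self, cpu):
        """Count one event on `cpu`; touch the shared counter only past the threshold."""
        self.cpu_diff[cpu] += 1
        if self.cpu_diff[cpu] > self.threshold:   # assumed comparison
            self.global_count += self.cpu_diff[cpu]
            self.cpu_diff[cpu] = 0
            self.global_updates += 1

    def read(self):
        """Fold every per-CPU delta into the shared counter, then return it."""
        for cpu, diff in enumerate(self.cpu_diff):
            self.global_count += diff
            self.cpu_diff[cpu] = 0
        return self.global_count


def simulate(threshold, nr_cpus=4, events_per_cpu=100_000):
    """Return (writes to the shared counter, exact total seen by a reader)."""
    counter = NumaCounter(nr_cpus, threshold)
    for cpu in range(nr_cpus):
        for _ in range(events_per_cpu):
            counter.inc(cpu)
    updates = counter.global_updates   # sample before the read-side fold
    return updates, counter.read()


if __name__ == "__main__":
    for threshold in (32, 125, 32765):
        updates, total = simulate(threshold)
        print(f"threshold={threshold:6d} shared-counter writes={updates:6d} total={total}")
```

The cost moves to the reader, which has to walk every per-cpu delta. That
matches the argument in the cover letter that these counters are read rarely.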

I'm very happy to see that you found my kernel module for benchmarking useful :-)

> I did an experiment of single page allocation and reclaim concurrently
> using Jesper's page_bench03 benchmark on a 2-Socket Broadwell-based server
> (88 processors with 126G memory) with different size of threshold of pcp
> counter.
> 
> Benchmark provided by Jesper D Broucer(increase loop times to 10000000):
                                 ^^^^^^^
You misspelled my last name; it is "Brouer".

> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
> 
>    Threshold   CPU cycles    Throughput(88 threads)
>       32        799         241760478
>       64        640         301628829
>       125       537         358906028 <==> system by default
>       256       468         412397590
>       512       428         450550704
>       4096      399         482520943
>       20000     394         489009617
>       30000     395         488017817
>       32765     394(-26.6%) 488932078(+36.2%) <==> with this patchset
>       N/A       342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
> 
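
The percentages in the table can be re-derived from the raw numbers. A small
sketch, assuming the threshold-125 row (the system default) is the baseline
for both columns; the helper name pct_change is just for illustration:

```python
def pct_change(baseline, value):
    """Relative change of `value` versus `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Baseline: threshold 125 (system default), numbers taken from the table.
BASELINE_CYCLES = 537
BASELINE_THROUGHPUT = 358906028

# label -> (CPU cycles, throughput with 88 threads), also from the table.
rows = {
    "threshold 32765 (this patchset)": (394, 488932078),
    "zone_statistics disabled": (342, 562900157),
}

if __name__ == "__main__":
    for label, (cycles, throughput) in rows.items():
        print(
            f"{label}: "
            f"cycles {pct_change(BASELINE_CYCLES, cycles):+.1f}%, "
            f"throughput {pct_change(BASELINE_THROUGHPUT, throughput):+.1f}%"
        )
```
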
> Kemi Wang (2):
>   mm: Change the call sites of numa statistics items
>   mm: Update NUMA counter threshold size
> 
>  drivers/base/node.c    |  22 ++++---
>  include/linux/mmzone.h |  25 +++++---
>  include/linux/vmstat.h |  33 ++++++++++
>  mm/page_alloc.c        |  10 +--
>  mm/vmstat.c            | 162 +++++++++++++++++++++++++++++++++++++++++++++++--
>  5 files changed, 227 insertions(+), 25 deletions(-)
> 

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
