Re: memcgroup lruvec_lru_size scaling issue

On Tue 15-10-19 13:38:31, Andrew Morton wrote:
> On Tue, 15 Oct 2019 08:19:20 +0200 Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> 
> > I dunno, but squashing those two changelogs sounds more confusing than
> > helpful to me. What about the following instead?
> 
> 
> 
> From: Honglei Wang <honglei.wang@xxxxxxxxxx>
> Subject: mm: memcg: get number of pages on the LRU list in memcgroup based on lru_zone_size
> 
> Commit 1a61ab8038e72 ("mm: memcontrol: replace zone summing with
> lruvec_page_state()") made lruvec_page_state() use per-cpu counters
> instead of calculating the value directly from lru_zone_size, with the
> idea that this would be more efficient.  Tim has reported that this is
> not the case for their database benchmark, which shows the opposite
> result: lruvec_page_state() takes up a huge chunk of CPU cycles (about
> 25% of the system time, which is roughly 7% of total CPU cycles) on 5.3
> kernels.  The workload runs on a large machine (96 CPUs), has many
> cgroups (500), and is heavily direct reclaim bound.
> 
> Tim Chen said:
> 
> : The problem can also be reproduced by running simple multi-threaded
> : pmbench benchmark with a fast Optane SSD swap (see profile below).
> : 
> : 
> : 6.15%     3.08%  pmbench          [kernel.vmlinux]            [k] lruvec_lru_size
> :             |
> :             |--3.07%--lruvec_lru_size
> :             |          |
> :             |          |--2.11%--cpumask_next
> :             |          |          |
> :             |          |           --1.66%--find_next_bit
> :             |          |
> :             |           --0.57%--call_function_interrupt
> :             |                     |
> :             |                      --0.55%--smp_call_function_interrupt
> :             |
> :             |--1.59%--0x441f0fc3d009
> :             |          _ops_rdtsc_init_base_freq
> :             |          access_histogram
> :             |          page_fault
> :             |          __do_page_fault
> :             |          handle_mm_fault
> :             |          __handle_mm_fault
> :             |          |
> :             |           --1.54%--do_swap_page
> :             |                     swapin_readahead
> :             |                     swap_cluster_readahead
> :             |                     |
> :             |                      --1.53%--read_swap_cache_async
> :             |                                __read_swap_cache_async
> :             |                                alloc_pages_vma
> :             |                                __alloc_pages_nodemask
> :             |                                __alloc_pages_slowpath
> :             |                                try_to_free_pages
> :             |                                do_try_to_free_pages
> :             |                                shrink_node
> :             |                                shrink_node_memcg
> :             |                                |
> :             |                                |--0.77%--lruvec_lru_size
> :             |                                |
> :             |                                 --0.76%--inactive_list_is_low
> :             |                                           |
> :             |                                            --0.76%--lruvec_lru_size
> :             |
> :              --1.50%--measure_read
> :                        page_fault
> :                        __do_page_fault
> :                        handle_mm_fault
> :                        __handle_mm_fault
> :                        do_swap_page
> :                        swapin_readahead
> :                        swap_cluster_readahead
> :                        |
> :                         --1.48%--read_swap_cache_async
> :                                   __read_swap_cache_async
> :                                   alloc_pages_vma
> :                                   __alloc_pages_nodemask
> :                                   __alloc_pages_slowpath
> :                                   try_to_free_pages
> :                                   do_try_to_free_pages
> :                                   shrink_node
> :                                   shrink_node_memcg
> :                                   |
> :                                   |--0.75%--inactive_list_is_low
> :                                   |          |
> :                                   |           --0.75%--lruvec_lru_size
> :                                   |
> :                                    --0.73%--lruvec_lru_size
> 
> 
> The likely culprit is the cache traffic that lruvec_page_state_local()
> generates. Dave Hansen says:
> 
> : I was thinking purely of the cache footprint.  If it's reading
> : pn->lruvec_stat_local->count[idx], that's three separate cachelines, so 192
> : bytes of cache * 96 CPUs = 18k of data, mostly read-only.  1 cgroup would
> : be 18k of data for the whole system and the caching would be pretty
> : efficient and all 18k would probably survive a tight page fault loop in
> : the L1.  500 cgroups would be ~90k of data per CPU thread which doesn't
> : fit in the L1 and probably wouldn't survive a tight page fault loop if
> : both logical threads were banging on different cgroups.
> : 
> : It's just a theory, but it's why I noted the number of cgroups when I
> : initially saw this show up in profiles.

Btw. that theory could be confirmed by an increased number of cache
misses IIUC. Tim, could you give it a try please?
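
For context, the summation path in question looks roughly like this.
This is paraphrased from the 5.3 sources (the real helper lives in
include/linux/memcontrol.h), so the exact details may differ slightly:

#include <linux/memcontrol.h>
#include <linux/vmstat.h>

/*
 * Paraphrased sketch of the 5.3-era lruvec_page_state_local().  The
 * important part is the for_each_possible_cpu() walk: every lookup sums
 * a per-cpu counter across all CPUs, so the data touched grows with both
 * the CPU count and the number of cgroups being reclaimed (Dave's ~192
 * bytes per cgroup arithmetic above).
 */
static unsigned long lruvec_page_state_local_sketch(struct lruvec *lruvec,
						    enum node_stat_item idx)
{
	struct mem_cgroup_per_node *pn;
	long x = 0;
	int cpu;

	if (mem_cgroup_disabled())
		return node_page_state(lruvec_pgdat(lruvec), idx);

	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
	for_each_possible_cpu(cpu)
		x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);

	if (x < 0)	/* per-cpu deltas can make the sum go negative */
		x = 0;
	return x;
}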

> 
> Fix the regression by partially reverting the said commit and
> calculating the LRU size explicitly.
> 
> Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@xxxxxxxxxx
> Fixes: 1a61ab8038e72 ("mm: memcontrol: replace zone summing with lruvec_page_state()")
> Signed-off-by: Honglei Wang <honglei.wang@xxxxxxxxxx>
> Reported-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
> Acked-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
> Tested-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
> Cc: Vladimir Davydov <vdavydov.dev@xxxxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: Roman Gushchin <guro@xxxxxx>
> Cc: Tejun Heo <tj@xxxxxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
> Cc: Dave Hansen <dave.hansen@xxxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx>	[5.2+]
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>

Acked-by: Michal Hocko <mhocko@xxxxxxxx>

> ---
> 
>  mm/vmscan.c |    9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> --- a/mm/vmscan.c~mm-vmscan-get-number-of-pages-on-the-lru-list-in-memcgroup-base-on-lru_zone_size
> +++ a/mm/vmscan.c
> @@ -351,12 +351,13 @@ unsigned long zone_reclaimable_pages(str
>   */
>  unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
>  {
> -	unsigned long lru_size;
> +	unsigned long lru_size = 0;
>  	int zid;
>  
> -	if (!mem_cgroup_disabled())
> -		lru_size = lruvec_page_state_local(lruvec, NR_LRU_BASE + lru);
> -	else
> +	if (!mem_cgroup_disabled()) {
> +		for (zid = 0; zid < MAX_NR_ZONES; zid++)
> +			lru_size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
> +	} else
>  		lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
>  
>  	for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
> _
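
For reference (not part of the patch), the helper the fix switches to
boils down to a plain array read; roughly, paraphrasing the
include/linux/memcontrol.h helper from memory:

/*
 * Paraphrased sketch of mem_cgroup_get_zone_lru_size().  It reads a
 * counter that is maintained at LRU add/remove time, so the cost and
 * cache footprint of lruvec_lru_size() no longer scale with the number
 * of possible CPUs.
 */
static unsigned long mem_cgroup_get_zone_lru_size_sketch(struct lruvec *lruvec,
							  enum lru_list lru,
							  int zone_idx)
{
	struct mem_cgroup_per_node *mz;

	mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
	return mz->lru_zone_size[zone_idx][lru];
}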

-- 
Michal Hocko
SUSE Labs



