Re: memcgroup lruvec_lru_size scaling issue

On Mon 14-10-19 15:14:30, Andrew Morton wrote:
[...]
> From: Honglei Wang <honglei.wang@xxxxxxxxxx>
> Subject: mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size
> 
> lruvec_lru_size() invokes lruvec_page_state_local() to get the lru_size,
> which is based on lruvec_stat_local.count[] of mem_cgroup_per_node.
> This counter is updated in a batched way: incoming pages are not
> reflected in it until their number reaches MEMCG_CHARGE_BATCH, which is
> defined as 32.
> 
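For reference, the batching in question looks roughly like this -- a
simplified sketch modeled on __mod_memcg_state(); field names are
approximate and the hierarchy propagation is dropped:

	static void mod_memcg_state_batched(struct mem_cgroup *memcg,
					    int idx, int val)
	{
		/* accumulate the delta in a per-cpu counter first */
		long x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);

		/*
		 * Only fold into the shared counter once the batch is
		 * exceeded, so up to MEMCG_CHARGE_BATCH - 1 pages per cpu
		 * stay invisible to readers like lruvec_page_state_local().
		 */
		if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
			atomic_long_add(x, &memcg->vmstats_local[idx]);
			x = 0;
		}
		__this_cpu_write(memcg->vmstats_percpu->stat[idx], x);
	}
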
> The testcase in LTP madvise09[1] fails because small blocks of memory
> are not accounted.  It creates a new memcg and sets up 32 MADV_FREE
> pages.  Then it forks a child which introduces memory pressure in the
> memcg.  The MADV_FREE pages are expected to be released under that
> pressure, but 32 does not exceed MEMCG_CHARGE_BATCH, so these pages are
> not reflected in lruvec_stat_local.count[] until more pages come in to
> satisfy the batching threshold.  As a result the MADV_FREE pages cannot
> be freed under memory pressure, which conflicts with the semantics of
> MADV_FREE.
> 
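The failing scenario in userspace terms, roughly (a minimal sketch, not
the actual LTP test, which also sets a memory limit on the memcg and
checks the lazyfree counters):

	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		long pgsz = sysconf(_SC_PAGESIZE);
		size_t len = 32 * pgsz;	/* 32 pages == MEMCG_CHARGE_BATCH */
		char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return 1;
		memset(buf, 1, len);		/* dirty the pages */
		madvise(buf, len, MADV_FREE);	/* lazily freeable from now on */
		/*
		 * A forked child would now generate memory pressure in the
		 * same memcg; reclaim is expected to drop the clean
		 * MADV_FREE pages instead of swapping them out.
		 */
		return 0;
	}
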
> Getting the lru_size from lru_zone_size of mem_cgroup_per_node, which
> is not updated via batching, makes it more accurate in this scenario.
> 
> This is effectively a partial reversion of 1a61ab8038e72 ("mm: memcontrol:
> replace zone summing with lruvec_page_state()").
> 
> [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/madvise/madvise09.c
> 
> Tim said:
> 
> : We were running a database benchmark in a mem cgroup and found that
> : lruvec_lru_size is taking up a huge chunk of CPU cycles (about 25% of
> : our kernel time - about 7% of total cpu cycles) on the 5.3 kernel.
> : 
> : The main issue is the loop in lruvec_page_state_local called by
> : lruvec_lru_size in the mem cgroup path:
> : 
> : 	for_each_possible_cpu(cpu)
> : 		x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);
> : 
> : It is costly to loop through all the cpus to get the lru vec size
> : info, and doing this on our workload with 96 cpu threads and 500 mem
> : cgroups makes things much worse.  We can end up walking 96 cpus * 500
> : cgroups * 2 (main) LRUs worth of per-cpu counters, which is a lot of
> : data structures to be running through all the time.
> : 
> : Honglei's patch restores the previous method for computing lru_size
> : and is much more efficient.  We got a 20% throughput improvement in
> : our database benchmark with Honglei's patch, and lruvec_lru_size's
> : cpu overhead completely disappeared from the cpu profile.
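
For context, the whole summing path Tim refers to looks roughly like
this in 5.3 (abbreviated from include/linux/memcontrol.h):

	static inline unsigned long
	lruvec_page_state_local(struct lruvec *lruvec, enum node_stat_item idx)
	{
		struct mem_cgroup_per_node *pn;
		long x = 0;
		int cpu;

		if (mem_cgroup_disabled())
			return node_page_state(lruvec_pgdat(lruvec), idx);

		pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
		/* O(nr_possible_cpus) remote cachelines on every call */
		for_each_possible_cpu(cpu)
			x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);

		return x < 0 ? 0 : x;
	}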

I dunno, but squashing those two changelogs sounds more confusing than
helpful to me. What about the following instead?
"
1a61ab8038e72 ("mm: memcontrol: replace zone summing with
lruvec_page_state()") made lruvec_page_state use per-cpu counters
instead of calculating the value directly from lru_zone_size, with the
idea that this would be more effective. Tim has reported that this is
not the case for their database benchmark, which shows the opposite
result: lruvec_page_state takes up a huge chunk of CPU cycles (about
25% of the system time, which is roughly 7% of total cpu cycles) on
5.3 kernels. The workload runs on a large machine (96 cpus), has many
cgroups (500) and is heavily direct reclaim bound.

<Tim's perf profiles go here>

The likely culprit is the cache traffic lruvec_page_state_local
generates. Dave Hansen says:
: I was thinking purely of the cache footprint.  Reading
: pn->lruvec_stat_local->count[idx] touches three separate cachelines, so
: 192 bytes of cache * 96 CPUs = 18k of data, mostly read-only.  1 cgroup would
: be 18k of data for the whole system and the caching would be pretty
: efficient and all 18k would probably survive a tight page fault loop in
: the L1.  500 cgroups would be ~90k of data per CPU thread which doesn't
: fit in the L1 and probably wouldn't survive a tight page fault loop if
: both logical threads were banging on different cgroups.
: 
: It's just a theory, but it's why I noted the number of cgroups when I
: initially saw this show up in profiles.

Fix the regression by partially reverting the said commit and
calculating the lru size explicitly.
"

> Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@xxxxxxxxxx
> Fixes: 1a61ab8038e72 ("mm: memcontrol: replace zone summing with lruvec_page_state()")
> Signed-off-by: Honglei Wang <honglei.wang@xxxxxxxxxx>
> Reported-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
> Cc: Vladimir Davydov <vdavydov.dev@xxxxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: Roman Gushchin <guro@xxxxxx>
> Cc: Tejun Heo <tj@xxxxxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
> Cc: Dave Hansen <dave.hansen@xxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> ---
> 
>  mm/vmscan.c |    9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> --- a/mm/vmscan.c~mm-vmscan-get-number-of-pages-on-the-lru-list-in-memcgroup-base-on-lru_zone_size
> +++ a/mm/vmscan.c
> @@ -351,12 +351,13 @@ unsigned long zone_reclaimable_pages(str
>   */
>  unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
>  {
> -	unsigned long lru_size;
> +	unsigned long lru_size = 0;
>  	int zid;
>  
> -	if (!mem_cgroup_disabled())
> -		lru_size = lruvec_page_state_local(lruvec, NR_LRU_BASE + lru);
> -	else
> +	if (!mem_cgroup_disabled()) {
> +		for (zid = 0; zid < MAX_NR_ZONES; zid++)
> +			lru_size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
> +	} else
>  		lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
>  
>  	for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
> _
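
For completeness, the mem_cgroup_get_zone_lru_size() helper the patch
switches to is a plain read of the per-zone counter (roughly, from
include/linux/memcontrol.h):

	static inline
	unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
						   enum lru_list lru, int zone_idx)
	{
		struct mem_cgroup_per_node *mz;

		mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
		return mz->lru_zone_size[zone_idx][lru];
	}

so the MAX_NR_ZONES loop above does a handful of direct reads instead of
a for_each_possible_cpu() walk per call.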

-- 
Michal Hocko
SUSE Labs