Re: [PATCH 4.14] mm: memcontrol: fix excessive complexity in memory.stat reporting

Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> · Mon, 28 Dec 2020 12:31:23 +0100

On Mon, Dec 21, 2020 at 07:35:31PM +0000, Shaoying Xu wrote:
> From: Johannes Weiner <hannes@xxxxxxxxxxx>
> 
> [ Upstream commit a983b5ebee57209c99f68c8327072f25e0e6e3da ]
> 
> We've seen memory.stat reads in top-level cgroups take up to fourteen
> seconds during a userspace bug that created tens of thousands of ghost
> cgroups pinned by lingering page cache.
> 
> Even with a more reasonable number of cgroups, aggregating memory.stat
> is unnecessarily heavy.  The complexity is this:
> 
> 	nr_cgroups * nr_stat_items * nr_possible_cpus
> 
> where the stat items are ~70 at this point.  With 128 cgroups and 128
> CPUs - decent, not enormous setups - reading the top-level memory.stat
> has to aggregate over a million per-cpu counters.  This doesn't scale.
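
The "over a million" figure follows directly from the product above. A minimal C sketch of the arithmetic; the function name is hypothetical (not a kernel symbol), and 70 is the approximate stat item count quoted in the message:

```c
/*
 * Illustrative only: counters_to_aggregate() is a hypothetical helper,
 * not part of the kernel. It evaluates the complexity product for one
 * read of a top-level memory.stat.
 */
unsigned long counters_to_aggregate(unsigned long nr_cgroups,
				    unsigned long nr_stat_items,
				    unsigned long nr_possible_cpus)
{
	return nr_cgroups * nr_stat_items * nr_possible_cpus;
}
```

With 128 cgroups, roughly 70 stat items and 128 CPUs this gives 128 * 70 * 128 = 1,146,880 per-cpu counters visited per read.
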
> 
> Instead of spreading the source of truth across all CPUs, use the
> per-cpu counters merely to batch updates to shared atomic counters.
> 
> This is the same as the per-cpu stocks we use for charging memory to the
> shared atomic page_counters, and also the way the global vmstat counters
> are implemented.
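
The batching scheme described above can be sketched in userspace C11. This is an illustration under stated assumptions, not the code from the patch: the struct, the function names and both constants are hypothetical, CPUs are simulated by an array index, and the preemption and interrupt safety that real per-cpu accessors provide is ignored.

```c
#include <stdatomic.h>
#include <stdlib.h>

#define NR_CPUS_SKETCH	4	/* hypothetical CPU count for the sketch */
#define BATCH_SKETCH	32	/* assumed batch size, in pages */

struct stat_sketch {
	atomic_long shared;		/* the value a reader sees */
	long percpu[NR_CPUS_SKETCH];	/* pending, unflushed deltas */
};

/*
 * Apply a delta on behalf of one CPU. The shared atomic counter is
 * only touched once the pending per-cpu delta exceeds the batch size.
 */
void stat_mod(struct stat_sketch *s, int cpu, long val)
{
	long x = s->percpu[cpu] + val;

	if (labs(x) > BATCH_SKETCH) {
		atomic_fetch_add(&s->shared, x);
		x = 0;
	}
	s->percpu[cpu] = x;
}

/* A read is a single atomic load, independent of the CPU count. */
long stat_read(struct stat_sketch *s)
{
	return atomic_load(&s->shared);
}
```

In this sketch the per-cpu slots no longer hold the truth. They only hold what has not been spilled yet, so a reader never has to walk them.
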
> 
> Vmstat has elaborate spilling thresholds that depend on the number of
> CPUs, amount of memory, and memory pressure - carefully balancing the
> cost of counter updates with the amount of per-cpu error.  That's
> because the vmstat counters are system-wide, but also used for decisions
> inside the kernel (e.g.  NR_FREE_PAGES in the allocator).  Neither is
> true for the memory controller.
> 
> Use the same static batch size we already use for page_counter updates
> during charging.  The per-cpu error in the stats will be 128k, which is
> an acceptable ratio of cores to memory accounting granularity.
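
The 128k figure is consistent with a batch of 32 pages at 4 KiB per page. Both values are assumptions here, since the message does not state them, and the helper name is hypothetical. Under those assumptions, the worst-case drift of a shared counter can be sketched as:

```c
#define BATCH_PAGES_ASSUMED	32UL	/* assumed batch size, in pages */
#define PAGE_SIZE_ASSUMED	4096UL	/* assumed page size, in bytes */

/*
 * Upper bound on the bytes not yet folded into a shared counter:
 * each CPU may hold up to one batch of pending pages.
 */
unsigned long max_stat_error_bytes(unsigned long nr_cpus)
{
	return nr_cpus * BATCH_PAGES_ASSUMED * PAGE_SIZE_ASSUMED;
}
```

For one CPU this is 131072 bytes (128 KiB). For the 128-CPU machine used as the example earlier in the message, it is 16 MiB.
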
> 
> [hannes@xxxxxxxxxxx: fix warning in __this_cpu_xchg() calls]
>   Link: http://lkml.kernel.org/r/20171201135750.GB8097@xxxxxxxxxxx
> Link: http://lkml.kernel.org/r/20171103153336.24044-3-hannes@xxxxxxxxxxx
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> Acked-by: Vladimir Davydov <vdavydov.dev@xxxxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx c9019e9: mm: memcontrol: eliminate raw access to stat and event counters
> Cc: stable@xxxxxxxxxxxxxxx 2845426: mm: memcontrol: implement lruvec stat functions on top of each other
> Cc: stable@xxxxxxxxxxxxxxx
> [shaoyi@xxxxxxxxxx: resolved the conflict brought by commit 17ffa29c355658c8e9b19f56cbf0388500ca7905 in mm/memcontrol.c by contextual fix]
> Signed-off-by: Shaoying Xu <shaoyi@xxxxxxxxxx>
> ---
> The excessive complexity in memory.stat reporting was fixed in v4.16, but
> the fix does not appear to have made it into 4.14 stable. Backporting
> this patch hits a small conflict with commit
> 17ffa29c355658c8e9b19f56cbf0388500ca7905 in
> free_mem_cgroup_per_node_info() in mm/memcontrol.c, which can be
> resolved with a contextual fix.
> 
>  include/linux/memcontrol.h |  96 +++++++++++++++++++++++++++---------------
>  mm/memcontrol.c            | 101 +++++++++++++++++++++++----------------------
>  2 files changed, 113 insertions(+), 84 deletions(-)

This patch does not apply to the 4.14.y tree. Please fix it up and
resend it if you wish to see it applied there.

thanks,

greg k-h