On Thu, 8 Aug 2019 21:47:11 +0000 Roman Gushchin <guro@xxxxxx> wrote:

> On Thu, Aug 08, 2019 at 02:21:46PM -0700, Andrew Morton wrote:
> > On Thu, 8 Aug 2019 13:36:04 -0700 Roman Gushchin <guro@xxxxxx> wrote:
> >
> > > I've noticed that the "slab" value in memory.stat is sometimes 0,
> > > even if some children memory cgroups have a non-zero "slab" value.
> > > The following investigation showed that this is the result
> > > of the kmem_cache reparenting in combination with the per-cpu
> > > batching of slab vmstats.
> > >
> > > At offlining, some vmstat values may remain in the percpu cache,
> > > without being propagated upwards through the cgroup hierarchy.  It
> > > means that stats on ancestor levels are lower than the actual values.
> > > Later, when slab pages are released, the precise number of pages is
> > > subtracted on the parent level, making the value negative.  We don't
> > > show negative values; 0 is printed instead.
> > >
> > > To fix this issue, let's flush percpu slab memcg and lruvec stats
> > > on memcg offlining.  This guarantees that numbers on all ancestor
> > > levels are accurate and match the actual number of outstanding
> > > slab pages.
> >
> > Looks expensive.  How frequently can these functions be called?
>
> Once per memcg lifetime.

iirc there are some workloads in which this can be rapid?

> > > +	for_each_node(node)
> > > +		memcg_flush_slab_node_stats(memcg, node);
> >
> > This loops across all possible CPUs once for each possible node.  Ouch.
> >
> > Implementing hotplug handlers in here (which is surprisingly simple)
> > brings this down to num_online_nodes * num_online_cpus which is, I
> > think, potentially vastly better.
>
> Hm, maybe I'm biased because we don't play much with offlining, and
> don't have many NUMA nodes.  What's the real world scenario?  Disabling
> hyperthreading?

I assume it's machines which could take a large number of CPUs but in
fact have few.  I've asked this in response to many patches down the
ages and have never really got a clear answer.  A concern is that if
such machines do exist, it will take a long time for the regression
reports to get to us.  Especially if such machines are rare.

> Idk, given that it happens once per memcg lifetime, and memcg destruction
> isn't cheap anyway, I'm not sure it's worth it.  But if you are, I'm happy
> to add hotplug handlers.

I think it's worth taking a look.  As I mentioned, it can turn out to
be stupidly simple.

> I also thought about merging per-memcg stats and per-memcg-per-node stats
> (the reading part can aggregate over 2? 4? NUMA nodes each time).  That
> will make everything overall cheaper.  But it's a separate topic.
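
For reference, the cost being discussed is num_possible_nodes *
num_possible_cpus: each per-node flush walks every possible CPU's percpu
cache.  A minimal sketch of the flush, assuming the 5.3-era
mem_cgroup_per_node layout (lruvec_stat_cpu and struct lruvec_stat) --
only the helper name and the for_each_node() caller come from the quoted
patch, the body here is illustrative, not the actual implementation:

#include <linux/cpumask.h>
#include <linux/memcontrol.h>

/* Sketch only; not the actual patch body. */
static void memcg_flush_slab_node_stats(struct mem_cgroup *memcg, int node)
{
	struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];
	int cpu, i;

	/*
	 * Walking all *possible* CPUs is what makes this expensive: on a
	 * machine that could take many CPUs but has few online, most
	 * iterations visit percpu areas that can never hold anything.
	 */
	for_each_possible_cpu(cpu) {
		struct lruvec_stat *st = per_cpu_ptr(pn->lruvec_stat_cpu, cpu);

		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
			/* fold st->count[i] into pn and every ancestor */
		}
	}
}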
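
And a sketch of the "surprisingly simple" hotplug route: drain a CPU's
percpu caches when that CPU dies, so the offline-time flush only ever has
to visit online CPUs.  cpuhp_setup_state_nocalls() and
CPUHP_BP_PREPARE_DYN are the real kernel API; the callback name and state
string below are hypothetical:

#include <linux/cpuhotplug.h>
#include <linux/init.h>

/*
 * Hypothetical teardown callback: for a PREPARE-stage state it runs on a
 * surviving CPU after @cpu is dead, so @cpu's percpu caches can be
 * drained without racing against it.
 */
static int memcg_slab_cpu_dead(unsigned int cpu)
{
	/* fold the dead CPU's cached memcg/lruvec deltas into the counters */
	return 0;
}

static int __init memcg_slab_hotplug_init(void)
{
	/* No startup callback: a freshly onlined CPU starts with empty caches. */
	return cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN,
					 "mm/memcg_slab:dead",
					 NULL, memcg_slab_cpu_dead);
}

With dead CPUs guaranteed empty, the offlining flush can iterate
for_each_online_cpu() instead of for_each_possible_cpu(), which is where
the num_online_nodes * num_online_cpus bound comes from.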