On Thu, 8 Aug 2019 21:47:11 +0000 Roman Gushchin <guro@xxxxxx> wrote:

> On Thu, Aug 08, 2019 at 02:21:46PM -0700, Andrew Morton wrote:
> > On Thu, 8 Aug 2019 13:36:04 -0700 Roman Gushchin <guro@xxxxxx> wrote:
> >
> > > I've noticed that the "slab" value in memory.stat is sometimes 0,
> > > even if some children memory cgroups have a non-zero "slab" value.
> > > The following investigation showed that this is the result
> > > of the kmem_cache reparenting in combination with the per-cpu
> > > batching of slab vmstats.
> > >
> > > At offlining, some vmstat values may remain in the percpu cache,
> > > without being propagated upwards through the cgroup hierarchy.  It
> > > means that stats on ancestor levels are lower than the actual values.
> > > Later, when slab pages are released, the precise number of pages is
> > > subtracted on the parent level, making the value negative.  We don't
> > > show negative values; 0 is printed instead.
> > >
> > > To fix this issue, let's flush percpu slab memcg and lruvec stats
> > > on memcg offlining.  This guarantees that numbers on all ancestor
> > > levels are accurate and match the actual number of outstanding
> > > slab pages.
> >
> > Looks expensive.  How frequently can these functions be called?
>
> Once per memcg lifetime.

iirc there are some workloads in which this can be rapid?

> > > +	for_each_node(node)
> > > +		memcg_flush_slab_node_stats(memcg, node);
> >
> > This loops across all possible CPUs once for each possible node.  Ouch.
> >
> > Implementing hotplug handlers in here (which is surprisingly simple)
> > brings this down to num_online_nodes * num_online_cpus which is, I
> > think, potentially vastly better.
>
> Hm, maybe I'm biased because we don't play much with offlining, and
> don't have many NUMA nodes.  What's the real world scenario?  Disabling
> hyperthreading?

I assume it's machines which could take a large number of CPUs but in
fact have few.  I've asked this in response to many patches down the
ages and have never really got a clear answer.  A concern is that if
such machines do exist, it will take a long time for the regression
reports to get to us.  Especially if such machines are rare.

> Idk, given that it happens once per memcg lifetime, and memcg destruction
> isn't cheap anyway, I'm not sure it's worth it.  But if you are, I'm happy
> to add hotplug handlers.

I think it's worth taking a look.  As I mentioned, it can turn out to
be stupidly simple.

> I also thought about merging per-memcg stats and per-memcg-per-node stats
> (the reading part can aggregate over 2? 4? NUMA nodes each time).  That
> will make everything overall cheaper.  But it's a separate topic.
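
For reference, the cost being discussed is num_possible_nodes *
num_possible_cpus: each per-node flush walks every possible CPU's percpu
cache.  A minimal sketch of the flush, assuming the 5.3-era
mem_cgroup_per_node layout (lruvec_stat_cpu and struct lruvec_stat) --
only the helper name and the for_each_node() caller come from the quoted
patch, the body here is illustrative, not the actual implementation:

#include <linux/cpumask.h>
#include <linux/memcontrol.h>

/* Sketch only; not the actual patch body. */
static void memcg_flush_slab_node_stats(struct mem_cgroup *memcg, int node)
{
	struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];
	int cpu, i;

	/*
	 * Walking all *possible* CPUs is what makes this expensive: on a
	 * machine that could take many CPUs but has few online, most
	 * iterations visit percpu areas that can never hold anything.
	 */
	for_each_possible_cpu(cpu) {
		struct lruvec_stat *st = per_cpu_ptr(pn->lruvec_stat_cpu, cpu);

		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
			/* fold st->count[i] into pn and every ancestor */
		}
	}
}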
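
And a sketch of the "surprisingly simple" hotplug route: drain a CPU's
percpu caches when that CPU dies, so the offline-time flush only ever has
to visit online CPUs.  cpuhp_setup_state_nocalls() and
CPUHP_BP_PREPARE_DYN are the real kernel API; the callback name and state
string below are hypothetical:

#include <linux/cpuhotplug.h>
#include <linux/init.h>

/*
 * Hypothetical teardown callback: for a PREPARE-stage state it runs on a
 * surviving CPU after @cpu is dead, so @cpu's percpu caches can be
 * drained without racing against it.
 */
static int memcg_slab_cpu_dead(unsigned int cpu)
{
	/* fold the dead CPU's cached memcg/lruvec deltas into the counters */
	return 0;
}

static int __init memcg_slab_hotplug_init(void)
{
	/* No startup callback: a freshly onlined CPU starts with empty caches. */
	return cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN,
					 "mm/memcg_slab:dead",
					 NULL, memcg_slab_cpu_dead);
}

With dead CPUs guaranteed empty, the offlining flush can iterate
for_each_online_cpu() instead of for_each_possible_cpu(), which is where
the num_online_nodes * num_online_cpus bound comes from.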