On Thu, Mar 07, 2019 at 03:00:33PM -0800, Roman Gushchin wrote:
> Spill percpu stats and events data to corresponding before releasing
> percpu memory.
>
> Although per-cpu stats are never exactly precise, dropping them on
> floor regularly may lead to an accumulation of an error. So, it's
> safer to sync them before releasing.
>
> To minimize the number of atomic updates, let's sum all stats/events
> on all cpus locally, and then make a single update per entry.
>
> Signed-off-by: Roman Gushchin <guro@xxxxxx>
> ---
>  mm/memcontrol.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 52 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 18e863890392..b7eb6fac735e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4612,11 +4612,63 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  	return 0;
>  }
>
> +/*
> + * Spill all per-cpu stats and events into atomics.
> + * Try to minimize the number of atomic writes by gathering data from
> + * all cpus locally, and then make one atomic update.
> + * No locking is required, because no one has an access to
> + * the offlined percpu data.
> + */
> +static void mem_cgroup_spill_offlined_percpu(struct mem_cgroup *memcg)
> +{
> +	struct memcg_vmstats_percpu __percpu *vmstats_percpu;
> +	struct lruvec_stat __percpu *lruvec_stat_cpu;
> +	struct mem_cgroup_per_node *pn;
> +	int cpu, i;
> +	long x;
> +
> +	vmstats_percpu = memcg->vmstats_percpu_offlined;
> +
> +	for (i = 0; i < MEMCG_NR_STAT; i++) {
> +		int nid;
> +
> +		x = 0;
> +		for_each_possible_cpu(cpu)
> +			x += per_cpu(vmstats_percpu->stat[i], cpu);
> +		if (x)
> +			atomic_long_add(x, &memcg->vmstats[i]);
> +
> +		if (i >= NR_VM_NODE_STAT_ITEMS)
> +			continue;
> +
> +		for_each_node(nid) {
> +			pn = mem_cgroup_nodeinfo(memcg, nid);
> +			lruvec_stat_cpu = pn->lruvec_stat_cpu_offlined;
> +
> +			x = 0;
> +			for_each_possible_cpu(cpu)
> +				x += per_cpu(lruvec_stat_cpu->count[i], cpu);
> +			if (x)
> +				atomic_long_add(x, &pn->lruvec_stat[i]);
> +		}
> +	}
> +
> +	for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
> +		x = 0;
> +		for_each_possible_cpu(cpu)
> +			x += per_cpu(vmstats_percpu->events[i], cpu);
> +		if (x)
> +			atomic_long_add(x, &memcg->vmevents[i]);
> +	}

This looks good, but couldn't this be merged with the cpu offlining?
It seems to be exactly the same code, except for the nesting of the
for_each_possible_cpu() iteration here.

This could be a function that takes a CPU argument and then iterates
the cgroups and stat items to collect and spill the counters of that
specified CPU; offlining would call it once, and this spill code here
would call it for_each_possible_cpu().

We shouldn't need the atomicity of this_cpu_xchg() during hotunplug;
the scheduler isn't even active on that CPU anymore when it's called.
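
Something along these lines is what I have in mind. Completely untested,
and the helper name plus the assumption that the live percpu pointers are
memcg->vmstats_percpu and pn->lruvec_stat_cpu are just for illustration:

/*
 * Untested sketch: flush one CPU's worth of percpu stats and events
 * into the atomic counters.  memcg_flush_percpu_stats() is a made-up
 * name.  It reads the live percpu pointers, so the css release path
 * would either pass in the _offlined copies or call this before
 * swapping the pointers out.
 */
static void memcg_flush_percpu_stats(struct mem_cgroup *memcg, int cpu)
{
	struct mem_cgroup_per_node *pn;
	int i, nid;
	long x;

	for (i = 0; i < MEMCG_NR_STAT; i++) {
		/* plain read + clear: no this_cpu_xchg() needed, nothing
		 * runs on an offlined/dying CPU's percpu data anymore */
		x = per_cpu(memcg->vmstats_percpu->stat[i], cpu);
		if (x) {
			per_cpu(memcg->vmstats_percpu->stat[i], cpu) = 0;
			atomic_long_add(x, &memcg->vmstats[i]);
		}

		if (i >= NR_VM_NODE_STAT_ITEMS)
			continue;

		for_each_node(nid) {
			pn = mem_cgroup_nodeinfo(memcg, nid);
			x = per_cpu(pn->lruvec_stat_cpu->count[i], cpu);
			if (x) {
				per_cpu(pn->lruvec_stat_cpu->count[i], cpu) = 0;
				atomic_long_add(x, &pn->lruvec_stat[i]);
			}
		}
	}

	for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
		x = per_cpu(memcg->vmstats_percpu->events[i], cpu);
		if (x) {
			per_cpu(memcg->vmstats_percpu->events[i], cpu) = 0;
			atomic_long_add(x, &memcg->vmevents[i]);
		}
	}
}

The hotunplug handler would then do something like
for_each_mem_cgroup(memcg) memcg_flush_percpu_stats(memcg, cpu) for the
dead CPU, and your spill path here would do
for_each_possible_cpu(cpu) memcg_flush_percpu_stats(memcg, cpu) for the
one cgroup whose percpu data is about to be released.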