cr is the combined rate of all updates (it corresponds to stats_updates in
memcg_rstat_updated(); max_cr is the change rate per counter):

  cr = Σ cr_i <= nr_counters * max_cr

By combining these two we get the shortest time between flushes:

  cr * Δt <= nr_counters * max_cr * Δt
  nr_cpus * MEMCG_CHARGE_BATCH <= nr_counters * max_cr * Δt
  Δt >= (nr_cpus * MEMCG_CHARGE_BATCH) / (nr_counters * max_cr)

We are interested in R_amort = flush_work / Δt, which is

  R_amort <= flush_work * nr_counters * max_cr / (nr_cpus * MEMCG_CHARGE_BATCH)

and with flush_work being O(nr_cpus * nr_cgroups(subtree) * nr_counters):

  R_amort: O( nr_cpus * nr_cgroups(subtree) * nr_counters * (nr_counters * max_cr) / (nr_cpus * MEMCG_CHARGE_BATCH) )
  R_amort: O( nr_cgroups(subtree) * nr_counters^2 * max_cr / MEMCG_CHARGE_BATCH )

The square looks interesting given there are already tens of counters.
(As the data from Ivan have shown, we can hardly restore the pre-rstat
performance on the read side even with a mere mod_delayed_work().)

This is what you partially solved with the introduction of NR_MEMCG_EVENTS,
but stats_updates was still the sum of all events, so the flush might still
have been triggered too frequently.

Maybe a better long-term approach would be to split the counters into
accurate and approximate ones and reflect that in the error estimator
stats_updates. Or some other optimization of mem_cgroup_css_rstat_flush().
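
To illustrate what I mean by the splitting, here is a minimal userspace
sketch (not the actual memcg_rstat_updated() logic; the NR_ACCURATE split,
the threshold value and all names below are made up for illustration):
only updates to the "accurate" counters feed the error estimator that
triggers a flush, while the "approximate" ones merely accumulate.

/*
 * Toy model: accurate counters drive stats_updates and hence flushes,
 * approximate counters are folded in only when a flush happens anyway.
 */
#include <stdio.h>

#define NR_COUNTERS	64
#define NR_ACCURATE	8	/* counters whose error we bound tightly */
#define FLUSH_THRESHOLD	64	/* stands in for nr_cpus * MEMCG_CHARGE_BATCH */

static long counters[NR_COUNTERS];	/* pending (unflushed) deltas */
static long stats_updates;		/* error estimate driving flushes */

static void flush(void)
{
	/* model of mem_cgroup_css_rstat_flush(): fold all pending deltas */
	for (int i = 0; i < NR_COUNTERS; i++)
		counters[i] = 0;
	stats_updates = 0;
	puts("flush");
}

static void counter_updated(int idx, long val)
{
	counters[idx] += val;

	/* only accurate counters contribute to the flush trigger */
	if (idx < NR_ACCURATE) {
		stats_updates += (val > 0) ? val : -val;
		if (stats_updates > FLUSH_THRESHOLD)
			flush();
	}
}

int main(void)
{
	/* heavy churn on an approximate counter never forces a flush ... */
	for (int i = 0; i < 1000; i++)
		counter_updated(NR_ACCURATE, 1);
	/* ... while the same churn on an accurate counter does */
	for (int i = 0; i < 1000; i++)
		counter_updated(0, 1);
	return 0;
}

In the estimate above this effectively replaces nr_counters with
nr_accurate in the cr bound, at the price of a larger (but bounded by the
flush period) error on the approximate counters.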