On Thu, Apr 22, 2021 at 12:58:52PM -0400, Waiman Long wrote:
> On 4/21/21 7:28 PM, Roman Gushchin wrote:
> > On Tue, Apr 20, 2021 at 03:29:05PM -0400, Waiman Long wrote:
> > > Before the new slab memory controller with per-object byte charging,
> > > charging and vmstat data updates happened only when new slab pages
> > > were allocated or freed. Now they are done on every kmem_cache_alloc()
> > > and kmem_cache_free(). This causes additional overhead for workloads
> > > that generate a lot of alloc and free calls.
> > >
> > > The memcg_stock_pcp is used to cache byte charges for a specific
> > > obj_cgroup to reduce that overhead. To reduce it further, this patch
> > > caches the vmstat data in the memcg_stock_pcp structure as well,
> > > until a page size worth of updates has accumulated or another cached
> > > item changes. Caching the vmstat data in the per-cpu stock replaces
> > > two writes to non-hot cachelines (the memcg-specific and the
> > > memcg-lruvec-specific vmstat data) with a write to a hot local stock
> > > cacheline.
> > >
> > > On a 2-socket Cascade Lake server with instrumentation enabled and
> > > this patch applied, about 20% (634400 out of 3243830) of the calls
> > > to mod_objcg_state() led to an actual call to __mod_objcg_state()
> > > after initial boot. During a parallel kernel build, the figure was
> > > about 17% (24329265 out of 142512465). So caching the vmstat data
> > > reduces the number of calls to __mod_objcg_state() by more than 80%.
> > >
> > > Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
> > > Reviewed-by: Shakeel Butt <shakeelb@xxxxxxxxxx>
> > > ---
> > >  mm/memcontrol.c | 86 +++++++++++++++++++++++++++++++++++++++++++++++--
> > >  1 file changed, 83 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 7cd7187a017c..292b4783b1a7 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -782,8 +782,9 @@ void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val)
> > >  	rcu_read_unlock();
> > >  }
> > >
> > > -void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
> > > -		     enum node_stat_item idx, int nr)
> > > +static inline void mod_objcg_mlstate(struct obj_cgroup *objcg,
> > > +				     struct pglist_data *pgdat,
> > > +				     enum node_stat_item idx, int nr)
> > >  {
> > >  	struct mem_cgroup *memcg;
> > >  	struct lruvec *lruvec;
> > >
> > > @@ -791,7 +792,7 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
> > >  	rcu_read_lock();
> > >  	memcg = obj_cgroup_memcg(objcg);
> > >  	lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > > -	mod_memcg_lruvec_state(lruvec, idx, nr);
> > > +	__mod_memcg_lruvec_state(lruvec, idx, nr);
> > >  	rcu_read_unlock();
> > >  }
> > >
> > > @@ -2059,7 +2060,10 @@ struct memcg_stock_pcp {
> > >
> > >  #ifdef CONFIG_MEMCG_KMEM
> > >  	struct obj_cgroup *cached_objcg;
> > > +	struct pglist_data *cached_pgdat;
> >
> > I wonder if we want to have per-node counters instead?
> > That would complicate the initialization of pcp stocks a bit,
> > but might shave off some additional cpu time.
> > But we can do it later too.
>
> A per-node counter will certainly complicate the code and reduce the
> performance benefit too.

Hm, why? We wouldn't need to flush the stock if the release happens on
some other cpu not matching the current pgdat.

> I got a pretty good hit rate of 80%+ with the current code on a
> 2-socket system. The hit rate will probably drop when there are more
> nodes. I will do some more investigation, but it will not be for this
> patchset.

Works for me! Thanks!
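For readers following the thread, here is a rough user-space model of the
batching scheme being discussed. It is only an illustrative sketch under
simplifying assumptions: the struct layout and helper names only loosely
mirror the patch, there is a single stock instead of per-cpu ones, and the
"slow path" just prints instead of updating the memcg/lruvec vmstat
counters.

#include <stdio.h>

#define PAGE_SIZE 4096

/* Stand-ins for the kernel's obj_cgroup and pglist_data (hypothetical). */
struct obj_cgroup { const char *name; };
struct pglist_data { int node_id; };

/*
 * Simplified stock: one cached (objcg, pgdat) pair plus a byte-sized
 * vmstat delta that is flushed lazily.
 */
struct memcg_stock_pcp {
	struct obj_cgroup *cached_objcg;
	struct pglist_data *cached_pgdat;
	int vmstat_bytes;
};

static struct memcg_stock_pcp stock;	/* one CPU's stock in this model */
static long flush_calls;		/* how often the slow path runs */

/* Slow path: in the kernel this would update memcg and lruvec counters. */
static void flush_vmstat(struct obj_cgroup *objcg, struct pglist_data *pgdat,
			 int nr)
{
	flush_calls++;
	printf("flush %d bytes for %s on node %d\n", nr, objcg->name,
	       pgdat->node_id);
}

/*
 * Fast path: accumulate into the stock; flush only when the cached
 * objcg/pgdat changes or a page worth of updates has built up.
 */
static void mod_objcg_state(struct obj_cgroup *objcg,
			    struct pglist_data *pgdat, int nr)
{
	if (stock.cached_objcg != objcg || stock.cached_pgdat != pgdat) {
		if (stock.vmstat_bytes)
			flush_vmstat(stock.cached_objcg, stock.cached_pgdat,
				     stock.vmstat_bytes);
		stock.cached_objcg = objcg;
		stock.cached_pgdat = pgdat;
		stock.vmstat_bytes = 0;
	}

	stock.vmstat_bytes += nr;
	if (stock.vmstat_bytes >= PAGE_SIZE || stock.vmstat_bytes <= -PAGE_SIZE) {
		flush_vmstat(objcg, pgdat, stock.vmstat_bytes);
		stock.vmstat_bytes = 0;
	}
}

int main(void)
{
	struct obj_cgroup cg = { "demo-cgroup" };
	struct pglist_data node0 = { 0 };

	/* 128 updates of 64 bytes each: only every 64th call flushes. */
	for (int i = 0; i < 128; i++)
		mod_objcg_state(&cg, &node0, 64);

	printf("128 updates -> %ld flushes\n", flush_calls);
	return 0;
}

The point of the sketch is only that most updates touch the hot local
stock cacheline and just occasionally spill into the slower counter
update; the actual hit rate depends on how often the objcg/pgdat pair
changes, which is what the per-node-counter question above is about.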