On Tue 29-12-20 22:35:14, Feng Tang wrote:
> When profiling benchmarks that involve memory cgroups, status
> updates sometimes take quite a few CPU cycles. Currently
> MEMCG_CHARGE_BATCH is used for both charging and statistics/events
> updating, and is set to 32, which may be good for the accuracy of
> memcg charging, but is too small for stats updates, causing
> concurrent access to the global stats data instead of the per-cpu
> copies.
>
> So handle them differently, by adding a new, bigger batch number
> for stats updates, while keeping the current value for charging
> (though the comment in memcontrol.h suggests considering a bigger
> value there too).
>
> The new batch is set to 512, which matches a 2MB huge page (512
> pages), as the check logic is essentially:
>
> 	if (x <= BATCH), then skip updating the global data
>
> so it will save 50% of the global data updates for 2MB pages.

Please note that there is a patch set to change THP accounting to be
per-page based
(http://lkml.kernel.org/r/20201228164110.2838-1-songmuchun@xxxxxxxxxxxxx)
which will change the current behavior already.

Our batch size (MEMCG_CHARGE_BATCH) is quite arbitrary. I do not think
anybody has ever seriously benchmarked the effect of its size. I am not
opposed to changing it, but I have to say I dislike letting the charging
batch diverge from the counter batch in this respect. That just opens
the door to weird effects IMO. The two are quite closely related
already.

> Following is some performance data with the patch, against v5.11-rc1,
> on several generations of Xeon platforms. Each category below has
> several subcases run on different platforms, and only the worst and
> best scores are listed:
>
> fio:                       +2.0% ~  +6.8%
> will-it-scale/malloc:      -0.9% ~  +6.2%
> will-it-scale/page_fault1: no change
> will-it-scale/page_fault2: +13.7% ~ +26.2%
>
> One thought is that the batch could be calculated dynamically from
> the memcg limit and the number of CPUs; another is to add periodic
> syncing of the data for accuracy, similar to vmstat, as suggested
> by Ying.
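For reference, the check logic quoted above is the usual per-cpu
batching pattern. A minimal userspace sketch of it follows; the names
(UPDATE_BATCH, global_stat, percpu_stat, mod_stat_batched) are
illustrative stand-ins, not the actual kernel code:

#include <stdatomic.h>
#include <stdlib.h>

#define UPDATE_BATCH 512	/* stand-in for MEMCG_UPDATE_BATCH */

static atomic_long global_stat;		/* shared counter, cache-line contended */
static _Thread_local long percpu_stat;	/* stand-in for the per-cpu counter */

/* Accumulate locally; spill to the shared counter only past the batch. */
static void mod_stat_batched(long val)
{
	long x = percpu_stat + val;

	if (labs(x) > UPDATE_BATCH) {
		atomic_fetch_add(&global_stat, x);	/* rare slow path */
		x = 0;
	}
	percpu_stat = x;	/* common fast path: no shared write */
}

With a batch of 32, every 512-page THP charge overflows the per-cpu
counter and hits the shared atomic; with a batch of 512 it does so only
on every other update, which is where the 50% figure above comes from.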
>
> Signed-off-by: Feng Tang <feng.tang@xxxxxxxxx>
> Cc: Shakeel Butt <shakeelb@xxxxxxxxxx>
> Cc: Roman Gushchin <guro@xxxxxx>
> ---
>  include/linux/memcontrol.h | 2 ++
>  mm/memcontrol.c            | 6 +++---
>  2 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d827bd7..d58bf28 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -335,6 +335,8 @@ struct mem_cgroup {
>   */
>  #define MEMCG_CHARGE_BATCH 32U
>
> +#define MEMCG_UPDATE_BATCH 512U
> +
>  extern struct mem_cgroup *root_mem_cgroup;
>
>  enum page_memcg_data_flags {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 605f671..01ca85d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -760,7 +760,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
>   */
>  void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
>  {
> -	long x, threshold = MEMCG_CHARGE_BATCH;
> +	long x, threshold = MEMCG_UPDATE_BATCH;
>
>  	if (mem_cgroup_disabled())
>  		return;
> @@ -800,7 +800,7 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
>  {
>  	struct mem_cgroup_per_node *pn;
>  	struct mem_cgroup *memcg;
> -	long x, threshold = MEMCG_CHARGE_BATCH;
> +	long x, threshold = MEMCG_UPDATE_BATCH;
>
>  	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
>  	memcg = pn->memcg;
> @@ -905,7 +905,7 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
>  		return;
>
>  	x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]);
> -	if (unlikely(x > MEMCG_CHARGE_BATCH)) {
> +	if (unlikely(x > MEMCG_UPDATE_BATCH)) {
>  		struct mem_cgroup *mi;
>
>  		/*
> --
> 2.7.4

-- 
Michal Hocko
SUSE Labs