The rstat_cpu and rstat_css_list fields of the cgroup structure are
read-mostly. However, they may share a cacheline with the subsequent
rstat_flush_next and *bstat fields, which can be updated frequently.
That slows down cgroup_rstat_cpu(), which is called frequently in the
rstat code. Add a CACHELINE_PADDING() line between the two groups of
fields to avoid false cacheline sharing.

A parallel kernel build on a 2-socket x86-64 server was used as the
benchmark for measuring the lock hold time. Below is the lock hold time
frequency distribution before and after the patch:

  Run time    Before patch    After patch
  --------    ------------    -----------
  0-01 us      14,594,545     15,484,707
  01-05 us        439,926        207,382
  05-10 us          5,960          3,174
  10-15 us          3,543          3,006
  15-20 us          1,397          1,066
  20-25 us             25             15
  25-30 us             12             10

It can be seen that the patch further pushes the lock hold time towards
the lower end.

Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
---
 include/linux/cgroup-defs.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ff4b4c590f32..a4adc0580135 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -491,6 +491,13 @@ struct cgroup {
 	struct cgroup_rstat_cpu __percpu *rstat_cpu;
 	struct list_head rstat_css_list;
 
+	/*
+	 * Add padding to separate the read mostly rstat_cpu and
+	 * rstat_css_list into a different cacheline from the following
+	 * rstat_flush_next and *bstat fields which can have frequent updates.
+	 */
+	CACHELINE_PADDING(_pad_);
+
 	/*
 	 * A singly-linked list of cgroup structures to be rstat flushed.
 	 * This is a scratch field to be used exclusively by
-- 
2.39.3
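
For illustration only, the layout effect of the padding can be sketched
in userspace C. This is not the kernel code: the demo_* names are
hypothetical, the 64-byte cacheline size is an assumption, and C11
alignas on the first write-hot member stands in for the kernel's
CACHELINE_PADDING() macro. Offsets are relative to the start of the
structure, so the check assumes the object itself starts on a cacheline
boundary.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Assumed cacheline size; real hardware may differ. */
#define DEMO_CACHELINE_SIZE 64

/* Hypothetical stand-in for the layout before the patch. */
struct demo_unpadded {
	void *rstat_cpu;            /* read-mostly */
	void *rstat_css_list[2];    /* read-mostly (list_head-sized) */
	void *rstat_flush_next;     /* frequently written */
	unsigned long bstat[4];     /* frequently written */
};

/* Hypothetical stand-in for the layout after the patch. */
struct demo_padded {
	void *rstat_cpu;
	void *rstat_css_list[2];
	/* alignas pushes the write-hot fields onto a new cacheline. */
	alignas(DEMO_CACHELINE_SIZE) void *rstat_flush_next;
	unsigned long bstat[4];
};

/* Return 1 if two byte offsets fall into the same cacheline. */
static int demo_same_cacheline(size_t a, size_t b)
{
	return a / DEMO_CACHELINE_SIZE == b / DEMO_CACHELINE_SIZE;
}

/* Do the read-mostly and write-hot fields share a line without padding? */
static int demo_unpadded_shares(void)
{
	return demo_same_cacheline(
		offsetof(struct demo_unpadded, rstat_css_list),
		offsetof(struct demo_unpadded, rstat_flush_next));
}

/* Same question for the padded layout. */
static int demo_padded_shares(void)
{
	return demo_same_cacheline(
		offsetof(struct demo_padded, rstat_css_list),
		offsetof(struct demo_padded, rstat_flush_next));
}

/* Compile-time check that the write-hot group starts a new cacheline. */
static_assert(offsetof(struct demo_padded, rstat_flush_next)
	      % DEMO_CACHELINE_SIZE == 0,
	      "write-hot fields must start on a cacheline boundary");
```

Calling demo_unpadded_shares() and demo_padded_shares() from a small
test program shows whether the two groups of fields land on the same
line in each layout. The same offsetof() style of check can be applied
to a real structure, for example with pahole, to confirm where a
padding member actually places the following fields.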