From: ZhangPeng <zhangpeng362@xxxxxxxxxx>

Since commit f1a7941243c1 ("mm: convert mm's rss stats into
percpu_counter"), rss_stats have been converted to percpu_counter,
which changes the error margin from (nr_threads * 64) to approximately
(nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a
performance regression on fork/exec/shell. Even after commit
14ef95be6f55 ("kernel/fork: group allocation/free of per-cpu counters
for mm struct"), fork/exec/shell performance is still worse than on
previous kernel versions.

To mitigate the regression, we use lazy_percpu_counter[1] to delay the
allocation of percpu memory for rss_stats. In lmbench tests, this
improves fork_proc/exec_proc/shell_proc by 3% to 6% over base.

The test results are as follows (lower is better; percentages are the
improvement relative to base):

             base        base+revert        base+lazy_percpu_counter
fork_proc    427.4ms     394.1ms  (7.8%)    413.9ms  (3.2%)
exec_proc    2205.1ms    2042.2ms (7.4%)    2072.0ms (6.0%)
shell_proc   3180.9ms    2963.7ms (6.8%)    3010.7ms (5.4%)

This solution has not been fully evaluated and tested. The main purpose
of this RFC patch series is to get the community's opinion on the
approach.

[1] https://lore.kernel.org/linux-iommu/20230501165450.15352-8-surenb@xxxxxxxxxx/

Kent Overstreet (1):
  Lazy percpu counters

ZhangPeng (2):
  lazy_percpu_counter: include struct percpu_counter in struct
    lazy_percpu_counter
  mm: convert mm's rss stats into lazy_percpu_counter

 include/linux/lazy-percpu-counter.h |  88 +++++++++++++++++++
 include/linux/mm.h                  |   8 +-
 include/linux/mm_types.h            |   4 +-
 include/trace/events/kmem.h         |   4 +-
 kernel/fork.c                       |  12 +--
 lib/Makefile                        |   2 +-
 lib/lazy-percpu-counter.c           | 131 ++++++++++++++++++++++++++++
 7 files changed, 232 insertions(+), 17 deletions(-)
 create mode 100644 include/linux/lazy-percpu-counter.h
 create mode 100644 lib/lazy-percpu-counter.c

-- 
2.25.1