The patch titled
     Subject: percpu-internal/pcpu_chunk: re-layout pcpu_chunk structure to reduce false sharing
has been added to the -mm mm-unstable branch.  Its filename is
     percpu-internal-pcpu_chunk-re-layout-pcpu_chunk-structure-to-reduce-false-sharing.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/percpu-internal-pcpu_chunk-re-layout-pcpu_chunk-structure-to-reduce-false-sharing.patch

This patch will later appear in the mm-unstable branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

------------------------------------------------------
From: Yu Ma <yu.ma@xxxxxxxxx>
Subject: percpu-internal/pcpu_chunk: re-layout pcpu_chunk structure to reduce false sharing
Date: Fri, 9 Jun 2023 23:07:30 -0400

When running the UnixBench/Execl throughput case, false sharing is observed
due to frequent reads of base_addr and writes to free_bytes and chunk_md.

UnixBench/Execl represents a class of workload where bash scripts are
spawned frequently to do short jobs.  It frequently makes the execl system
call, and execl calls mm_init to initialize the mm_struct of the process.
mm_init calls __percpu_counter_init to initialize the percpu_counters.
pcpu_alloc is then called and reads the base_addr of pcpu_chunk for memory
allocation.  Inside pcpu_alloc, pcpu_alloc_area is called to allocate
memory from a specified chunk.
This function updates "free_bytes" and "chunk_md" to record the remaining
free bytes and other metadata for this chunk.  Correspondingly,
pcpu_free_area also updates these 2 members when freeing memory.
Call trace from perf is as below:

+   57.15%  0.01%  execl   [kernel.kallsyms] [k] __percpu_counter_init
+   57.13%  0.91%  execl   [kernel.kallsyms] [k] pcpu_alloc
-   55.27% 54.51%  execl   [kernel.kallsyms] [k] osq_lock
   - 53.54% 0x654278696e552f34
        main
        __execve
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        __x64_sys_execve
        do_execveat_common.isra.47
        alloc_bprm
        mm_init
        __percpu_counter_init
        pcpu_alloc
      - __mutex_lock.isra.17

In the current pcpu_chunk layout, `base_addr' is in the same cache line as
`free_bytes' and `chunk_md', occupying the last 8 bytes of that line.
This patch moves `bound_map' up to where `base_addr' was, so that
`base_addr' is placed in a new cache line.

With this change, on an Intel Sapphire Rapids 112C/224T platform, based on
v6.4-rc4, the 160 parallel score improves by 24%.

Link: https://lkml.kernel.org/r/20230610030730.110074-1-yu.ma@xxxxxxxxx
Signed-off-by: Yu Ma <yu.ma@xxxxxxxxx>
Reviewed-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
Cc: Dave Hansen <dave.hansen@xxxxxxxxx>
Cc: Dennis Zhou <dennis@xxxxxxxxxx>
Cc: Liam R. Howlett <Liam.Howlett@xxxxxxxxxx>
Cc: Shakeel Butt <shakeelb@xxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/percpu-internal.h |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

--- a/mm/percpu-internal.h~percpu-internal-pcpu_chunk-re-layout-pcpu_chunk-structure-to-reduce-false-sharing
+++ a/mm/percpu-internal.h
@@ -41,10 +41,17 @@ struct pcpu_chunk {
 	struct list_head	list;		/* linked to pcpu_slot lists */
 	int			free_bytes;	/* free bytes in the chunk */
 	struct pcpu_block_md	chunk_md;
-	void			*base_addr;	/* base address of this chunk */
+	unsigned long		*bound_map;	/* boundary map */
+
+	/*
+	 * base_addr is the base address of this chunk.
+	 * To reduce false sharing, current layout is optimized to make sure
+	 * base_addr locate in the different cacheline with free_bytes and
+	 * chunk_md.
+	 */
+	void			*base_addr ____cacheline_aligned_in_smp;

 	unsigned long		*alloc_map;	/* allocation map */
-	unsigned long		*bound_map;	/* boundary map */
 	struct pcpu_block_md	*md_blocks;	/* metadata blocks */

 	void			*data;		/* chunk data */
_

Patches currently in -mm which might be from yu.ma@xxxxxxxxx are

percpu-internal-pcpu_chunk-re-layout-pcpu_chunk-structure-to-reduce-false-sharing.patch