The quilt patch titled Subject: percpu-internal/pcpu_chunk: re-layout pcpu_chunk structure to reduce false sharing has been removed from the -mm tree. Its filename was percpu-internal-pcpu_chunk-re-layout-pcpu_chunk-structure-to-reduce-false-sharing.patch This patch was dropped because it was merged into the mm-stable branch of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm ------------------------------------------------------ From: Yu Ma <yu.ma@xxxxxxxxx> Subject: percpu-internal/pcpu_chunk: re-layout pcpu_chunk structure to reduce false sharing Date: Fri, 9 Jun 2023 23:07:30 -0400 When running UnixBench/Execl throughput case, false sharing is observed due to frequent read on base_addr and write on free_bytes, chunk_md. UnixBench/Execl represents a class of workload where bash scripts are spawned frequently to do some short jobs. It will do system call on execl frequently, and execl will call mm_init to initialize mm_struct of the process. mm_init will call __percpu_counter_init for percpu_counters initialization. Then pcpu_alloc is called to read the base_addr of pcpu_chunk for memory allocation. Inside pcpu_alloc, it will call pcpu_alloc_area to allocate memory from a specified chunk. This function will update "free_bytes" and "chunk_md" to record the rest free bytes and other meta data for this chunk. Correspondingly, pcpu_free_area will also update these 2 members when free memory. Call trace from perf is as below: + 57.15% 0.01% execl [kernel.kallsyms] [k] __percpu_counter_init + 57.13% 0.91% execl [kernel.kallsyms] [k] pcpu_alloc - 55.27% 54.51% execl [kernel.kallsyms] [k] osq_lock - 53.54% 0x654278696e552f34 main __execve entry_SYSCALL_64_after_hwframe do_syscall_64 __x64_sys_execve do_execveat_common.isra.47 alloc_bprm mm_init __percpu_counter_init pcpu_alloc - __mutex_lock.isra.17 In current pcpu_chunk layout, `base_addr' is in the same cache line with `free_bytes' and `chunk_md', and `base_addr' is at the last 8 bytes. This patch moves `bound_map' up to `base_addr', to let `base_addr' locate in a new cacheline. With this change, on Intel Sapphire Rapids 112C/224T platform, based on v6.4-rc4, the 160 parallel score improves by 24%. The pcpu_chunk struct is a backing data structure per chunk, so the additional memory should not be dramatic. A chunk covers ballpark between 64kb and 512kb memory depending on some config and boot time stuff, so I believe the additional memory used here is nominal at best. Working the #s on my desktop: Percpu: 58624 kB 28 cores -> ~2.1MB of percpu memory. At say ~128KB per chunk -> 33 chunks, generously 40 chunks. Adding alignment might bump the chunk size ~64 bytes, so in total ~2KB of overhead? I believe we can do a little better to avoid eating that full padding, so likely less than that. [dennis@xxxxxxxxxx: changelog details] Link: https://lkml.kernel.org/r/20230610030730.110074-1-yu.ma@xxxxxxxxx Signed-off-by: Yu Ma <yu.ma@xxxxxxxxx> Reviewed-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx> Acked-by: Dennis Zhou <dennis@xxxxxxxxxx> Cc: Dan Williams <dan.j.williams@xxxxxxxxx> Cc: Dave Hansen <dave.hansen@xxxxxxxxx> Cc: Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> Cc: Shakeel Butt <shakeelb@xxxxxxxxxx> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> --- mm/percpu-internal.h | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) --- a/mm/percpu-internal.h~percpu-internal-pcpu_chunk-re-layout-pcpu_chunk-structure-to-reduce-false-sharing +++ a/mm/percpu-internal.h @@ -41,10 +41,17 @@ struct pcpu_chunk { struct list_head list; /* linked to pcpu_slot lists */ int free_bytes; /* free bytes in the chunk */ struct pcpu_block_md chunk_md; - void *base_addr; /* base address of this chunk */ + unsigned long *bound_map; /* boundary map */ + + /* + * base_addr is the base address of this chunk. + * To reduce false sharing, current layout is optimized to make sure + * base_addr locate in the different cacheline with free_bytes and + * chunk_md. + */ + void *base_addr ____cacheline_aligned_in_smp; unsigned long *alloc_map; /* allocation map */ - unsigned long *bound_map; /* boundary map */ struct pcpu_block_md *md_blocks; /* metadata blocks */ void *data; /* chunk data */ _ Patches currently in -mm which might be from yu.ma@xxxxxxxxx are