Hi Andrew,

> -----Original Message-----
> From: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Sent: Monday, July 3, 2023 5:11 AM
> To: Zhu, Lipeng <lipeng.zhu@xxxxxxxxx>
> Cc: viro@xxxxxxxxxxxxxxxxxx; brauner@xxxxxxxxxx; linux-fsdevel@xxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; Deng, Pan <pan.deng@xxxxxxxxx>;
> Ma, Yu <yu.ma@xxxxxxxxx>; Li, Tianyou <tianyou.li@xxxxxxxxx>; tim.c.chen@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH] fs/address_space: add alignment padding for i_map and
> i_mmap_rwsem to mitigate a false sharing.
>
> On Wed, 28 Jun 2023 18:56:25 +0800 "Zhu, Lipeng" <lipeng.zhu@xxxxxxxxx> wrote:
>
> > When running UnixBench/Shell Scripts, we observed high false sharing
> > for accessing i_mmap against i_mmap_rwsem.
> >
> > UnixBench/Shell Scripts are typical load/execute command test
> > scenarios, the i_mmap will be accessed frequently to insert/remove
> > vma_interval_tree. Meanwhile, the i_mmap_rwsem is frequently loaded.
> > Unfortunately, they are in the same cacheline.
>
> That sounds odd. One would expect these two fields to be used in close
> conjunction, so any sharing might even be beneficial. Can you identify in more
> detail what's actually going on in there?
>

Yes, I'm running UnixBench/Shell Scripts, which concurrently launch->execute->exit
a lot of shell commands. While the workload is running:

1: A lot of processes invoke vma_interval_tree_remove(), which touches 'i_mmap'.
   The call stack:

   ----vma_interval_tree_remove
       |----unlink_file_vma
       |    free_pgtables
       |    |----exit_mmap
       |    |    mmput
       |    |    |----begin_new_exec
       |    |    |    load_elf_binary
       |    |    |    bprm_execve

2: Also, a lot of processes touch 'i_mmap_rwsem' to acquire the semaphore in
   order to access 'i_mmap'.

In the existing 'address_space' layout, 'i_mmap' and 'i_mmap_rwsem' are in the
same cacheline:

struct address_space {
        struct inode *             host;                 /*     0     8 */
        struct xarray              i_pages;              /*     8    16 */
        struct rw_semaphore        invalidate_lock;      /*    24    40 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        gfp_t                      gfp_mask;             /*    64     4 */
        atomic_t                   i_mmap_writable;      /*    68     4 */
        struct rb_root_cached      i_mmap;               /*    72    16 */
        struct rw_semaphore        i_mmap_rwsem;         /*    88    40 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        long unsigned int          nrpages;              /*   128     8 */
        long unsigned int          writeback_index;      /*   136     8 */
        const struct address_space_operations * a_ops;   /*   144     8 */
        long unsigned int          flags;                /*   152     8 */
        errseq_t                   wb_err;               /*   160     4 */
        spinlock_t                 private_lock;         /*   164     4 */
        struct list_head           private_list;         /*   168    16 */
        void *                     private_data;         /*   184     8 */

        /* size: 192, cachelines: 3, members: 15 */
};

The perf c2c result below (after a short sketch of the access pattern) shows
heavy c2c bounce due to this false sharing.
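To make the pattern concrete, the remove side looks roughly like this. It is a
simplified sketch of i_mmap_lock_write() and unlink_file_vma() as I read them
in v6.4 (the __remove_shared_vm_struct() details are trimmed), not a verbatim
copy:

/* include/linux/fs.h */
static inline void i_mmap_lock_write(struct address_space *mapping)
{
        down_write(&mapping->i_mmap_rwsem);     /* hits 'i_mmap_rwsem' */
}

/* mm/mmap.c, reached via free_pgtables() on the exit/exec paths above */
void unlink_file_vma(struct vm_area_struct *vma)
{
        struct file *file = vma->vm_file;

        if (file) {
                struct address_space *mapping = file->f_mapping;

                i_mmap_lock_write(mapping);
                /* rebalances the interval tree rooted at 'i_mmap' */
                vma_interval_tree_remove(vma, &mapping->i_mmap);
                i_mmap_unlock_write(mapping);
        }
}

So the two fields are indeed used together, but not in a friendly way: the task
that owns the semaphore keeps rewriting 'i_mmap', while the tasks queued behind
it spin in rwsem_spin_on_owner()/rwsem_down_write_slowpath() polling the rwsem
words. Because both fields share one cacheline, every tree update invalidates
the line the spinners are reading, which is the bouncing visible below.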
=================================================
      Shared Cache Line Distribution Pareto
=================================================
-------------------------------------------------------------
    0     3729     5791        0        0  0xff19b3818445c740
-------------------------------------------------------------
   3.27%   3.02%   0.00%   0.00%   0x18   0   1  0xffffffffa194403b   604   483   389   692   203  [k] vma_interval_tree_insert   [kernel.kallsyms]  vma_interval_tree_insert+75    0  1
   4.13%   3.63%   0.00%   0.00%   0x20   0   1  0xffffffffa19440a2   553   413   415   962   215  [k] vma_interval_tree_remove   [kernel.kallsyms]  vma_interval_tree_remove+18    0  1
   2.04%   1.35%   0.00%   0.00%   0x28   0   1  0xffffffffa219a1d6  1210   855   460  1229   222  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+678  0  1
   0.62%   1.85%   0.00%   0.00%   0x28   0   1  0xffffffffa219a1bf   762   329   577   527   198  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+655  0  1
   0.48%   0.31%   0.00%   0.00%   0x28   0   1  0xffffffffa219a58c  1677  1476   733  1544   224  [k] down_write                 [kernel.kallsyms]  down_write+28                  0  1
   0.05%   0.07%   0.00%   0.00%   0x28   0   1  0xffffffffa219a21d  1040   819   689    33    27  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+749  0  1
   0.00%   0.05%   0.00%   0.00%   0x28   0   1  0xffffffffa17707db     0  1005   786  1373   223  [k] up_write                   [kernel.kallsyms]  up_write+27                    0  1
   0.00%   0.02%   0.00%   0.00%   0x28   0   1  0xffffffffa219a064     0   233   778    32    30  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+308  0  1
  33.82%  34.10%   0.00%   0.00%   0x30   0   1  0xffffffffa1770945   779   495   534  6011   224  [k] rwsem_spin_on_owner        [kernel.kallsyms]  rwsem_spin_on_owner+53         0  1
  17.06%  15.28%   0.00%   0.00%   0x30   0   1  0xffffffffa1770915   593   438   468  2715   224  [k] rwsem_spin_on_owner        [kernel.kallsyms]  rwsem_spin_on_owner+5          0  1
   3.54%   3.52%   0.00%   0.00%   0x30   0   1  0xffffffffa2199f84   881   601   583  1421   223  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+84   0  1

> > The patch places the i_mmap and i_mmap_rwsem in separate cache lines
> > to avoid this false sharing problem.
> >
> > With this patch, on Intel Sapphire Rapids 2 sockets 112c/224t
> > platform, based on kernel v6.4-rc4, the 224 parallel score is improved
> > ~2.5% for UnixBench/Shell Scripts case. And perf c2c tool shows the
> > false sharing is resolved as expected, the symbol
> > vma_interval_tree_remove disappeared in cache line 0 after this change.
>
> There can be many address_spaces in memory, so a size increase is a concern.
> Is there anything we can do to minimize the cost of this?

Thanks for the reminder about the memory size increase. With the padding,
struct 'address_space' grows to 4 cachelines, i.e. its size increases by ~33%.

So I tried another approach instead: moving 'i_mmap_rwsem' below the 'flags'
field. This does not increase the size of 'address_space' at all, and based on
v6.4.0, on the Intel Sapphire Rapids 112C/224T platform, the UnixBench/Shell
Scripts score improves by ~5.3%. The perf c2c record data (further below, after
a sketch of the reordered layout) shows the false sharing between 'i_mmap' and
'i_mmap_rwsem' is fixed.
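For clarity, the reordered layout I'm testing is roughly the following. This is
a hand-written sketch, not pahole output; the exact offsets depend on the
config, and the patch I send out may differ in minor details:

struct address_space {
        struct inode            *host;
        struct xarray           i_pages;
        struct rw_semaphore     invalidate_lock;
        gfp_t                   gfp_mask;
        atomic_t                i_mmap_writable;
        struct rb_root_cached   i_mmap;
        unsigned long           nrpages;
        unsigned long           writeback_index;
        const struct address_space_operations *a_ops;
        unsigned long           flags;
        struct rw_semaphore     i_mmap_rwsem;   /* moved down, below 'flags' */
        errseq_t                wb_err;
        spinlock_t              private_lock;
        struct list_head        private_list;
        void                    *private_data;
};

The overall size stays at 3 cachelines as in the pahole output above, and the
semaphore moves away from the interval tree root, so the spinning waiters
largely stop hitting the line that the tree updates keep dirtying. The perf c2c
Pareto for the reordered kernel: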
=================================================
      Shared Cache Line Distribution Pareto
=================================================
-------------------------------------------------------------
    0      556      838        0        0  0xff2780d7965d2780
-------------------------------------------------------------
   0.18%   0.60%   0.00%   0.00%   0x8    0   1  0xffffffffafff27b8   503   453   569    14    13  [k] do_dentry_open             [kernel.kallsyms]  do_dentry_open+456             0  1
   0.54%   0.12%   0.00%   0.00%   0x8    0   1  0xffffffffaffc51ac   510   199   428    15    12  [k] hugepage_vma_check         [kernel.kallsyms]  hugepage_vma_check+252         0  1
   1.80%   2.15%   0.00%   0.00%   0x18   0   1  0xffffffffb079a1d6  1778   799   343   215   136  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+678  0  1
   0.54%   1.31%   0.00%   0.00%   0x18   0   1  0xffffffffb079a1bf   547   296   528    91    71  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+655  0  1
   0.72%   0.72%   0.00%   0.00%   0x18   0   1  0xffffffffb079a58c  1479  1534   676   288   163  [k] down_write                 [kernel.kallsyms]  down_write+28                  0  1
   0.00%   0.12%   0.00%   0.00%   0x18   0   1  0xffffffffafd707db     0  2381   744   282   158  [k] up_write                   [kernel.kallsyms]  up_write+27                    0  1
   0.00%   0.12%   0.00%   0.00%   0x18   0   1  0xffffffffb079a064     0   239   518     6     6  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+308  0  1
  46.58%  47.02%   0.00%   0.00%   0x20   0   1  0xffffffffafd70945   704   403   499  1137   219  [k] rwsem_spin_on_owner        [kernel.kallsyms]  rwsem_spin_on_owner+53         0  1
  23.92%  25.78%   0.00%   0.00%   0x20   0   1  0xffffffffafd70915   558   413   500   542   185  [k] rwsem_spin_on_owner        [kernel.kallsyms]  rwsem_spin_on_owner+5          0  1

I will send another patch out and update the commit log.

Best Regards,
Lipeng Zhu