Hi Andrew,

> -----Original Message-----
> From: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Sent: Monday, July 3, 2023 5:11 AM
> To: Zhu, Lipeng <lipeng.zhu@xxxxxxxxx>
> Cc: viro@xxxxxxxxxxxxxxxxxx; brauner@xxxxxxxxxx; linux-fsdevel@xxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; Deng, Pan <pan.deng@xxxxxxxxx>;
> Ma, Yu <yu.ma@xxxxxxxxx>; Li, Tianyou <tianyou.li@xxxxxxxxx>; tim.c.chen@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH] fs/address_space: add alignment padding for i_map and
> i_mmap_rwsem to mitigate a false sharing.
>
> On Wed, 28 Jun 2023 18:56:25 +0800 "Zhu, Lipeng" <lipeng.zhu@xxxxxxxxx> wrote:
>
> > When running UnixBench/Shell Scripts, we observed high false sharing
> > for accessing i_mmap against i_mmap_rwsem.
> >
> > UnixBench/Shell Scripts are typical load/execute command test
> > scenarios, the i_mmap will be accessed frequently to insert/remove
> > vma_interval_tree. Meanwhile, the i_mmap_rwsem is frequently loaded.
> > Unfortunately, they are in the same cacheline.
>
> That sounds odd. One would expect these two fields to be used in close
> conjunction, so any sharing might even be beneficial. Can you identify in more
> detail what's actually going on in there?
>

Yes, I'm running UnixBench/Shell Scripts, which concurrently launch->execute->exit
a lot of shell commands. While the workload is running:

1: A lot of processes invoke vma_interval_tree_remove(), which touches 'i_mmap'.
   The call stack:

   ----vma_interval_tree_remove
       |----unlink_file_vma
       |    free_pgtables
       |    |----exit_mmap
       |    |    mmput
       |    |    |----begin_new_exec
       |    |    |    load_elf_binary
       |    |    |    bprm_execve

2: Also, a lot of processes touch 'i_mmap_rwsem' to acquire the semaphore in
   order to access 'i_mmap'.

In the existing 'address_space' layout, 'i_mmap' and 'i_mmap_rwsem' are in the
same cacheline:

struct address_space {
        struct inode *             host;                 /*     0     8 */
        struct xarray              i_pages;              /*     8    16 */
        struct rw_semaphore        invalidate_lock;      /*    24    40 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        gfp_t                      gfp_mask;             /*    64     4 */
        atomic_t                   i_mmap_writable;      /*    68     4 */
        struct rb_root_cached      i_mmap;               /*    72    16 */
        struct rw_semaphore        i_mmap_rwsem;         /*    88    40 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        long unsigned int          nrpages;              /*   128     8 */
        long unsigned int          writeback_index;      /*   136     8 */
        const struct address_space_operations * a_ops;   /*   144     8 */
        long unsigned int          flags;                /*   152     8 */
        errseq_t                   wb_err;               /*   160     4 */
        spinlock_t                 private_lock;         /*   164     4 */
        struct list_head           private_list;         /*   168    16 */
        void *                     private_data;         /*   184     8 */

        /* size: 192, cachelines: 3, members: 15 */
};

The perf c2c result below (after a short sketch of the access pattern) shows
heavy c2c bounce due to this false sharing.
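To make the pattern concrete, the remove side looks roughly like this. It is a
simplified sketch of i_mmap_lock_write() and unlink_file_vma() as I read them
in v6.4 (the __remove_shared_vm_struct() details are trimmed), not a verbatim
copy:

/* include/linux/fs.h */
static inline void i_mmap_lock_write(struct address_space *mapping)
{
        down_write(&mapping->i_mmap_rwsem);     /* hits 'i_mmap_rwsem' */
}

/* mm/mmap.c, reached via free_pgtables() on the exit/exec paths above */
void unlink_file_vma(struct vm_area_struct *vma)
{
        struct file *file = vma->vm_file;

        if (file) {
                struct address_space *mapping = file->f_mapping;

                i_mmap_lock_write(mapping);
                /* rebalances the interval tree rooted at 'i_mmap' */
                vma_interval_tree_remove(vma, &mapping->i_mmap);
                i_mmap_unlock_write(mapping);
        }
}

So the two fields are indeed used together, but not in a friendly way: the task
that owns the semaphore keeps rewriting 'i_mmap', while the tasks queued behind
it spin in rwsem_spin_on_owner()/rwsem_down_write_slowpath() polling the rwsem
words. Because both fields share one cacheline, every tree update invalidates
the line the spinners are reading, which is the bouncing visible below.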
=================================================
      Shared Cache Line Distribution Pareto
=================================================
-------------------------------------------------------------
    0     3729     5791        0        0  0xff19b3818445c740
-------------------------------------------------------------
   3.27%   3.02%   0.00%   0.00%   0x18   0   1  0xffffffffa194403b   604   483   389   692   203  [k] vma_interval_tree_insert   [kernel.kallsyms]  vma_interval_tree_insert+75    0  1
   4.13%   3.63%   0.00%   0.00%   0x20   0   1  0xffffffffa19440a2   553   413   415   962   215  [k] vma_interval_tree_remove   [kernel.kallsyms]  vma_interval_tree_remove+18    0  1
   2.04%   1.35%   0.00%   0.00%   0x28   0   1  0xffffffffa219a1d6  1210   855   460  1229   222  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+678  0  1
   0.62%   1.85%   0.00%   0.00%   0x28   0   1  0xffffffffa219a1bf   762   329   577   527   198  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+655  0  1
   0.48%   0.31%   0.00%   0.00%   0x28   0   1  0xffffffffa219a58c  1677  1476   733  1544   224  [k] down_write                 [kernel.kallsyms]  down_write+28                  0  1
   0.05%   0.07%   0.00%   0.00%   0x28   0   1  0xffffffffa219a21d  1040   819   689    33    27  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+749  0  1
   0.00%   0.05%   0.00%   0.00%   0x28   0   1  0xffffffffa17707db     0  1005   786  1373   223  [k] up_write                   [kernel.kallsyms]  up_write+27                    0  1
   0.00%   0.02%   0.00%   0.00%   0x28   0   1  0xffffffffa219a064     0   233   778    32    30  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+308  0  1
  33.82%  34.10%   0.00%   0.00%   0x30   0   1  0xffffffffa1770945   779   495   534  6011   224  [k] rwsem_spin_on_owner        [kernel.kallsyms]  rwsem_spin_on_owner+53         0  1
  17.06%  15.28%   0.00%   0.00%   0x30   0   1  0xffffffffa1770915   593   438   468  2715   224  [k] rwsem_spin_on_owner        [kernel.kallsyms]  rwsem_spin_on_owner+5          0  1
   3.54%   3.52%   0.00%   0.00%   0x30   0   1  0xffffffffa2199f84   881   601   583  1421   223  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+84   0  1

> > The patch places the i_mmap and i_mmap_rwsem in separate cache lines
> > to avoid this false sharing problem.
> >
> > With this patch, on Intel Sapphire Rapids 2 sockets 112c/224t
> > platform, based on kernel v6.4-rc4, the 224 parallel score is improved
> > ~2.5% for UnixBench/Shell Scripts case. And perf c2c tool shows the
> > false sharing is resolved as expected, the symbol
> > vma_interval_tree_remove disappeared in cache line 0 after this change.
>
> There can be many address_spaces in memory, so a size increase is a concern.
> Is there anything we can do to minimize the cost of this?

Thanks for the reminder about the memory size increase. With the padding,
struct 'address_space' grows to 4 cachelines, i.e. its size increases by ~33%.

So I tried another approach instead: moving 'i_mmap_rwsem' below the 'flags'
field. This does not increase the size of 'address_space' at all, and based on
v6.4.0, on the Intel Sapphire Rapids 112C/224T platform, the UnixBench/Shell
Scripts score improves by ~5.3%. The perf c2c record data (further below, after
a sketch of the reordered layout) shows the false sharing between 'i_mmap' and
'i_mmap_rwsem' is fixed.
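For clarity, the reordered layout I'm testing is roughly the following. This is
a hand-written sketch, not pahole output; the exact offsets depend on the
config, and the patch I send out may differ in minor details:

struct address_space {
        struct inode            *host;
        struct xarray           i_pages;
        struct rw_semaphore     invalidate_lock;
        gfp_t                   gfp_mask;
        atomic_t                i_mmap_writable;
        struct rb_root_cached   i_mmap;
        unsigned long           nrpages;
        unsigned long           writeback_index;
        const struct address_space_operations *a_ops;
        unsigned long           flags;
        struct rw_semaphore     i_mmap_rwsem;   /* moved down, below 'flags' */
        errseq_t                wb_err;
        spinlock_t              private_lock;
        struct list_head        private_list;
        void                    *private_data;
};

The overall size stays at 3 cachelines as in the pahole output above, and the
semaphore moves away from the interval tree root, so the spinning waiters
largely stop hitting the line that the tree updates keep dirtying. The perf c2c
Pareto for the reordered kernel: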
=================================================
      Shared Cache Line Distribution Pareto
=================================================
-------------------------------------------------------------
    0      556      838        0        0  0xff2780d7965d2780
-------------------------------------------------------------
   0.18%   0.60%   0.00%   0.00%   0x8    0   1  0xffffffffafff27b8   503   453   569    14    13  [k] do_dentry_open             [kernel.kallsyms]  do_dentry_open+456             0  1
   0.54%   0.12%   0.00%   0.00%   0x8    0   1  0xffffffffaffc51ac   510   199   428    15    12  [k] hugepage_vma_check         [kernel.kallsyms]  hugepage_vma_check+252         0  1
   1.80%   2.15%   0.00%   0.00%   0x18   0   1  0xffffffffb079a1d6  1778   799   343   215   136  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+678  0  1
   0.54%   1.31%   0.00%   0.00%   0x18   0   1  0xffffffffb079a1bf   547   296   528    91    71  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+655  0  1
   0.72%   0.72%   0.00%   0.00%   0x18   0   1  0xffffffffb079a58c  1479  1534   676   288   163  [k] down_write                 [kernel.kallsyms]  down_write+28                  0  1
   0.00%   0.12%   0.00%   0.00%   0x18   0   1  0xffffffffafd707db     0  2381   744   282   158  [k] up_write                   [kernel.kallsyms]  up_write+27                    0  1
   0.00%   0.12%   0.00%   0.00%   0x18   0   1  0xffffffffb079a064     0   239   518     6     6  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+308  0  1
  46.58%  47.02%   0.00%   0.00%   0x20   0   1  0xffffffffafd70945   704   403   499  1137   219  [k] rwsem_spin_on_owner        [kernel.kallsyms]  rwsem_spin_on_owner+53         0  1
  23.92%  25.78%   0.00%   0.00%   0x20   0   1  0xffffffffafd70915   558   413   500   542   185  [k] rwsem_spin_on_owner        [kernel.kallsyms]  rwsem_spin_on_owner+5          0  1

I will send another patch out and update the commit log.

Best Regards,
Lipeng Zhu