RE: [PATCH] fs.h: Optimize file struct to prevent false sharing

"Chen, Zhiyin" <zhiyin.chen@xxxxxxxxx> · Fri, 2 Jun 2023 11:01:53 +0000

> > Good questions.
> > perf has been applied to analyze the performance. In the syscall test,
> > the patch can reduce the CPU cycles for filp_close. Besides, the HITM
> > count is also reduced from
> > 43182 to 33146.
> > The test is not restricted to a set of adjacent cores. The numactl
> > command is only used to limit the number of processing cores.
> 
> And, in doing so, it limits the physical locality of the cores being used to 3-18.
> That effectively puts them all on the socket because the test is not using all
> 16 CPUs and the scheduler tends to put all related tasks on the same socket if
> there are enoguh idle CPUs to do so....
> 
> > In most situations, only 8/16/32 CPU
> > cores are used. Performance improvement is still obvious, even if
> > non-adjacent CPU cores are used.
> >
> > No matter what CPU type, cache size, or architecture, false sharing is
> > always negative on performance. And the read mostly members should be
> put together.
> >
> > To further prove the updated layout effectiveness on some other codes
> > path, results of fsdisk, fsbuffer, and fstime are also shown in the new
> commit message.
> >
> > Actually, the new layout can only reduce false sharing in high-contention
> situations.
> > The performance gain is not obvious, if there are some other
> > bottlenecks. For instance, if the cores are spread across multiple
> > sockets, memory access may be the new bottleneck due to NUMA.
> >
> > Here are the results across NUMA nodes. The patch has no negative
> > effect on the performance result.
> >
> > Command:  numactl -C 0-3,16-19,63-66,72-75 ./Run -c 16 syscall fstime
> > fsdisk fsbuffer With Patch Benchmark Run: Thu Jun 01 2023 03:13:52 -
> > 03:23:15
> > 224 CPUs in system; running 16 parallel copies of tests
> >
> > File Copy 1024 bufsize 2000 maxblocks        589958.6 KBps  (30.0 s, 2 samples)
> > File Copy 256 bufsize 500 maxblocks          148779.2 KBps  (30.0 s, 2 samples)
> > File Copy 4096 bufsize 8000 maxblocks       1968023.8 KBps  (30.0 s, 2 samples)
> > System Call Overhead                        5804316.1 lps   (10.0 s, 7 samples)
> 
> Ok, so very small data buffers and file sizes which means the working set of
> the benchmark is almost certainly going to be CPU cache resident.
> 
> This is a known problem with old IO benchmarks on modern CPUs - the data
> set is small enough that it often fits mostly in the CPU cache and so small
> variations in code layout can make 20-30% difference in performance for file
> copy benchmarks. Use a different compiler, or even a different filesystem,
> and the amazing gain goes away and may even result in a regression....
> 
> For example, this has been a known problem with IOZone for at least
> 15 years now, making it largely unreliable as a benchmarking tool.
> Unless, of course, you know exactly what you are doing and can avoid all the
> tests that are susceptible to CPU cache residency variations....
> 
> > System Benchmarks Partial Index              BASELINE       RESULT    INDEX
> > File Copy 1024 bufsize 2000 maxblocks          3960.0     589958.6   1489.8
> > File Copy 256 bufsize 500 maxblocks            1655.0     148779.2    899.0
> > File Copy 4096 bufsize 8000 maxblocks          5800.0    1968023.8   3393.1
> > System Call Overhead                          15000.0    5804316.1   3869.5
> >                                                                    ========
> > System Benchmarks Index Score (Partial Only)                         2047.8
> >
> > Without Patch
> > Benchmark Run: Thu Jun 01 2023 02:11:45 - 02:21:08
> > 224 CPUs in system; running 16 parallel copies of tests
> >
> > File Copy 1024 bufsize 2000 maxblocks        571829.9 KBps  (30.0 s, 2 samples)
> > File Copy 256 bufsize 500 maxblocks          147693.8 KBps  (30.0 s, 2 samples)
> > File Copy 4096 bufsize 8000 maxblocks       1938854.5 KBps  (30.0 s, 2 samples)
> > System Call Overhead                        5791936.3 lps   (10.0 s, 7 samples)
> >
> > System Benchmarks Partial Index              BASELINE       RESULT    INDEX
> > File Copy 1024 bufsize 2000 maxblocks          3960.0     571829.9   1444.0
> > File Copy 256 bufsize 500 maxblocks            1655.0     147693.8    892.4
> > File Copy 4096 bufsize 8000 maxblocks          5800.0    1938854.5   3342.9
> > System Call Overhead                          15000.0    5791936.3   3861.3
> >                                                                    ========
> > System Benchmarks Index Score (Partial Only)                         2019.5
> 
> Yeah, that's what I thought we'd see. i.e. as soon as we go off-socket, there's
> no actual performance change. This generally means there is no difference in
> cacheline sharing across CPUs between the two tests. You can likely use `perf
> stat` to confirm this from the hardware l1/l2/llc data cache miss counters; I'd
> guess they are nearly identical with/without the patch.
> 
> If this truly was a false cacheline sharing situation, the cross-socket test
> results should measurably increase in perofrmance as the frequently
> accessed read-only data cacheline is shared across all CPU caches instead of
> being bounced exclusively between CPUs.
> The amount of l1/l2/llc data cache misses during the workload should reduce
> measurably if this is happening.
> 
> As a technical note, if you want to split data out into different cachelines, you
> should be using annotations like '____cacheline_aligned_in_smp' to align
> structures and variables inside structures to the start of a new cacheline. Not
> only is this self documenting, it will pad the structure appropriately to ensure
> that the update-heavy variable(s) you want isolated to a new cacheline are
> actually on a separate cacheline.  It may be that the manual cacheline
> separation isn't quite good enough to show improvement on multi-socket
> machines, so improving the layout via explicit alignment directives may show
> further improvement.
> 
> FYI, here's an example of how avoiding false sharing should improve
> performance when we go off-socket. Here's a comparison of the same 16-
> way workload, one on a 2x8p dual socket machine (machine A), the other
> running on a single 16p CPU core (machine B). The workload used 99% of all
> available CPU doing bulk file removal.
> 
> commit b0dff466c00975a3e3ec97e6b0266bfd3e4805d6
> Author: Dave Chinner <mailto:dchinner@xxxxxxxxxx>
> Date:   Wed May 20 13:17:11 2020 -0700
> 
>     xfs: separate read-only variables in struct xfs_mount
> 
>     Seeing massive cpu usage from xfs_agino_range() on one machine;
>     instruction level profiles look similar to another machine running
>     the same workload, only one machine is consuming 10x as much CPU as
>     the other and going much slower. The only real difference between
>     the two machines is core count per socket. Both are running
>     identical 16p/16GB virtual machine configurations
> 
>     Machine A:
> 
>       25.83%  [k] xfs_agino_range
>       12.68%  [k] __xfs_dir3_data_check
>        6.95%  [k] xfs_verify_ino
>        6.78%  [k] xfs_dir2_data_entry_tag_p
>        3.56%  [k] xfs_buf_find
>        2.31%  [k] xfs_verify_dir_ino
>        2.02%  [k] xfs_dabuf_map.constprop.0
>        1.65%  [k] xfs_ag_block_count
> 
>     And takes around 13 minutes to remove 50 million inodes.
> 
>     Machine B:
> 
>       13.90%  [k] __pv_queued_spin_lock_slowpath
>        3.76%  [k] do_raw_spin_lock
>        2.83%  [k] xfs_dir3_leaf_check_int
>        2.75%  [k] xfs_agino_range
>        2.51%  [k] __raw_callee_save___pv_queued_spin_unlock
>        2.18%  [k] __xfs_dir3_data_check
>        2.02%  [k] xfs_log_commit_cil
> 
>     And takes around 5m30s to remove 50 million inodes.
> 
>     Suspect is cacheline contention on m_sectbb_log which is used in one
>     of the macros in xfs_agino_range. This is a read-only variable but
>     shares a cacheline with m_active_trans which is a global atomic that
>     gets bounced all around the machine.
> 
>     The workload is trying to run hundreds of thousands of transactions
>     per second and hence cacheline contention will be occurring on this
>     atomic counter. Hence xfs_agino_range() is likely just be an
>     innocent bystander as the cache coherency protocol fights over the
>     cacheline between CPU cores and sockets.
> 
>     On machine A, this rearrangement of the struct xfs_mount
>     results in the profile changing to:
> 
>        9.77%  [kernel]  [k] xfs_agino_range
>        6.27%  [kernel]  [k] __xfs_dir3_data_check
>        5.31%  [kernel]  [k] __pv_queued_spin_lock_slowpath
>        4.54%  [kernel]  [k] xfs_buf_find
>        3.79%  [kernel]  [k] do_raw_spin_lock
>        3.39%  [kernel]  [k] xfs_verify_ino
>        2.73%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
> 
>     Vastly less CPU usage in xfs_agino_range(), but still 3x the amount
>     of machine B and still runs substantially slower than it should.
> 
>     Current rm -rf of 50 million files:
> 
>                     vanilla         patched
>     machine A       13m20s          6m42s
>     machine B       5m30s           5m02s
> 
>     It's an improvement, hence indicating that separation and further
>     optimisation of read-only global filesystem data is worthwhile, but
>     it clearly isn't the underlying issue causing this specific
>     performance degradation.
> 
>     Signed-off-by: Dave Chinner <mailto:dchinner@xxxxxxxxxx>
>     Reviewed-by: Christoph Hellwig <mailto:hch@xxxxxx>
>     Reviewed-by: Darrick J. Wong <mailto:darrick.wong@xxxxxxxxxx>
>     Signed-off-by: Darrick J. Wong <mailto:darrick.wong@xxxxxxxxxx>
> 
> Notice how much of an improvement occurred on the 2x8p system vs a
> single 16p core when the false sharing was removed? The 16p core showed
> ~10% reduction in CPU time, whilst the 2x8p showed a 50% reduction in CPU
> time. That's the sort of gains I'd expect if false sharing was an issue for this
> workload. The lack of multi-socket performance improvement tends to
> indicate that false sharing is not occurring and that something else has
> resulted in the single socket performance increases....
> 
> Cheers,
> 
> Dave.
> --
> Dave Chinner
> mailto:david@xxxxxxxxxxxxx

Thanks for your questions and comments.

Certainly, both data buffers and file sizes are very small in UnixBench. The file copy 
test results are only shown to prove that the patch has no negative effect on the 
IO performance or some other codes path. In the real world, different performance
 problems have different bottlenecks and need different corresponding solutions. 
Here, this patch can only reduce false sharing in high-contention situations where 
different threads and processes handle the same file. Therefore, there is no obvious 
performance improvement if the bottleneck is something else. However, a reasonable 
layout of file struct is still necessary.

As it is known well, '____cacheline_aligned_in_smp' can be used to separate data into 
different cache lines. However, the size of the corresponding struct also becomes larger 
which could cause more memory costs and performance regression in some other cases. 
Therefore, in this patch, the file struct is only re-layouted carefully. 
To deep dive into the performance, perf c2c is used instead of perf stat. Because the count 
of HITM (Hit Modification) can help to identify the problem of false sharing. The perf c2c 
results are attached as plain text.

With the analysis of the perf c2c result. 
(1) In all of the four cases, more than 99% HITM is caused by the same address,  which is the 
file operator actually. 
(2) Compare the results between running on single NUMA node and running on multi NUMA 
nodes. Both load remote HITM and load remote DRAM increase much more, which caused 
performance regression. In UnixBench, the syscall test run less iterations in specific
 duration (10s) when it is running across numa nodes and the HITM is also less even if the
 same kernel is used. 
(3) Compare the results between kernel with patch and kernel without patch.  The number 
of HITM is decreased, no matter the test is running on single NUMA node or multiple NUMA 
nodes.

Above all, this patch can reduce false sharing caused by file struct in high-contention situation. 
However, if the false sharing is not the bottleneck (i.e. when running on multiple NUMA nodes, 
remote DRAM access may be the new bottleneck), this patch cannot bring obvious 
performance improvement.

If any else performance data should be shown, please feel free to let me know. Cheers.

(a.1) Single Numa Node
=================================================
            Trace Event Information
=================================================
  Total records                     :   12591573
  Locked Load/Store Operations      :     371850
  Load Operations                   :     594693
  Loads - uncacheable               :          0
  Loads - IO                        :          0
  Loads - Miss                      :          0
  Loads - no mapping                :          0
  Load Fill Buffer Hit              :     136631
  Load L1D hit                      :     346615
  Load L2D hit                      :        113
  Load LLC hit                      :     111273
  Load Local HITM                   :      86212
  Load Remote HITM                  :         16
  Load Remote HIT                   :          0
  Load Local DRAM                   :         16
  Load Remote DRAM                  :         45
  Load MESI State Exclusive         :         45
  Load MESI State Shared            :         16
  Load LLC Misses                   :         77
  Load access blocked by data       :        320
  Load access blocked by address    :     359024
=================================================
           Shared Data Cache Line Table
=================================================
#
#        ----------- Cacheline ---------------      Tot  ------- Load Hitm -------    Total    
# Index             Address  Node  PA cnt     Hitm    Total  LclHitm  RmtHitm  
# .....  ..................  ....  ......  .......  .......  .......  .......  .......  .......  .......  .......  
#
      0  0xff2dfd8dfd4cd200     1   86964   99.80%    86058    86058        0   

(a.2) Multi Numa Node
=================================================
            Trace Event Information
=================================================
  Total records                     :   12168050
  Locked Load/Store Operations      :     293778
  Load Operations                   :     422976
  Loads - uncacheable               :          2
  Loads - IO                        :          0
  Loads - Miss                      :          0
  Loads - no mapping                :          0
  Load Fill Buffer Hit              :      99324
  Load L1D hit                      :     248307
  Load L2D hit                      :        130
  Load LLC hit                      :      58779
  Load Local HITM                   :      48665
  Load Remote HITM                  :      13471
  Load Remote HIT                   :          0
  Load Local DRAM                   :       2658
  Load Remote DRAM                  :      13776
  Load MESI State Exclusive         :      13776
  Load MESI State Shared            :       2658
  Load LLC Misses                   :      29905
  Load access blocked by data       :        373
  Load access blocked by address    :     258037
=================================================
           Shared Data Cache Line Table
=================================================
#
#        ----------- Cacheline ---------------      Tot  ------- Load Hitm -------    
# Index             Address  Node  PA cnt     Hitm    Total  LclHitm  RmtHitm  
# .....  ..................  ....  ......  .......  .......  .......  .......  .......  .......  .......  .......  
#
      0  0xff2dfd8dfd4cd200     1   59737   99.43%    61780    48409    13371   

(b) Without Patch
(b.1) Single Numa Node
=================================================
            Trace Event Information
=================================================
  Total records                     :   12919299
  Locked Load/Store Operations      :     373015
  Load Operations                   :     764465
  Loads - uncacheable               :          0
  Loads - IO                        :          0
  Loads - Miss                      :          0
  Loads - no mapping                :          0
  Load Fill Buffer Hit              :     269056
  Load L1D hit                      :     395309
  Load L2D hit                      :         75
  Load LLC hit                      :     100013
  Load Local HITM                   :      95001
  Load Remote HITM                  :          3
  Load Remote HIT                   :          0
  Load Local DRAM                   :          7
  Load Remote DRAM                  :          5
  Load MESI State Exclusive         :          5
  Load MESI State Shared            :          7
  Load LLC Misses                   :         15
  Load access blocked by data       :        335
  Load access blocked by address    :     545957
=================================================
           Shared Data Cache Line Table
=================================================
#
#        ----------- Cacheline ---------------      Tot  ------- Load Hitm -------    
# Index             Address  Node  PA cnt     Hitm    Total  LclHitm  RmtHitm  
# .....  ..................  ....  ......  .......  .......  .......  .......  .......  .......  .......  .......  
#
      0  0xff33448e58b68400     0  220190   99.79%    94801    94801        0   

(b.2) Multi Numa Node
=================================================
            Trace Event Information
=================================================
  Total records                     :   12409259
  Locked Load/Store Operations      :     325557
  Load Operations                   :     625900
  Loads - uncacheable               :          0
  Loads - IO                        :          0
  Loads - Miss                      :         18
  Loads - no mapping                :          0
  Load Fill Buffer Hit              :     219323
  Load L1D hit                      :     324152
  Load L2D hit                      :         87
  Load LLC hit                      :      64345
  Load Local HITM                   :      61334
  Load Remote HITM                  :      16979
  Load Remote HIT                   :          0
  Load Local DRAM                   :        756
  Load Remote DRAM                  :      17219
  Load MESI State Exclusive         :      17219
  Load MESI State Shared            :        756
  Load LLC Misses                   :      34954
  Load access blocked by data       :        387
  Load access blocked by address    :     443196
=================================================
           Shared Data Cache Line Table
=================================================
#
#        ----------- Cacheline ---------------      Tot  ------- Load Hitm -------    
# Index             Address  Node  PA cnt     Hitm    Total  LclHitm  RmtHitm  
# .....  ..................  ....  ......  .......  .......  .......  .......  .......  .......  .......  .......  
#
      0  0xff33448e58b68400     0  190537   99.71%    78088    61176    16912   

Best Regards
Zhiyin