Hi,

> On Tue, May 09, 2023 at 07:25:53AM +0800, Wang Yugui wrote:
> > Hi,
> >
> > > On Mon, May 08, 2023 at 10:46:12PM +0800, Wang Yugui wrote:
> > > > Hi,
> > > >
> > > > > Hi,
> > > > >
> > > > > I noticed a performance regression of xfs 6.1.27/6.1.23
> > > > > compared to xfs 5.15.110.
> > > > >
> > > > > It is not yet clear whether it is a problem of xfs or lvm2.
> > > > >
> > > > > Any guidance on how to troubleshoot it?
> > > > >
> > > > > test case:
> > > > > disk: NVMe PCIe3 SSD *4
> > > > > LVM: raid0, default stripe size 64K
> > > > > fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30
> > > > >   -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=4
> > > > >   -directory=/mnt/test
> > > > >
> > > > > 6.1.27/6.1.23
> > > > > fio bw=2623MiB/s (2750MB/s)
> > > > > perf report:
> > > > > Samples: 330K of event 'cycles', Event count (approx.): 120739812790
> > > > > Overhead  Command  Shared Object      Symbol
> > > > >   31.07%  fio      [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
> > > > >    5.11%  fio      [kernel.kallsyms]  [k] iomap_set_range_uptodate.part.24
> > > > >    3.36%  fio      [kernel.kallsyms]  [k] asm_exc_nmi
> > > > >    3.29%  fio      [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
> > > > >    2.27%  fio      [kernel.kallsyms]  [k] iomap_write_begin
> > > > >    2.18%  fio      [kernel.kallsyms]  [k] get_page_from_freelist
> > > > >    2.11%  fio      [kernel.kallsyms]  [k] xas_load
> > > > >    2.10%  fio      [kernel.kallsyms]  [k] xas_descend
> > > > >
> > > > > 5.15.110
> > > > > fio bw=6796MiB/s (7126MB/s)
> > > > > perf report:
> > > > > Samples: 267K of event 'cycles', Event count (approx.): 186688803871
> > > > > Overhead  Command  Shared Object      Symbol
> > > > >   38.09%  fio      [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
> > > > >    6.76%  fio      [kernel.kallsyms]  [k] iomap_set_range_uptodate
> > > > >    4.40%  fio      [kernel.kallsyms]  [k] xas_load
> > > > >    3.94%  fio      [kernel.kallsyms]  [k] get_page_from_freelist
> > > > >    3.04%  fio      [kernel.kallsyms]  [k] asm_exc_nmi
> > > > >    1.97%  fio      [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
> > > > >    1.88%  fio      [kernel.kallsyms]  [k] __pagevec_lru_add
> > > > >    1.53%  fio      [kernel.kallsyms]  [k] iomap_write_begin
> > > > >    1.53%  fio      [kernel.kallsyms]  [k] __add_to_page_cache_locked
> > > > >    1.41%  fio      [kernel.kallsyms]  [k] xas_start
> > >
> > > Because you are testing buffered IO, you need to run perf across all
> > > CPUs and tasks, not just the fio process, so that it captures the
> > > profile of memory reclaim and writeback that is being performed by
> > > the kernel.
> >
> > 'perf report' across all CPUs:
> > Samples: 211K of event 'cycles', Event count (approx.): 56590727219
> > Overhead  Command          Shared Object      Symbol
> >   16.29%  fio              [kernel.kallsyms]  [k] rep_movs_alternative
> >    3.38%  kworker/u98:1+f  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
> >    3.11%  fio              [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
> >    3.05%  swapper          [kernel.kallsyms]  [k] intel_idle
> >    2.63%  fio              [kernel.kallsyms]  [k] get_page_from_freelist
> >    2.33%  fio              [kernel.kallsyms]  [k] asm_exc_nmi
> >    2.26%  kworker/u98:1+f  [kernel.kallsyms]  [k] __folio_start_writeback
> >    1.40%  fio              [kernel.kallsyms]  [k] __filemap_add_folio
> >    1.37%  fio              [kernel.kallsyms]  [k] lru_add_fn
> >    1.35%  fio              [kernel.kallsyms]  [k] xas_load
> >    1.33%  fio              [kernel.kallsyms]  [k] iomap_write_begin
> >    1.31%  fio              [kernel.kallsyms]  [k] xas_descend
> >    1.19%  kworker/u98:1+f  [kernel.kallsyms]  [k] folio_clear_dirty_for_io
> >    1.07%  fio              [kernel.kallsyms]  [k] folio_add_lru
> >    1.01%  fio              [kernel.kallsyms]  [k] __folio_mark_dirty
> >    1.00%  kworker/u98:1+f  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> >
> > and 'top' shows that 'kworker/u98:1' has over 80% CPU usage.
>
> Can you provide an expanded callgraph profile for both the good and
> bad kernels showing the CPU used in the fio write() path and the
> kworker-based writeback path?

Sorry, could you give some detailed guidance on how to gather that
information for this test? The test machine here is already reserved.

> [ The test machine I have that I could reproduce this sort of
> performance anomaly went bad a month ago, so I have no hardware
> available to me right now to reproduce this behaviour locally.
> Hence I'll need you to do the profiling I need to understand the
> regression for me. ]
>
> > > I suspect that the likely culprit is mm-level changes - the
> > > page reclaim algorithm was completely replaced in 6.1 with a
> > > multi-generation LRU that will have different cache footprint
> > > behaviour in exactly this sort of "repeatedly over-write same files
> > > in a set that are significantly larger than memory" micro-benchmark.
> > >
> > > i.e. these commits:
> > >
> > > 07017acb0601 mm: multi-gen LRU: admin guide
> > > d6c3af7d8a2b mm: multi-gen LRU: debugfs interface
> > > 1332a809d95a mm: multi-gen LRU: thrashing prevention
> > > 354ed5974429 mm: multi-gen LRU: kill switch
> > > f76c83378851 mm: multi-gen LRU: optimize multiple memcgs
> > > bd74fdaea146 mm: multi-gen LRU: support page table walks
> > > 018ee47f1489 mm: multi-gen LRU: exploit locality in rmap
> > > ac35a4902374 mm: multi-gen LRU: minimal implementation
> > > ec1c86b25f4b mm: multi-gen LRU: groundwork
> > >
> > > If that's the case, I'd expect kernels up to 6.0 to demonstrate the
> > > same behaviour as 5.15, and 6.1+ to demonstrate the same behaviour
> > > as you've reported.
> >
> > I tested 6.4.0-rc1; the performance became a little worse.
>
> Thanks, that's as I expected.
>
> Which means that the interesting kernel versions to check now are a
> 6.0.x kernel, and then, if it has the same perf as 5.15.x, the
> commit before the multi-gen LRU was introduced vs the commit after
> the multi-gen LRU was introduced, to see if that is the functionality
> that introduced the regression....

More performance test results:

linux 6.0.18   fio WRITE: bw=2565MiB/s (2689MB/s)
linux 5.17.0   fio WRITE: bw=2602MiB/s (2729MB/s)
linux 5.16.20  fio WRITE: bw=7666MiB/s (8039MB/s)

So is it a problem introduced between 5.16.20 and 5.17.0?

Best Regards
Wang Yugui (wangyugui@xxxxxxxxxxxx)
2023/05/09
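
For reference, a minimal sketch of how the whole-system callgraph
profile requested above could be gathered (assuming perf with
call-graph support is available on the test machine; the output file
name is only a placeholder):

  # record across all CPUs and tasks, with call graphs, while fio runs
  perf record -a -g -o perf-callgraph.data -- \
      fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 \
          -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 \
          -numjobs=4 -directory=/mnt/test

  # expanded callgraph report, to compare the fio write() path and the
  # kworker-based writeback path on the good and bad kernels
  perf report -i perf-callgraph.data --no-children -g graph,0.5,caller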
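
As a quick cross-check of the multi-gen LRU hypothesis on a 6.1.x
kernel, the "kill switch" commit listed above suggests it can be
toggled at runtime (a sketch, assuming the kernel is built with
CONFIG_LRU_GEN and exposes the lru_gen sysfs knob):

  # check whether multi-gen LRU is currently enabled (non-zero bitmask)
  cat /sys/kernel/mm/lru_gen/enabled

  # disable it, re-run the fio job, then re-enable it
  echo n > /sys/kernel/mm/lru_gen/enabled
  echo y > /sys/kernel/mm/lru_gen/enabled

If the bandwidth does not recover with lru_gen disabled, that would
also point away from the multi-gen LRU, consistent with the 5.16.20
vs 5.17.0 numbers above.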
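
If the regression really lies between 5.16.20 and 5.17.0, a bisect of
the mainline range could narrow it down. A sketch (note 5.16.20 is a
stable release branched from v5.16, so the mainline range to test is
v5.16..v5.17, assuming v5.16 itself behaves like 5.16.20):

  git bisect start
  git bisect bad v5.17      # ~2.6GiB/s with the fio job above
  git bisect good v5.16     # assuming it matches the 5.16.20 result
  # at each step: build, boot, run the fio job, then mark the result
  git bisect good           # or 'git bisect bad' if the bandwidth drops
  # when finished
  git bisect reset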