Hi,

> On Mon, May 08, 2023 at 10:46:12PM +0800, Wang Yugui wrote:
> > Hi,
> >
> > > Hi,
> > >
> > > I noticed a performance regression of xfs 6.1.27/6.1.23,
> > > compared to xfs 5.15.110.
> > >
> > > It is not yet clear whether it is a problem of xfs or lvm2.
> > >
> > > Any guide to troubleshoot it?
> > >
> > > test case:
> > > disk: NVMe PCIe3 SSD *4
> > > LVM: raid0, default stripe size 64K
> > > fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30
> > > -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=4
> > > -directory=/mnt/test
> > >
> > > 6.1.27/6.1.23
> > > fio bw=2623MiB/s (2750MB/s)
> > > perf report:
> > > Samples: 330K of event 'cycles', Event count (approx.): 120739812790
> > > Overhead  Command  Shared Object      Symbol
> > >   31.07%  fio      [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
> > >    5.11%  fio      [kernel.kallsyms]  [k] iomap_set_range_uptodate.part.24
> > >    3.36%  fio      [kernel.kallsyms]  [k] asm_exc_nmi
> > >    3.29%  fio      [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
> > >    2.27%  fio      [kernel.kallsyms]  [k] iomap_write_begin
> > >    2.18%  fio      [kernel.kallsyms]  [k] get_page_from_freelist
> > >    2.11%  fio      [kernel.kallsyms]  [k] xas_load
> > >    2.10%  fio      [kernel.kallsyms]  [k] xas_descend
> > >
> > > 5.15.110
> > > fio bw=6796MiB/s (7126MB/s)
> > > perf report:
> > > Samples: 267K of event 'cycles', Event count (approx.): 186688803871
> > > Overhead  Command  Shared Object      Symbol
> > >   38.09%  fio      [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
> > >    6.76%  fio      [kernel.kallsyms]  [k] iomap_set_range_uptodate
> > >    4.40%  fio      [kernel.kallsyms]  [k] xas_load
> > >    3.94%  fio      [kernel.kallsyms]  [k] get_page_from_freelist
> > >    3.04%  fio      [kernel.kallsyms]  [k] asm_exc_nmi
> > >    1.97%  fio      [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
> > >    1.88%  fio      [kernel.kallsyms]  [k] __pagevec_lru_add
> > >    1.53%  fio      [kernel.kallsyms]  [k] iomap_write_begin
> > >    1.53%  fio      [kernel.kallsyms]  [k] __add_to_page_cache_locked
> > >    1.41%  fio      [kernel.kallsyms]  [k] xas_start
>
> Because you are testing buffered IO, you need to run perf across all
> CPUs and tasks, not just the fio process so that it captures the
> profile of memory reclaim and writeback that is being performed by
> the kernel.

'perf report' across all CPUs:

Samples: 211K of event 'cycles', Event count (approx.): 56590727219
Overhead  Command          Shared Object      Symbol
  16.29%  fio              [kernel.kallsyms]  [k] rep_movs_alternative
   3.38%  kworker/u98:1+f  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
   3.11%  fio              [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
   3.05%  swapper          [kernel.kallsyms]  [k] intel_idle
   2.63%  fio              [kernel.kallsyms]  [k] get_page_from_freelist
   2.33%  fio              [kernel.kallsyms]  [k] asm_exc_nmi
   2.26%  kworker/u98:1+f  [kernel.kallsyms]  [k] __folio_start_writeback
   1.40%  fio              [kernel.kallsyms]  [k] __filemap_add_folio
   1.37%  fio              [kernel.kallsyms]  [k] lru_add_fn
   1.35%  fio              [kernel.kallsyms]  [k] xas_load
   1.33%  fio              [kernel.kallsyms]  [k] iomap_write_begin
   1.31%  fio              [kernel.kallsyms]  [k] xas_descend
   1.19%  kworker/u98:1+f  [kernel.kallsyms]  [k] folio_clear_dirty_for_io
   1.07%  fio              [kernel.kallsyms]  [k] folio_add_lru
   1.01%  fio              [kernel.kallsyms]  [k] __folio_mark_dirty
   1.00%  kworker/u98:1+f  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave

'top' shows that 'kworker/u98:1' has over 80% CPU usage.
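
For completeness, a system-wide profile like the one above can be
captured roughly as follows (a sketch only; the 30s duration and the
sort keys are assumptions, not the exact command that was used):

  # record all CPUs and all tasks, with call graphs, while fio runs
  perf record -a -g -- sleep 30

  # then summarize by command, DSO and symbol
  perf report --sort comm,dso,symbol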

> > more info:
> >
> > 1, 6.2.14 has the same performance as 6.1.x
> >
> > 2, 6.1.x fio performance detail:
> > Jobs: 4 (f=4): [W(4)][16.7%][w=10.2GiB/s][w=10.4k IOPS][eta 00m:15s]
> > Jobs: 4 (f=4): [W(4)][25.0%][w=9949MiB/s][w=9949 IOPS][eta 00m:12s]
> > Jobs: 4 (f=4): [W(4)][31.2%][w=9618MiB/s][w=9618 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][37.5%][w=7970MiB/s][w=7970 IOPS][eta 00m:10s]
> > Jobs: 4 (f=4): [W(4)][41.2%][w=5048MiB/s][w=5047 IOPS][eta 00m:10s]
> > Jobs: 4 (f=4): [W(4)][42.1%][w=2489MiB/s][w=2488 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][42.9%][w=3227MiB/s][w=3226 IOPS][eta 00m:12s]
> > Jobs: 4 (f=4): [W(4)][45.5%][w=3622MiB/s][w=3622 IOPS][eta 00m:12s]
> > Jobs: 4 (f=4): [W(4)][47.8%][w=3651MiB/s][w=3650 IOPS][eta 00m:12s]
> > Jobs: 4 (f=4): [W(4)][52.2%][w=3435MiB/s][w=3435 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][52.0%][w=2464MiB/s][w=2463 IOPS][eta 00m:12s]
> > Jobs: 4 (f=4): [W(4)][53.8%][w=2438MiB/s][w=2438 IOPS][eta 00m:12s]
>
> Looks like it's throttled on dirty pages at this point.
>
> How much memory does your test system have, and what does changing
> the writeback throttling thresholds do? What's the numa layout?

This is a Dell server with two E5-2680 v2 CPUs configured as two NUMA
nodes. There is 192G of memory in this server: 128G is installed on
CPU1 and 64G on CPU2.

The sysctl config:
  vm.dirty_background_bytes = 1073741824   # 1G
  vm.dirty_background_ratio = 0
  vm.dirty_bytes = 8589934592              # 8G
  vm.dirty_ratio = 0

The NVMe SSD *4 are connected to a single NVMe HBA on CPU1.

> Watching the stats in /proc/meminfo would be useful. I tend to use
> Performance Co-Pilot (PCP) to collect these sorts of stats and plot
> them in real time so I can see how the state of the machine is
> changing as the test is running....
>
> > Jobs: 4 (f=4): [W(4)][55.6%][w=2435MiB/s][w=2434 IOPS][eta 00m:12s]
> > Jobs: 4 (f=4): [W(4)][57.1%][w=2449MiB/s][w=2448 IOPS][eta 00m:12s]
> > Jobs: 4 (f=4): [W(4)][60.7%][w=2422MiB/s][w=2421 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][62.1%][w=2457MiB/s][w=2457 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][63.3%][w=2436MiB/s][w=2436 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][64.5%][w=2432MiB/s][w=2431 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][67.7%][w=2440MiB/s][w=2440 IOPS][eta 00m:10s]
> > Jobs: 4 (f=4): [W(4)][71.0%][w=2437MiB/s][w=2437 IOPS][eta 00m:09s]
> > Jobs: 4 (f=4): [W(4)][74.2%][w=2442MiB/s][w=2442 IOPS][eta 00m:08s]
> > Jobs: 4 (f=4): [W(4)][77.4%][w=2425MiB/s][w=2424 IOPS][eta 00m:07s]
> > Jobs: 4 (f=4): [W(4)][80.6%][w=2459MiB/s][w=2459 IOPS][eta 00m:06s]
> > Jobs: 4 (f=4): [W(4)][86.7%][w=2428MiB/s][w=2427 IOPS][eta 00m:04s]
> > Jobs: 4 (f=4): [W(4)][90.0%][w=2441MiB/s][w=2440 IOPS][eta 00m:03s]
> > Jobs: 4 (f=4): [W(4)][93.3%][w=2438MiB/s][w=2437 IOPS][eta 00m:02s]
> > Jobs: 4 (f=4): [W(4)][96.7%][w=2450MiB/s][w=2449 IOPS][eta 00m:01s]
> > Jobs: 4 (f=4): [W(4)][100.0%][w=2430MiB/s][w=2430 IOPS][eta 00m:00s]
> > Jobs: 4 (f=4): [F(4)][100.0%][w=2372MiB/s][w=2372 IOPS][eta 00m:00s]
>
> fsync at the end is instant, which indicates that writing into the
> kernel is almost certainly being throttled.
>
> > 5.15 fio performance detail:
> > Jobs: 4 (f=4): [W(4)][14.3%][w=8563MiB/s][w=8563 IOPS][eta 00m:18s]
> > Jobs: 4 (f=4): [W(4)][18.2%][w=6376MiB/s][w=6375 IOPS][eta 00m:18s]
> > Jobs: 4 (f=4): [W(4)][20.8%][w=4566MiB/s][w=4565 IOPS][eta 00m:19s]
> > Jobs: 4 (f=4): [W(4)][23.1%][w=3947MiB/s][w=3947 IOPS][eta 00m:20s]
>
> So throttling starts and perf drops...
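
Regarding the question above about changing the writeback throttling
thresholds and watching /proc/meminfo, a rough way to do both is
sketched below (the 4G/16G values are arbitrary assumptions for a
comparison run, not recommended settings):

  # watch dirty/writeback page counts once per second while fio runs
  watch -n 1 'grep -E "^(Dirty|Writeback|MemFree):" /proc/meminfo'

  # temporarily raise the throttling thresholds, then re-run the test
  sysctl -w vm.dirty_background_bytes=$((4 * 1024 * 1024 * 1024))   # 4G
  sysctl -w vm.dirty_bytes=$((16 * 1024 * 1024 * 1024))             # 16G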

> > Jobs: 4 (f=4): [W(4)][25.9%][w=4601MiB/s][w=4601 IOPS][eta 00m:20s]
> > Jobs: 4 (f=4): [W(4)][28.6%][w=5797MiB/s][w=5796 IOPS][eta 00m:20s]
> > Jobs: 4 (f=4): [W(4)][32.1%][w=6802MiB/s][w=6801 IOPS][eta 00m:19s]
> > Jobs: 4 (f=4): [W(4)][35.7%][w=7411MiB/s][w=7411 IOPS][eta 00m:18s]
> > Jobs: 4 (f=4): [W(4)][40.7%][w=8445MiB/s][w=8444 IOPS][eta 00m:16s]
>
> .... and then it picks back up....
>
> > Jobs: 4 (f=4): [W(4)][46.2%][w=7992MiB/s][w=7992 IOPS][eta 00m:14s]
> > Jobs: 4 (f=4): [W(4)][52.0%][w=8118MiB/s][w=8117 IOPS][eta 00m:12s]
> > Jobs: 4 (f=4): [W(4)][56.0%][w=7742MiB/s][w=7741 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][62.5%][w=7497MiB/s][w=7496 IOPS][eta 00m:09s]
> > Jobs: 4 (f=4): [W(4)][66.7%][w=7248MiB/s][w=7248 IOPS][eta 00m:08s]
> > Jobs: 4 (f=4): [W(4)][70.8%][w=7461MiB/s][w=7460 IOPS][eta 00m:07s]
> > Jobs: 4 (f=4): [W(4)][75.0%][w=7959MiB/s][w=7959 IOPS][eta 00m:06s]
> > Jobs: 4 (f=4): [W(3),F(1)][79.2%][w=6982MiB/s][w=6982 IOPS][eta 00m:05s]
> > Jobs: 1 (f=1): [_(2),W(1),_(1)][87.0%][w=2809MiB/s][w=2808 IOPS][eta 00m:03s]
> > Jobs: 1 (f=1): [_(2),W(1),_(1)][95.5%][w=2669MiB/s][w=2668 IOPS][eta 00m:01s]
> > Jobs: 1 (f=1): [_(2),F(1),_(1)][100.0%][w=2552MiB/s][w=2552 IOPS][eta 00m:00s]
>
> I suspect that the likely culprit is mm-level changes - the
> page reclaim algorithm was completely replaced in 6.1 with a
> multi-generation LRU that will have different cache footprint
> behaviour in exactly this sort of "repeatedly over-write same files
> in a set that are significantly larger than memory" micro-benchmark.
>
> i.e. these commits:
>
> 07017acb0601 mm: multi-gen LRU: admin guide
> d6c3af7d8a2b mm: multi-gen LRU: debugfs interface
> 1332a809d95a mm: multi-gen LRU: thrashing prevention
> 354ed5974429 mm: multi-gen LRU: kill switch
> f76c83378851 mm: multi-gen LRU: optimize multiple memcgs
> bd74fdaea146 mm: multi-gen LRU: support page table walks
> 018ee47f1489 mm: multi-gen LRU: exploit locality in rmap
> ac35a4902374 mm: multi-gen LRU: minimal implementation
> ec1c86b25f4b mm: multi-gen LRU: groundwork
>
> If that's the case, I'd expect kernels up to 6.0 to demonstrate the
> same behaviour as 5.15, and 6.1+ to demonstrate the same behaviour
> as you've reported.

I tested 6.4.0-rc1; the performance became a little worse.

Best Regards
Wang Yugui (wangyugui@xxxxxxxxxxxx)
2023/05/09

> So I'm thinking that the best thing to do is confirm that the change
> of behaviour is a result of the multi-gen LRU changes. If it is,
> then it's up to the multi-gen LRU developers to determine how to fix
> it. If it's not the multi-gen LRU, we'll have to keep digging.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
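
One way to confirm whether the multi-gen LRU is responsible would be to
toggle its runtime kill switch and re-run the fio test (a sketch,
assuming the kernel is built with CONFIG_LRU_GEN; the sysfs path is the
one documented in the MGLRU admin guide):

  # check the current state (0x0007 means fully enabled)
  cat /sys/kernel/mm/lru_gen/enabled

  # disable MGLRU, re-run fio, then re-enable it and compare
  echo n > /sys/kernel/mm/lru_gen/enabled
  echo y > /sys/kernel/mm/lru_gen/enabled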