On Tue, May 09, 2023 at 08:37:52PM +0800, Wang Yugui wrote:
> > On Tue, May 09, 2023 at 07:25:53AM +0800, Wang Yugui wrote:
> > > > On Mon, May 08, 2023 at 10:46:12PM +0800, Wang Yugui wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I noticed a performance regression of xfs 6.1.27/6.1.23,
> > > > > > compared to xfs 5.15.110.
> > > > > >
> > > > > > It is not yet clear whether it is a problem of xfs or of lvm2.
> > > > > >
> > > > > > Any guidance on how to troubleshoot it?
> > > > > >
> > > > > > test case:
> > > > > >   disk: NVMe PCIe3 SSD *4
> > > > > >   LVM: raid0, default stripe size 64K.
> > > > > >   fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30
> > > > > >   -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=4
> > > > > >   -directory=/mnt/test
.....
> > > > Because you are testing buffered IO, you need to run perf across all
> > > > CPUs and tasks, not just the fio process, so that it captures the
> > > > profile of the memory reclaim and writeback that is being performed by
> > > > the kernel.
> > >
> > > 'perf report' for all CPUs:
> > >
> > > Samples: 211K of event 'cycles', Event count (approx.): 56590727219
> > > Overhead  Command          Shared Object      Symbol
> > >   16.29%  fio              [kernel.kallsyms]  [k] rep_movs_alternative
> > >    3.38%  kworker/u98:1+f  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
> > >    3.11%  fio              [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
> > >    3.05%  swapper          [kernel.kallsyms]  [k] intel_idle
> > >    2.63%  fio              [kernel.kallsyms]  [k] get_page_from_freelist
> > >    2.33%  fio              [kernel.kallsyms]  [k] asm_exc_nmi
> > >    2.26%  kworker/u98:1+f  [kernel.kallsyms]  [k] __folio_start_writeback
> > >    1.40%  fio              [kernel.kallsyms]  [k] __filemap_add_folio
> > >    1.37%  fio              [kernel.kallsyms]  [k] lru_add_fn
> > >    1.35%  fio              [kernel.kallsyms]  [k] xas_load
> > >    1.33%  fio              [kernel.kallsyms]  [k] iomap_write_begin
> > >    1.31%  fio              [kernel.kallsyms]  [k] xas_descend
> > >    1.19%  kworker/u98:1+f  [kernel.kallsyms]  [k] folio_clear_dirty_for_io
> > >    1.07%  fio              [kernel.kallsyms]  [k] folio_add_lru
> > >    1.01%  fio              [kernel.kallsyms]  [k] __folio_mark_dirty
> > >    1.00%  kworker/u98:1+f  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> > >
> > > and 'top' shows that 'kworker/u98:1' has over 80% CPU usage.
> >
> > Can you provide an expanded callgraph profile for both the good and
> > bad kernels showing the CPU used in the fio write() path and the
> > kworker-based writeback path?
>
> I'm sorry, could you give some detailed guidance on how to gather that
> information for this test?

'perf record -g' and 'perf report -g' should enable callgraph profiling
and reporting. See the perf-record man page for '--call-graph' to make
sure you have the right kernel config for this to work efficiently.

You can do quick snapshots in time via 'perf top -U -g'; after a few
seconds, type 'E' and then immediately type 'P', and the fully expanded
callgraph profile will be written to a perf.hist.N file in the current
working directory... (A minimal command sketch is included further
below.)

> > > I tested 6.4.0-rc1. The performance became a little worse.
> >
> > Thanks, that's as I expected.
> >
> > Which means that the interesting kernel versions to check now are: a
> > 6.0.x kernel, and then, if it has the same perf as 5.15.x, the commit
> > before the multi-gen LRU was introduced vs the commit after it was
> > introduced, to see if that is the functionality that introduced the
> > regression....
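A minimal sketch of the whole-system callgraph capture described above,
assuming perf is installed, the kernel has usable callchain unwinding
(see the '--call-graph' notes above), and a 30 second capture window
roughly matching the fio runtime; all of those choices are illustrative
rather than taken from the original report:

  # record callchains (-g) across all CPUs and tasks (-a) while the fio
  # job runs; 'sleep 30' only bounds the capture window
  perf record -a -g -- sleep 30

  # expanded callgraph report from the recorded perf.data
  perf report -g

  # live snapshot: -U hides userspace symbols, -g shows callchains;
  # 'E' expands the callgraph, 'P' dumps it to a perf.hist.N file
  perf top -U -g

The point of the comparison is the CPU spent in the fio write() path
versus the kworker-based writeback path on the good and bad kernels.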
>
> More performance test results:
>
> linux 6.0.18
>   fio WRITE: bw=2565MiB/s (2689MB/s)
> linux 5.17.0
>   fio WRITE: bw=2602MiB/s (2729MB/s)
> linux 5.16.20
>   fio WRITE: bw=7666MiB/s (8039MB/s),
>
> So is it a problem between 5.16.20 and 5.17.0?

Ok, that is further back in time than I expected. In terms of XFS,
there are only two commits between 5.16..5.17 that might impact
performance:

ebb7fb1557b1 ("xfs, iomap: limit individual ioend chain lengths in writeback")

and

6795801366da ("xfs: Support large folios")

To test whether ebb7fb1557b1 is the cause, go to fs/iomap/buffered-io.c
and change:

-#define IOEND_BATCH_SIZE	4096
+#define IOEND_BATCH_SIZE	1048576

This will increase the IO submission chain lengths to at least 4GB,
up from the 16MB bound that was placed on 5.17 and newer kernels.

To test whether 6795801366da is the cause, go to fs/xfs/xfs_icache.c
and comment out both calls to mapping_set_large_folios(). This will
ensure the page cache only instantiates single-page folios, the same
as 5.16 would have.

If neither of them changes behaviour, then I think you're going to
need to do a bisect between 5.16..5.17 to find the commit that
introduced the regression (a rough sketch of the bisect steps follows
below). I know kernel bisects are slow and painful, but it's exactly
what I'd be doing right now if my performance test machine wasn't
broken....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
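If it does come down to a bisect, here is a rough sketch of the
workflow, assuming a mainline git tree and that each candidate kernel
is built, booted, and re-tested with the fio job from the original
report; the v5.16 and v5.17 tags stand in for the fast and slow kernels
measured above:

  git bisect start
  git bisect bad v5.17        # 5.17.0 showed the regressed bandwidth above
  git bisect good v5.16       # the 5.16 series was still fast
  # build, boot and test the kernel git bisect checks out, then mark it:
  git bisect good             # or 'git bisect bad', depending on the fio result
  # repeat until git reports the first bad commit, then clean up:
  git bisect reset

Each step only needs to distinguish the ~2.5GiB/s result from the
~7.5GiB/s result, so the 30 second fio run above should be enough to
classify each kernel.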