Re: performance regression between 6.1.x and 5.15.x

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 10 May 2023 17:27:06 +1000

On Wed, May 10, 2023 at 01:46:49PM +0800, Wang Yugui wrote:
> > Ok, that is further back in time than I expected. In terms of XFS,
> > there are only two commits between 5.16..5.17 that might impact
> > performance:
> > 
> > ebb7fb1557b1 ("xfs, iomap: limit individual ioend chain lengths in writeback")
> > 
> > and
> > 
> > 6795801366da ("xfs: Support large folios")
> > 
> > To test whether ebb7fb1557b1 is the cause, go to
> > fs/iomap/buffered-io.c and change:
> > 
> > -#define IOEND_BATCH_SIZE        4096
> > +#define IOEND_BATCH_SIZE        1048576
> > 
> > This will increase the IO submission chain lengths to at least 4GB
> > from the 16MB bound that was placed on 5.17 and newer kernels.
> > 
> > To test whether 6795801366da is the cause, go to fs/xfs/xfs_icache.c
> > and comment out both calls to mapping_set_large_folios(). This will
> > ensure the page cache only instantiates single page folios the same
> > as 5.16 would have.
> 
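
For reference, the 16MB and 4GB figures above are just the batch size
multiplied by the page size. A quick sketch of that arithmetic, assuming
4KiB pages and that each batch unit is at least one page (the helper name
`chain_bound_bytes` is made up for illustration, it is not a kernel symbol):

```python
# Assumption: 4 KiB pages, and each unit counted against IOEND_BATCH_SIZE
# covers at least one page. With large folios a unit can cover more,
# which is why the larger figure is quoted as "at least".
PAGE_SIZE = 4096


def chain_bound_bytes(batch_size, page_size=PAGE_SIZE):
    """Lower bound, in bytes, on the IO covered by one full batch."""
    return batch_size * page_size


old_bound = chain_bound_bytes(4096)       # value in 5.17 and newer
new_bound = chain_bound_bytes(1048576)    # value suggested for the test

print(f"IOEND_BATCH_SIZE=4096    -> {old_bound // 2**20} MiB")
print(f"IOEND_BATCH_SIZE=1048576 -> {new_bound // 2**30} GiB")
```

With single page folios (large folios disabled) the bound is exact rather
than a lower limit.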
> 6.1.x with 'mapping_set_large_folios remove' and 'IOEND_BATCH_SIZE=1048576'
> 	fio WRITE: bw=6451MiB/s (6764MB/s)
> 
> still a performance regression when compared to linux 5.16.20
> 	fio WRITE: bw=7666MiB/s (8039MB/s),
> 
> but the performance regression is not too big, so it is difficult to bisect.
> We noticed the same level of performance regression on btrfs too,
> so maybe there is some problem in code that is used by both btrfs and xfs,
> such as iomap and mm/folio.

Yup, that's quite possibly something like the multi-gen LRU changes,
but that's not the regression we need to find. :/

> 6.1.x with 'mapping_set_large_folios remove' only
> 	fio WRITE: bw=2676MiB/s (2806MB/s)
> 
> 6.1.x with 'IOEND_BATCH_SIZE=1048576' only
> 	fio WRITE: bw=5092MiB/s (5339MB/s),
> 	fio WRITE: bw=6076MiB/s (6371MB/s)
> 
> maybe we need more fixes for ebb7fb1557b1 ("xfs, iomap: limit
> individual ioend chain lengths in writeback").

OK, can you re-run the two 6.1.x kernels above (the slow and the
fast) and record the output of `iostat -dxm 1` whilst the
fio test is running? I want to see what the overall differences in
the IO load on the devices are between the two runs. This will tell
us how the IO sizes and queue depths change between the two kernels,
etc.
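
If it helps when comparing the two logs, something like the rough sketch
below averages a few columns per device from saved `iostat -dxm 1` output.
It is only an illustration: the column names (`wMB/s`, `wareq-sz`,
`aqu-sz`) are an assumption that matches recent sysstat releases, older
ones use different names, and the function name `summarize` is made up
here.

```python
from collections import defaultdict

# Assumption: these column names are what recent sysstat versions print for
# `iostat -dxm`. Older versions use other names (e.g. avgrq-sz, avgqu-sz),
# so adjust this tuple to match the header in your own log.
WANTED = ("wMB/s", "wareq-sz", "aqu-sz")


def summarize(lines, wanted=WANTED):
    """Return {device: {column: mean value}} over all reports in the log."""
    header = None
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("Device"):
            header = fields          # each report repeats the header line
            continue
        if header is None or len(fields) != len(header):
            continue                 # banner or other non-device line
        try:
            row = {n: float(v) for n, v in zip(header[1:], fields[1:])}
        except ValueError:
            continue                 # not a numeric device row
        dev = fields[0]
        counts[dev] += 1
        for name in wanted:
            if name in row:
                sums[dev][name] += row[name]
    return {
        dev: {name: total / counts[dev] for name, total in cols.items()}
        for dev, cols in sums.items()
    }
```

Calling `summarize(open("iostat-slow.log"))` and the same for the fast
kernel's log (both file names are placeholders) gives two small tables to
put side by side. Note that the first report iostat prints is the average
since boot, so it is worth trimming that from the log before averaging.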

Right now I'm suspecting a contention interaction between write(),
do_writepages() and folio_end_writeback()...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx