On Wed, May 10, 2023 at 04:50:56PM +0800, Wang Yugui wrote:
> Hi,
>
>
> > On Wed, May 10, 2023 at 01:46:49PM +0800, Wang Yugui wrote:
> > > > Ok, that is further back in time than I expected. In terms of XFS,
> > > > there are only two commits between 5.16..5.17 that might impact
> > > > performance:
> > > >
> > > > ebb7fb1557b1 ("xfs, iomap: limit individual ioend chain lengths in writeback")
> > > >
> > > > and
> > > >
> > > > 6795801366da ("xfs: Support large folios")
> > > >
> > > > To test whether ebb7fb1557b1 is the cause, go to
> > > > fs/iomap/buffered-io.c and change:
> > > >
> > > > -#define IOEND_BATCH_SIZE 4096
> > > > +#define IOEND_BATCH_SIZE 1048576
> > > > This will increase the IO submission chain lengths to at least 4GB
> > > > from the 16MB bound that was placed on 5.17 and newer kernels.
> > > >
> > > > To test whether 6795801366da is the cause, go to fs/xfs/xfs_icache.c
> > > > and comment out both calls to mapping_set_large_folios(). This will
> > > > ensure the page cache only instantiates single page folios the same
> > > > as 5.16 would have.
> > > 6.1.x with 'mapping_set_large_folios remove' and 'IOEND_BATCH_SIZE=1048576'
> > > fio WRITE: bw=6451MiB/s (6764MB/s)
> > >
> > > still a performance regression when compared to linux 5.16.20
> > > fio WRITE: bw=7666MiB/s (8039MB/s),
> > >
> > > but the performance regression is not too big, so it is difficult to bisect.
> > > We noticed the same level of performance regression on btrfs too,
> > > so maybe there is some problem in code that is used by both btrfs and xfs,
> > > such as iomap and mm/folio.
> > Yup, that's quite possibly something like the multi-gen LRU changes,
> > but that's not the regression we need to find. :/
> >
> > > 6.1.x with 'mapping_set_large_folios remove' only
> > > fio WRITE: bw=2676MiB/s (2806MB/s)
> > >
> > > 6.1.x with 'IOEND_BATCH_SIZE=1048576' only
> > > fio WRITE: bw=5092MiB/s (5339MB/s),
> > > fio WRITE: bw=6076MiB/s (6371MB/s)
> > >
> > > maybe we need more fixes for ebb7fb1557b1 ("xfs, iomap: limit
> > > individual ioend chain lengths in writeback").
> > OK, can you re-run the two 6.1.x kernels above (the slow and the
> > fast) and record the output of `iostat -dxm 1` whilst the
> > fio test is running? I want to see what the overall differences in
> > the IO load on the devices are between the two runs. This will tell
> > us how the IO sizes and queue depths change between the two kernels,
> > etc.
> `iostat -dxm 1` result saved in attachment file.
> good.txt  good performance
> bad.txt   bad performance

Thanks!

What I see here is that neither the good nor the bad config is able
to drive the hardware to 100% utilisation, but the way the IO stack
is behaving is identical. The only difference is that the good config
is driving much more IO to the devices, such that the top level RAID0
stripe reports ~90% utilisation vs 50% utilisation.

What this says to me is that the limitation in throughput is that the
single threaded background IO submission (the bdi-flush thread) is
CPU bound in both cases, and that the difference is in how much CPU
each IO submission is consuming.

From some tests here at lower bandwidth (1-2GB/s) with a batch size
of 4096, I'm seeing the vast majority of submission CPU time being
spent in folio_start_writeback(), and the vast majority of CPU time
in IO completion being spent in folio_end_writeback(). There's an
order of magnitude more CPU time in these functions than in any of
the XFS or iomap writeback functions.
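As an aside, for anyone who wants to poke at this writeback
submission path without the original fio job file, a dumb streaming
buffered writer is enough to drive the same bdi-flusher code.
Something like the illustrative program below works - the file path,
buffer size and total size are arbitrary choices for illustration,
not the reported fio configuration:

/*
 * Illustrative only: a minimal streaming buffered-write generator.
 * Build: gcc -O2 -o streamwrite streamwrite.c
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE	(1024 * 1024)		/* 1MiB per write() */
#define TOTAL_SIZE	(8ULL << 30)		/* 8GiB of dirty data */

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/mnt/test/streamfile";
	unsigned long long written = 0;
	char *buf;
	int fd;

	buf = malloc(BUF_SIZE);
	if (!buf) {
		perror("malloc");
		return 1;
	}
	memset(buf, 0xa5, BUF_SIZE);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * Dirty the page cache as fast as possible; writeback is left
	 * entirely to the bdi-flusher thread, which is what the profile
	 * below is looking at.
	 */
	while (written < TOTAL_SIZE) {
		ssize_t ret = write(fd, buf, BUF_SIZE);

		if (ret < 0) {
			perror("write");
			return 1;
		}
		written += ret;
	}

	close(fd);
	free(buf);
	return 0;
}

Run that against the filesystem under test whilst watching the
flusher thread with `perf top -g -U` and the devices with
`iostat -dxm 1`.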
A typical 5 second expanded snapshot profile (from `perf top -g -U`)
of the bdi-flusher thread looks like this:

  99.22%  3.68%  [kernel]  [k] write_cache_pages
   - 65.13% write_cache_pages
      - 46.84% iomap_do_writepage
         - 35.50% __folio_start_writeback
            - 7.94% _raw_spin_lock_irqsave
               - 11.35% do_raw_spin_lock
                    __pv_queued_spin_lock_slowpath
            - 5.37% _raw_spin_unlock_irqrestore
               - 5.32% do_raw_spin_unlock
                    __raw_callee_save___pv_queued_spin_unlock
            - 0.92% asm_common_interrupt
                 common_interrupt
                 __common_interrupt
                 handle_edge_irq
                 handle_irq_event
                 __handle_irq_event_percpu
                 vring_interrupt
                 virtblk_done
            - 4.18% __mod_lruvec_page_state
               - 2.18% __mod_lruvec_state
                    1.16% __mod_node_page_state
                    0.68% __mod_memcg_lruvec_state
                 0.90% __mod_memcg_lruvec_state
              2.88% xas_descend
              1.63% percpu_counter_add_batch
              1.63% mod_zone_page_state
              1.15% xas_load
              1.11% xas_start
              0.93% __rcu_read_unlock
            - 0.89% folio_memcg_lock
                 0.63% asm_common_interrupt
                    common_interrupt
                    __common_interrupt
                    handle_edge_irq
                    handle_irq_event
                    __handle_irq_event_percpu
                    vring_interrupt
                    virtblk_done
                    virtblk_complete_batch
                    blk_mq_end_request_batch
                    bio_endio
                    iomap_writepage_end_bio
                    iomap_finish_ioend
         - 2.75% xfs_map_blocks
            - 1.55% __might_sleep
                 1.26% __might_resched
         - 1.90% bio_add_folio
              1.13% __bio_try_merge_page
         - 1.82% submit_bio
            - submit_bio_noacct
               - 1.82% submit_bio_noacct_nocheck
                  - __submit_bio
                       1.77% blk_mq_submit_bio
           1.27% inode_to_bdi
           1.19% xas_clear_mark
           0.65% xas_set_mark
           0.57% iomap_page_create.isra.0
      - 12.91% folio_clear_dirty_for_io
         - 2.72% __mod_lruvec_page_state
            - 1.84% __mod_lruvec_state
                 0.98% __mod_node_page_state
                 0.58% __mod_memcg_lruvec_state
           1.55% mod_zone_page_state
           1.49% percpu_counter_add_batch
         - 0.72% asm_common_interrupt
              common_interrupt
              __common_interrupt
              handle_edge_irq
              handle_irq_event
              __handle_irq_event_percpu
              vring_interrupt
              virtblk_done
              virtblk_complete_batch
              blk_mq_end_request_batch
              bio_endio
              iomap_writepage_end_bio
              iomap_finish_ioend
           0.55% folio_mkclean
   - 8.08% filemap_get_folios_tag
        1.84% xas_find_marked
   - 1.89% __pagevec_release
        1.87% release_pages
   - 1.65% __might_sleep
        1.33% __might_resched
     1.22% folio_unlock
   - 3.68% ret_from_fork
        kthread
        worker_thread
        process_one_work
        wb_workfn
        wb_writeback
        __writeback_inodes_wb
        writeback_sb_inodes
        __writeback_single_inode
        do_writepages
        xfs_vm_writepages
        iomap_writepages
        write_cache_pages

This indicates that 35% of writeback submission CPU is in
__folio_start_writeback(), 13% is in folio_clear_dirty_for_io(), 8%
is in filemap_get_folios_tag() and only ~8% of CPU time is in the
rest of the iomap/XFS code building and submitting bios from the
folios passed to it. i.e. it looks a lot like writeback is contending
with the incoming write(), IO completion and memory reclaim contexts
for access to the page cache mapping and mm accounting structures.

Unfortunately, I don't have access to hardware that I can use to
confirm this is the cause, but it doesn't look like it's directly an
XFS/iomap issue at this point. The larger batch sizes reduce both
memory reclaim and IO completion competition with submission, so it
kinda points in this direction.

I suspect we need to start using high order folios in the write path
where we have large user IOs for streaming writes, but I also wonder
if there isn't some sort of batched accounting/mapping tree updates
we could do for all the adjacent folios in a single bio.... An
untested, hand-waving sketch of the sort of thing I mean is appended
below.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
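Completely untested sketch of the batched mapping tree update idea
mentioned above. The function name is made up, and all the per-folio
state and accounting details that __folio_start_writeback() has to
get right are elided; it only exists to show the shape of taking the
mapping->i_pages lock once per bio rather than once per folio:

/* Untested sketch only - not a real patch. */
static void iomap_writeback_tag_bio_folios(struct address_space *mapping,
					   struct bio *bio)
{
	struct folio_iter fi;
	unsigned long flags;
	long nr_pages = 0;

	/*
	 * One pass over all the folios we just added to this bio, under a
	 * single i_pages lock hold instead of one lock cycle per folio.
	 */
	xa_lock_irqsave(&mapping->i_pages, flags);
	bio_for_each_folio_all(fi, bio) {
		struct folio *folio = fi.folio;

		/*
		 * Per-folio writeback state (PG_writeback, etc.) would be
		 * set here, as __folio_start_writeback() does today.
		 */
		__xa_set_mark(&mapping->i_pages, folio->index,
			      PAGECACHE_TAG_WRITEBACK);
		nr_pages += folio_nr_pages(folio);
	}
	xa_unlock_irqrestore(&mapping->i_pages, flags);

	/*
	 * XXX: apply the NR_WRITEBACK/WB_WRITEBACK/zone stat deltas for
	 * nr_pages in one hit here, rather than doing a set of percpu and
	 * memcg updates for every folio in the bio.
	 */
}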