On Wed, May 10, 2023 at 04:50:56PM +0800, Wang Yugui wrote:
> Hi,
>
>
> > On Wed, May 10, 2023 at 01:46:49PM +0800, Wang Yugui wrote:
> > > > Ok, that is further back in time than I expected. In terms of XFS,
> > > > there are only two commits between 5.16..5.17 that might impact
> > > > performance:
> > > >
> > > > ebb7fb1557b1 ("xfs, iomap: limit individual ioend chain lengths in writeback")
> > > >
> > > > and
> > > >
> > > > 6795801366da ("xfs: Support large folios")
> > > >
> > > > To test whether ebb7fb1557b1 is the cause, go to
> > > > fs/iomap/buffered-io.c and change:
> > > >
> > > > -#define IOEND_BATCH_SIZE 4096
> > > > +#define IOEND_BATCH_SIZE 1048576
> > > > This will increase the IO submission chain lengths to at least 4GB
> > > > from the 16MB bound that was placed on 5.17 and newer kernels.
> > > >
> > > > To test whether 6795801366da is the cause, go to fs/xfs/xfs_icache.c
> > > > and comment out both calls to mapping_set_large_folios(). This will
> > > > ensure the page cache only instantiates single page folios the same
> > > > as 5.16 would have.
> > > 6.1.x with 'mapping_set_large_folios remove' and 'IOEND_BATCH_SIZE=1048576'
> > > fio WRITE: bw=6451MiB/s (6764MB/s)
> > >
> > > still a performance regression when compared to linux 5.16.20
> > > fio WRITE: bw=7666MiB/s (8039MB/s),
> > >
> > > but the performance regression is not too big, so it is difficult to bisect.
> > > We noticed the same level of performance regression on btrfs too,
> > > so maybe there is some problem in code that is used by both btrfs and xfs,
> > > such as iomap and mm/folio.
> > Yup, that's quite possibly something like the multi-gen LRU changes,
> > but that's not the regression we need to find. :/
> >
> > > 6.1.x with 'mapping_set_large_folios remove' only
> > > fio WRITE: bw=2676MiB/s (2806MB/s)
> > >
> > > 6.1.x with 'IOEND_BATCH_SIZE=1048576' only
> > > fio WRITE: bw=5092MiB/s (5339MB/s),
> > > fio WRITE: bw=6076MiB/s (6371MB/s)
> > >
> > > maybe we need more fixes for ebb7fb1557b1 ("xfs, iomap: limit
> > > individual ioend chain lengths in writeback").
> > OK, can you re-run the two 6.1.x kernels above (the slow and the
> > fast) and record the output of `iostat -dxm 1` whilst the
> > fio test is running? I want to see what the overall differences in
> > the IO load on the devices are between the two runs. This will tell
> > us how the IO sizes and queue depths change between the two kernels,
> > etc.
> `iostat -dxm 1` result saved in attachment file.
> good.txt  good performance
> bad.txt   bad performance

Thanks!

What I see here is that neither the good nor the bad config is able
to drive the hardware to 100% utilisation, but the way the IO stack
is behaving is identical. The only difference is that the good config
is driving much more IO to the devices, such that the top level RAID0
stripe reports ~90% utilisation vs 50% utilisation.

What this says to me is that the limitation in throughput is that the
single threaded background IO submission (the bdi-flush thread) is
CPU bound in both cases, and that the difference is in how much CPU
each IO submission is consuming.

From some tests here at lower bandwidth (1-2GB/s) with a batch size
of 4096, I'm seeing the vast majority of submission CPU time being
spent in folio_start_writeback(), and the vast majority of CPU time
in IO completion being spent in folio_end_writeback(). There's an
order of magnitude more CPU time in these functions than in any of
the XFS or iomap writeback functions.
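As an aside, for anyone who wants to poke at this writeback
submission path without the original fio job file, a dumb streaming
buffered writer is enough to drive the same bdi-flusher code.
Something like the illustrative program below works - the file path,
buffer size and total size are arbitrary choices for illustration,
not the reported fio configuration:

/*
 * Illustrative only: a minimal streaming buffered-write generator.
 * Build: gcc -O2 -o streamwrite streamwrite.c
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE	(1024 * 1024)		/* 1MiB per write() */
#define TOTAL_SIZE	(8ULL << 30)		/* 8GiB of dirty data */

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/mnt/test/streamfile";
	unsigned long long written = 0;
	char *buf;
	int fd;

	buf = malloc(BUF_SIZE);
	if (!buf) {
		perror("malloc");
		return 1;
	}
	memset(buf, 0xa5, BUF_SIZE);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * Dirty the page cache as fast as possible; writeback is left
	 * entirely to the bdi-flusher thread, which is what the profile
	 * below is looking at.
	 */
	while (written < TOTAL_SIZE) {
		ssize_t ret = write(fd, buf, BUF_SIZE);

		if (ret < 0) {
			perror("write");
			return 1;
		}
		written += ret;
	}

	close(fd);
	free(buf);
	return 0;
}

Run that against the filesystem under test whilst watching the
flusher thread with `perf top -g -U` and the devices with
`iostat -dxm 1`.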
A typical 5 second expanded snapshot profile (from `perf top -g -U`)
of the bdi-flusher thread looks like this:

  99.22%  3.68%  [kernel]  [k] write_cache_pages
   - 65.13% write_cache_pages
      - 46.84% iomap_do_writepage
         - 35.50% __folio_start_writeback
            - 7.94% _raw_spin_lock_irqsave
               - 11.35% do_raw_spin_lock
                    __pv_queued_spin_lock_slowpath
            - 5.37% _raw_spin_unlock_irqrestore
               - 5.32% do_raw_spin_unlock
                    __raw_callee_save___pv_queued_spin_unlock
            - 0.92% asm_common_interrupt
                 common_interrupt
                 __common_interrupt
                 handle_edge_irq
                 handle_irq_event
                 __handle_irq_event_percpu
                 vring_interrupt
                 virtblk_done
            - 4.18% __mod_lruvec_page_state
               - 2.18% __mod_lruvec_state
                    1.16% __mod_node_page_state
                    0.68% __mod_memcg_lruvec_state
                 0.90% __mod_memcg_lruvec_state
              2.88% xas_descend
              1.63% percpu_counter_add_batch
              1.63% mod_zone_page_state
              1.15% xas_load
              1.11% xas_start
              0.93% __rcu_read_unlock
            - 0.89% folio_memcg_lock
                 0.63% asm_common_interrupt
                    common_interrupt
                    __common_interrupt
                    handle_edge_irq
                    handle_irq_event
                    __handle_irq_event_percpu
                    vring_interrupt
                    virtblk_done
                    virtblk_complete_batch
                    blk_mq_end_request_batch
                    bio_endio
                    iomap_writepage_end_bio
                    iomap_finish_ioend
         - 2.75% xfs_map_blocks
            - 1.55% __might_sleep
                 1.26% __might_resched
         - 1.90% bio_add_folio
              1.13% __bio_try_merge_page
         - 1.82% submit_bio
            - submit_bio_noacct
               - 1.82% submit_bio_noacct_nocheck
                  - __submit_bio
                       1.77% blk_mq_submit_bio
           1.27% inode_to_bdi
           1.19% xas_clear_mark
           0.65% xas_set_mark
           0.57% iomap_page_create.isra.0
      - 12.91% folio_clear_dirty_for_io
         - 2.72% __mod_lruvec_page_state
            - 1.84% __mod_lruvec_state
                 0.98% __mod_node_page_state
                 0.58% __mod_memcg_lruvec_state
           1.55% mod_zone_page_state
           1.49% percpu_counter_add_batch
         - 0.72% asm_common_interrupt
              common_interrupt
              __common_interrupt
              handle_edge_irq
              handle_irq_event
              __handle_irq_event_percpu
              vring_interrupt
              virtblk_done
              virtblk_complete_batch
              blk_mq_end_request_batch
              bio_endio
              iomap_writepage_end_bio
              iomap_finish_ioend
           0.55% folio_mkclean
   - 8.08% filemap_get_folios_tag
        1.84% xas_find_marked
   - 1.89% __pagevec_release
        1.87% release_pages
   - 1.65% __might_sleep
        1.33% __might_resched
     1.22% folio_unlock
   - 3.68% ret_from_fork
        kthread
        worker_thread
        process_one_work
        wb_workfn
        wb_writeback
        __writeback_inodes_wb
        writeback_sb_inodes
        __writeback_single_inode
        do_writepages
        xfs_vm_writepages
        iomap_writepages
        write_cache_pages

This indicates that 35% of writeback submission CPU is in
__folio_start_writeback(), 13% is in folio_clear_dirty_for_io(), 8%
is in filemap_get_folios_tag() and only ~8% of CPU time is in the
rest of the iomap/XFS code building and submitting bios from the
folios passed to it. i.e. it looks a lot like writeback is contending
with the incoming write(), IO completion and memory reclaim contexts
for access to the page cache mapping and mm accounting structures.

Unfortunately, I don't have access to hardware that I can use to
confirm this is the cause, but it doesn't look like it's directly an
XFS/iomap issue at this point. The larger batch sizes reduce both
memory reclaim and IO completion competition with submission, so it
kinda points in this direction.

I suspect we need to start using high order folios in the write path
where we have large user IOs for streaming writes, but I also wonder
if there isn't some sort of batched accounting/mapping tree updates
we could do for all the adjacent folios in a single bio.... An
untested, hand-waving sketch of the sort of thing I mean is appended
below.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
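Completely untested sketch of the batched mapping tree update idea
mentioned above. The function name is made up, and all the per-folio
state and accounting details that __folio_start_writeback() has to
get right are elided; it only exists to show the shape of taking the
mapping->i_pages lock once per bio rather than once per folio:

/* Untested sketch only - not a real patch. */
static void iomap_writeback_tag_bio_folios(struct address_space *mapping,
					   struct bio *bio)
{
	struct folio_iter fi;
	unsigned long flags;
	long nr_pages = 0;

	/*
	 * One pass over all the folios we just added to this bio, under a
	 * single i_pages lock hold instead of one lock cycle per folio.
	 */
	xa_lock_irqsave(&mapping->i_pages, flags);
	bio_for_each_folio_all(fi, bio) {
		struct folio *folio = fi.folio;

		/*
		 * Per-folio writeback state (PG_writeback, etc.) would be
		 * set here, as __folio_start_writeback() does today.
		 */
		__xa_set_mark(&mapping->i_pages, folio->index,
			      PAGECACHE_TAG_WRITEBACK);
		nr_pages += folio_nr_pages(folio);
	}
	xa_unlock_irqrestore(&mapping->i_pages, flags);

	/*
	 * XXX: apply the NR_WRITEBACK/WB_WRITEBACK/zone stat deltas for
	 * nr_pages in one hit here, rather than doing a set of percpu and
	 * memcg updates for every folio in the bio.
	 */
}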