On Tue, Jul 11, 2023 at 05:31:13PM +0200, Eugene K. wrote:
> Hello.
>
> During investigation of a flapping performance problem, it was detected
> that once a process writes a big amount of data in a row, the filesystem
> focuses on this writing and no other process can perform any IO on this
> filesystem.

What hardware are you testing on? RAM, CPUs, SSD models, etc.

Also, xfs_info for the filesystem you are testing, the output of
'grep . /proc/sys/vm/*', as well as dumps of 'iostat -dxm 1' and
'vmstat 1' while you are running the test.

Also capture the dmesg output of 'echo w > /proc/sysrq-trigger' and
'cat /proc/meminfo' multiple times while the test is running.

> We have noticed huge %iowait on software raid1 (mdraid) that runs on
> 2 SSD drives - on every attempt to write more than 1GB.

I would expect "huge" iowait for this workload because the bandwidth of
the pipe is much greater than the bandwidth of your MD device, and so
the writes to the fs get throttled in balance_dirty_pages_ratelimited()
once a certain percentage of RAM is dirtied.

> The issue happens on any server running a 6.4.2, 6.4.0, 6.3.3 or 6.2.12
> kernel. Upon investigating and testing, it appeared that server IO
> performance can be completely killed with a single command:
>
> # cat /dev/zero > ./removeme

Flat profile from 'perf top -U':

  35.85%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   6.86%  [kernel]  [k] rep_movs_alternative
   5.92%  [kernel]  [k] do_raw_spin_lock
   5.61%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   2.62%  [kernel]  [k] rep_stos_alternative
   2.25%  [kernel]  [k] do_raw_spin_unlock
   1.77%  [kernel]  [k] __folio_end_writeback
   1.68%  [kernel]  [k] xas_start
   1.60%  [kernel]  [k] xas_descend
   1.46%  [kernel]  [k] __remove_mapping
   1.36%  [kernel]  [k] __folio_start_writeback
   1.05%  [kernel]  [k] __filemap_add_folio
   0.98%  [kernel]  [k] iomap_write_begin
   0.87%  [kernel]  [k] percpu_counter_add_batch
   0.83%  [kernel]  [k] folio_clear_dirty_for_io
   0.82%  [kernel]  [k] get_page_from_freelist
   0.79%  [kernel]  [k] iomap_write_end
   0.78%  [kernel]  [k] inode_to_bdi
   0.72%  [kernel]  [k] folio_unlock
   0.71%  [kernel]  [k] node_dirty_ok
   0.71%  [kernel]  [k] __mod_node_page_state
   0.65%  [kernel]  [k] write_cache_pages
   0.65%  [kernel]  [k] __might_resched
   0.65%  [kernel]  [k] __mod_lruvec_page_state
   0.64%  [kernel]  [k] iomap_do_writepage
   0.57%  [kernel]  [k] xas_store
   0.53%  [kernel]  [k] shrink_folio_list
   0.50%  [kernel]  [k] balance_dirty_pages_ratelimited_flags
   0.49%  [kernel]  [k] __mod_memcg_lruvec_state
   0.49%  [kernel]  [k] filemap_dirty_folio
   0.48%  [kernel]  [k] __folio_mark_dirty
   0.48%  [kernel]  [k] __rmqueue_pcplist
   0.48%  [kernel]  [k] __rcu_read_lock
   0.45%  [kernel]  [k] xas_load
   0.43%  [kernel]  [k] __mod_zone_page_state
   0.40%  [kernel]  [k] lru_add_fn
   0.39%  [kernel]  [k] __list_del_entry_valid
   0.38%  [kernel]  [k] mod_zone_page_state
   0.37%  [kernel]  [k] __filemap_remove_folio
   0.36%  [kernel]  [k] node_page_state
   0.34%  [kernel]  [k] __filemap_get_folio
   0.33%  [kernel]  [k] filemap_get_folios_tag
   0.31%  [kernel]  [k] isolate_lru_folios
   0.30%  [kernel]  [k] folio_end_writeback

Almost nothing XFS there - it's all lock contention in the page cache.
This smells of mapping tree lock contention.

Yup, the callgraph profile indicates all that lock contention is on the
mapping tree, between kswapd (multiple processes), the write process,
the writeback worker and the XFS IO completion worker.

Hmmm - system is definitely slow.
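For reference, a rough sketch of collecting the data asked for above in
one pass, while the reproducer runs in another shell. The output file
names, the /mnt/scratch mount point and the 30 second perf sample window
are placeholders, not values from this thread:

#!/bin/bash
# Sketch only: run as root alongside 'cat /dev/zero > ./removeme'.
# All paths and durations below are assumptions.
grep . /proc/sys/vm/* > vm-sysctls.txt 2>/dev/null
xfs_info /mnt/scratch > xfs_info.txt           # filesystem under test

iostat -dxm 1 > iostat.log &  IOSTAT=$!
vmstat 1      > vmstat.log &  VMSTAT=$!
( while sleep 5; do
      date              >> meminfo.log
      cat /proc/meminfo >> meminfo.log
      echo w > /proc/sysrq-trigger             # blocked-task stacks -> dmesg
  done ) &  SAMPLER=$!

perf record -a -g -- sleep 30                  # system-wide callgraph sample
perf report --stdio > perf-callgraph.txt

kill $IOSTAT $VMSTAT $SAMPLER
dmesg > dmesg.txt

The sysrq 'w' dumps show where the blocked tasks are sleeping, and the
perf callgraph report shows who is spinning on which lock.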
Ah - the write to the file fills all of free memory with page cache
pages on the same mapping, then every memory allocation requires
reclaiming memory, and so they go into direct reclaim and that adds
even more lock contention to the mapping tree lock....

IOWs, this looks like a mapping tree lock contention problem at its
core. The mapping tree is exposed to unbounded concurrency in these
sorts of situations.

> assuming the ~/removeme file resides on rootfs and rootfs is XFS.

Doesn't need to be the root fs - I just did it on an XFS filesystem
mounted on /mnt/scratch with an ext3 rootfs.

> While running this, the server becomes so unresponsive that after ~15
> seconds it's not even possible to login via ssh!

Direct memory reclaim getting stuck on the mapping lock because it adds
to the contention problem?

> We did reproduce this on every machine with XFS as rootfs running the
> mentioned kernels. However, when we converted rootfs from XFS to EXT4
> (and btrfs), the problem disappeared - with the same OS, same kernel
> binary, same hardware, just using ext4 or btrfs instead of xfs.

Experience has taught me that XFS tends to trigger lock contention
problems in generic code sooner than other filesystems. So this
wouldn't be unexpected, but if the cause is really mapping tree lock
contention then XFS is just the canary....

> Note. During the hang and being unresponsive, SSD drives are
> writing data at expected performance. Just all the processes
> except the writing one hang.

Yup, that's definitely expected - everything on the write() and
writeback side is running at full IO speed, it's just that everything
else is thrashing on the mapping tree waiting for IO to clean pages....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
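The non-root reproduction mentioned above can be sketched roughly as
follows. /dev/sdX and /mnt/scratch are placeholder names, and mkfs will
destroy whatever is on that device, so point it at a scratch disk only:

#!/bin/bash
# Sketch only: recreate the single-streaming-writer test on a scratch
# XFS filesystem.  /dev/sdX is a placeholder and will be wiped.
mkfs.xfs -f /dev/sdX
mkdir -p /mnt/scratch
mount /dev/sdX /mnt/scratch

# One streaming writer; watch iowait and interactivity from another
# shell (e.g. 'vmstat 1') while this runs.
cat /dev/zero > /mnt/scratch/removeme &
WRITER=$!
sleep 60
kill $WRITER
rm -f /mnt/scratch/removeme
umount /mnt/scratch

Running mkfs.ext4 in place of mkfs.xfs on the same scratch device gives
the ext4 comparison described in the original report.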