On Tue, Jul 11, 2023 at 05:31:13PM +0200, Eugene K. wrote:
> Hello.
>
> During investigation of a flapping performance problem, it was detected
> that once a process writes a big amount of data in a row, the filesystem
> focuses on this writing and no other process can perform any IO on this
> filesystem.

What hardware are you testing on? RAM, CPUs, SSD models, etc.

Also, xfs_info for the filesystem you are testing, the output of
'grep . /proc/sys/vm/*', as well as dumps of 'iostat -dxm 1' and
'vmstat 1' while you are running the test.

Also capture the dmesg output of 'echo w > /proc/sysrq-trigger' and
'cat /proc/meminfo' multiple times while the test is running.

> We have noticed huge %iowait on software raid1 (mdraid) that runs on
> 2 SSD drives - on every attempt to write more than 1GB.

I would expect "huge" iowait for this workload because the bandwidth of
the pipe is much greater than the bandwidth of your MD device, and so
the writes to the fs get throttled in balance_dirty_pages_ratelimited()
once a certain percentage of RAM is dirtied.

> The issue happens on any server running a 6.4.2, 6.4.0, 6.3.3 or 6.2.12
> kernel. Upon investigating and testing, it appeared that server IO
> performance can be completely killed with a single command:
>
> # cat /dev/zero > ./removeme

Flat profile from 'perf top -U':

  35.85%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   6.86%  [kernel]  [k] rep_movs_alternative
   5.92%  [kernel]  [k] do_raw_spin_lock
   5.61%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   2.62%  [kernel]  [k] rep_stos_alternative
   2.25%  [kernel]  [k] do_raw_spin_unlock
   1.77%  [kernel]  [k] __folio_end_writeback
   1.68%  [kernel]  [k] xas_start
   1.60%  [kernel]  [k] xas_descend
   1.46%  [kernel]  [k] __remove_mapping
   1.36%  [kernel]  [k] __folio_start_writeback
   1.05%  [kernel]  [k] __filemap_add_folio
   0.98%  [kernel]  [k] iomap_write_begin
   0.87%  [kernel]  [k] percpu_counter_add_batch
   0.83%  [kernel]  [k] folio_clear_dirty_for_io
   0.82%  [kernel]  [k] get_page_from_freelist
   0.79%  [kernel]  [k] iomap_write_end
   0.78%  [kernel]  [k] inode_to_bdi
   0.72%  [kernel]  [k] folio_unlock
   0.71%  [kernel]  [k] node_dirty_ok
   0.71%  [kernel]  [k] __mod_node_page_state
   0.65%  [kernel]  [k] write_cache_pages
   0.65%  [kernel]  [k] __might_resched
   0.65%  [kernel]  [k] __mod_lruvec_page_state
   0.64%  [kernel]  [k] iomap_do_writepage
   0.57%  [kernel]  [k] xas_store
   0.53%  [kernel]  [k] shrink_folio_list
   0.50%  [kernel]  [k] balance_dirty_pages_ratelimited_flags
   0.49%  [kernel]  [k] __mod_memcg_lruvec_state
   0.49%  [kernel]  [k] filemap_dirty_folio
   0.48%  [kernel]  [k] __folio_mark_dirty
   0.48%  [kernel]  [k] __rmqueue_pcplist
   0.48%  [kernel]  [k] __rcu_read_lock
   0.45%  [kernel]  [k] xas_load
   0.43%  [kernel]  [k] __mod_zone_page_state
   0.40%  [kernel]  [k] lru_add_fn
   0.39%  [kernel]  [k] __list_del_entry_valid
   0.38%  [kernel]  [k] mod_zone_page_state
   0.37%  [kernel]  [k] __filemap_remove_folio
   0.36%  [kernel]  [k] node_page_state
   0.34%  [kernel]  [k] __filemap_get_folio
   0.33%  [kernel]  [k] filemap_get_folios_tag
   0.31%  [kernel]  [k] isolate_lru_folios
   0.30%  [kernel]  [k] folio_end_writeback

Almost nothing XFS there - it's all lock contention in the page cache.
This smells of mapping tree lock contention.

Yup, the callgraph profile indicates all that lock contention is on the
mapping tree, between kswapd (multiple processes), the write process,
the writeback worker and the XFS IO completion worker.

Hmmm - system is definitely slow.
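For reference, a rough sketch of collecting the data asked for above in
one pass, while the reproducer runs in another shell. The output file
names, the /mnt/scratch mount point and the 30 second perf sample window
are placeholders, not values from this thread:

#!/bin/bash
# Sketch only: run as root alongside 'cat /dev/zero > ./removeme'.
# All paths and durations below are assumptions.
grep . /proc/sys/vm/* > vm-sysctls.txt 2>/dev/null
xfs_info /mnt/scratch > xfs_info.txt           # filesystem under test

iostat -dxm 1 > iostat.log &  IOSTAT=$!
vmstat 1      > vmstat.log &  VMSTAT=$!
( while sleep 5; do
      date              >> meminfo.log
      cat /proc/meminfo >> meminfo.log
      echo w > /proc/sysrq-trigger             # blocked-task stacks -> dmesg
  done ) &  SAMPLER=$!

perf record -a -g -- sleep 30                  # system-wide callgraph sample
perf report --stdio > perf-callgraph.txt

kill $IOSTAT $VMSTAT $SAMPLER
dmesg > dmesg.txt

The sysrq 'w' dumps show where the blocked tasks are sleeping, and the
perf callgraph report shows who is spinning on which lock.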
Ah - the write to the file fills all of free memory with page cache
pages on the same mapping, then every memory allocation requires
reclaiming memory, and so they go into direct reclaim and that adds
even more lock contention to the mapping tree lock....

IOWs, this looks like a mapping tree lock contention problem at its
core. The mapping tree is exposed to unbounded concurrency in these
sorts of situations.

> assuming the ~/removeme file resides on rootfs and rootfs is XFS.

Doesn't need to be the root fs - I just did it on an XFS filesystem
mounted on /mnt/scratch with an ext3 rootfs.

> While running this, the server becomes so unresponsive that after ~15
> seconds it's not even possible to login via ssh!

Direct memory reclaim getting stuck on the mapping lock because it adds
to the contention problem?

> We did reproduce this on every machine with XFS as rootfs running the
> mentioned kernels. However, when we converted rootfs from XFS to EXT4
> (and btrfs), the problem disappeared - with the same OS, same kernel
> binary, same hardware, just using ext4 or btrfs instead of xfs.

Experience has taught me that XFS tends to trigger lock contention
problems in generic code sooner than other filesystems. So this
wouldn't be unexpected, but if the cause is really mapping tree lock
contention then XFS is just the canary....

> Note. During the hang and being unresponsive, SSD drives are
> writing data at expected performance. Just all the processes
> except the writing one hang.

Yup, that's definitely expected - everything on the write() and
writeback side is running at full IO speed, it's just that everything
else is thrashing on the mapping tree waiting for IO to clean pages....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
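The non-root reproduction mentioned above can be sketched roughly as
follows. /dev/sdX and /mnt/scratch are placeholder names, and mkfs will
destroy whatever is on that device, so point it at a scratch disk only:

#!/bin/bash
# Sketch only: recreate the single-streaming-writer test on a scratch
# XFS filesystem.  /dev/sdX is a placeholder and will be wiped.
mkfs.xfs -f /dev/sdX
mkdir -p /mnt/scratch
mount /dev/sdX /mnt/scratch

# One streaming writer; watch iowait and interactivity from another
# shell (e.g. 'vmstat 1') while this runs.
cat /dev/zero > /mnt/scratch/removeme &
WRITER=$!
sleep 60
kill $WRITER
rm -f /mnt/scratch/removeme
umount /mnt/scratch

Running mkfs.ext4 in place of mkfs.xfs on the same scratch device gives
the ext4 comparison described in the original report.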