On Fri, Feb 25, 2022 at 07:31:17PM +0530, Chandan Babu R wrote:
> Hi Dave,
>
> When the following fio command is executed,
>
> fio --eta=always --output=runlogs/randwrite4k_64jobs.out -name fio.test
>     --directory=/data --rw=randwrite --bs=4k --size=4G --ioengine=libaio
>     --iodepth=16 --direct=1 --time_based=1 --runtime=900 --randrepeat=1
>     --gtod_reduce=1 --group_reporting=1 --numjobs=64

So, random writes to a dataset of 256GB of files?

How is fio preallocating the run files - by writing zeros or by
fallocate()? If fallocate(), then unwritten extent conversion is needed
after every random write, and that's going to end up being the
performance limiting factor.

With a workload like this, where there are 1024 random writes in flight
at once, we end up with 1024 kworkers running DIO completion, all
racing to convert unwritten extents. They all take out unwritten extent
conversion transaction reservations, and that runs the log out of
reservation space.

$ pgrep kworker | wc -l
3025
$

When the log runs out of space, we switch to the reservation slowpath
where tasks are put to sleep, which means log space reservation goes
from lockless to spinlock serialised:

-   62.46%     0.00%  [kernel]  [k] iomap_dio_complete_work
   - iomap_dio_complete_work
      - 62.34% iomap_dio_complete
         - 62.31% xfs_dio_write_end_io
            - 62.30% xfs_iomap_write_unwritten
               - 36.25% xfs_trans_alloc_inode
                  - 35.09% xfs_trans_alloc
                     - 35.08% xfs_trans_reserve
                        - 35.05% xfs_log_reserve
                           - 34.99% xlog_grant_head_check
                              - 20.16% _raw_spin_lock
                                 - do_raw_spin_lock
                                      20.13% __pv_queued_spin_lock_slowpath
                              - 14.14% xlog_grant_head_wait
                                 - 13.90% _raw_spin_lock
                                    - do_raw_spin_lock
                                         13.88% __pv_queued_spin_lock_slowpath
                              - 0.66% xlog_grant_head_wake
                                 - 0.64% wake_up_process
                                    - try_to_wake_up
                                       - 0.55% _raw_spin_lock_irqsave
                                            do_raw_spin_lock
                  - 1.14% xfs_ilock
                     - down_write
                        - 1.11% rwsem_down_write_slowpath
                             0.56% rwsem_spin_on_owner
               - 24.80% xfs_trans_commit
                  - __xfs_trans_commit
                     - 24.77% xlog_cil_commit
                        - 24.27% xfs_log_ticket_ungrant
                           - 24.23% xfs_log_space_wake
                              - 23.15% _raw_spin_lock
                                 - do_raw_spin_lock
                                      23.14% __pv_queued_spin_lock_slowpath
                              - 1.04% xlog_grant_head_wake
                                 - 1.01% wake_up_process
                                    - 0.99% try_to_wake_up
                                       - 0.84% _raw_spin_lock_irqsave
                                            do_raw_spin_lock
               - 1.14% xfs_bmapi_write
                  - 1.06% xfs_bmapi_convert_unwritten
                       1.04% xfs_bmap_add_extent_unwritten_real

And because there are 1024 threads banging on that transaction
reservation, we get severe lock contention. As a result, until all the
files have been converted to written extents, the workload only runs at
40-50k write IOPS while burning tens of CPUs on lock contention.

As the random writes hit already-converted extents more often, the IOPS
rate goes up because those writes don't need conversion at completion.
After 5 minutes of runtime on my machine, it's up to about 100k IOPS.
But, really, performance is nowhere near what the hardware can produce
because of the unwritten extent conversion behaviour under these
workloads - unbounded parallelism is bad for performance.

After 12 minutes, all the files are fully converted and it runs at
1607k IOPS (1.6 million IOPS), which is the limit of what the hardware
I have can support. End result was:

  write: IOPS=206k, BW=805MiB/s (844MB/s)(708GiB/900009msec); 0 zone resets

Crap performance compared to what the hardware can do.
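For what it's worth, the unwritten extent setup can be reproduced in
isolation outside of fio with a minimal sketch like the one below. The
path and sizes are arbitrary examples; it preallocates with fallocate()
and then does a single aligned 4kB direct write. Running "xfs_bmap -v"
on the file before and after shows the preallocated range as unwritten,
with just the written 4kB converted afterwards.

/*
 * Minimal sketch of the fallocate() + direct write pattern discussed
 * above. The path and sizes are arbitrary; inspect the extent state
 * with "xfs_bmap -v <file>" before and after running it.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/data/prealloc-demo";
	void *buf;
	int fd;

	fd = open(path, O_RDWR | O_CREAT | O_DIRECT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Preallocate 4GB; on XFS this is unwritten extents, no data yet. */
	if (fallocate(fd, 0, 0, 4ULL << 30) < 0) {
		perror("fallocate");
		return 1;
	}

	/*
	 * One aligned 4kB direct write. The data goes straight to disk,
	 * but IO completion still has to run a transaction to convert
	 * the underlying unwritten extent to written.
	 */
	if (posix_memalign(&buf, 4096, 4096)) {
		fprintf(stderr, "posix_memalign failed\n");
		return 1;
	}
	memset(buf, 0xa5, 4096);
	if (pwrite(fd, buf, 4096, 0) != 4096) {
		perror("pwrite");
		return 1;
	}

	close(fd);
	free(buf);
	return 0;
}

Scale that single conversion up to 1024 writes in flight across 64
files and you get the completion storm in the profile above.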
So, if we add '-fallocate=none' and re-run the workload immediately, it
runs at 1.6M IOPS right from the start, because the 4GB work files are
already on disk and fully written:

  write: IOPS=1593k, BW=6222MiB/s (6525MB/s)(2844GiB/468091msec); 0 zone resets

So the perf problem you are seeing here is unbounded concurrency in
unwritten extent conversion running the journal out of reservation
space, not a problem with XFS journal performance....

> on an XFS instance having the following geometry,
>
> # xfs_info /dev/tank/lvm
> meta-data=/dev/mapper/tank-lvm   isize=512    agcount=32, agsize=97675376 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>          =                       reflink=1
> data     =                       bsize=4096   blocks=3125612032, imaxpct=5
>          =                       sunit=16     swidth=64 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=16 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> # dmsetup table
> tank-lvm: 0 25004900352 striped 4 128 259:3 2048 259:2 2048 259:5 2048 259:8 2048
>
> # lsblk
> nvme0n1       259:0    0   2.9T  0 disk
> └─nvme0n1p1   259:3    0   2.9T  0 part
>   └─tank-lvm  252:0    0  11.7T  0 lvm  /data
> nvme1n1       259:1    0   2.9T  0 disk
> └─nvme1n1p1   259:2    0   2.9T  0 part
>   └─tank-lvm  252:0    0  11.7T  0 lvm  /data
> nvme2n1       259:4    0   2.9T  0 disk
> └─nvme2n1p1   259:5    0   2.9T  0 part
>   └─tank-lvm  252:0    0  11.7T  0 lvm  /data
> nvme3n1       259:6    0   2.9T  0 disk
> └─nvme3n1p1   259:8    0   2.9T  0 part
>   └─tank-lvm  252:0    0  11.7T  0 lvm  /data

Yup, I just got the above numbers and profile from a 4-way RAID0 stripe
of Intel DC4800 SSDs, which can do (combined) about 1.6 million random
4kB write IOPS.

> ... The following results are produced
>
> ------------------------------------------------------
>  Kernel                                     Write IOPS
> ------------------------------------------------------
>  v5.4                                       1050k
>  b843299ba5f9a430dd26ecd02ee2fef805f19844   1040k
>  0e7ab7efe77451cba4cbecb6c9f5ef83cf32b36b    835k
>  v5.17-rc4                                   909k
> ------------------------------------------------------
>
> The commit 0e7ab7efe77451cba4cbecb6c9f5ef83cf32b36b (xfs: Throttle commits on
> delayed background CIL push) causes tasks (which commit transactions to the
> CIL) to get blocked (if cil->xc_ctx->space_used >=
> XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) until the CIL push worker is executed.

Yup, that fixed a CIL space overrun violation which could corrupt the
journal on disk: it allowed the CIL to overcommit the journal space on
disk, so the head of the log could overwrite the tail of the log before
the metadata at the tail of the log had been written back to disk.

IOWs, the speed seen before this commit was only possible because the
CIL batch sizes were allowed to grow illegally large - they were
unbounded and so could exceed half the size of the journal. We must be
able to fit two complete checkpoints in the log at any given time for
recovery to be able to find the head and tail of the log and recover
them. If we get two checkpoints in a row that are larger than half the
log, the second checkpoint overwrites the tail of the previous
checkpoint and we corrupt the log.

This occurred when the CIL push worker got delayed (e.g. when thousands
of pending unwritten extent conversions drown the system in busy
kworkers, as seen above), because the CIL could keep growing right up
until the push kworker finally got scheduled and ran.
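To make the shape of that throttle concrete, it behaves roughly like the
sketch below. This is a paraphrase of the logic described in the quoted
commit summary, not the actual kernel code; the workqueue handle and
the wait queue and sequence field names are approximations.

/*
 * Rough paraphrase of the CIL commit throttle described above - not the
 * actual kernel code. A task committing a transaction to the CIL kicks
 * the background push once the soft space threshold is crossed, and
 * goes to sleep once the hard blocking limit is crossed, so a single
 * checkpoint cannot keep growing while the push worker waits to run.
 */
static void cil_push_background(struct xlog *log)
{
	struct xfs_cil	*cil = log->l_cilp;

	/* Below the soft limit: nothing to do yet. */
	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log))
		return;

	spin_lock(&cil->xc_push_lock);

	/* Make sure a background push of this CIL context is queued. */
	if (cil->xc_push_seq < cil->xc_current_sequence) {
		cil->xc_push_seq = cil->xc_current_sequence;
		queue_work(cil_push_wq, &cil->xc_push_work); /* placeholder wq */
	}

	/*
	 * Over the hard blocking limit: sleep until the push worker has
	 * swapped in a new, empty CIL context and woken us. xlog_wait()
	 * drops the push lock before sleeping.
	 */
	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
		xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
		return;
	}

	spin_unlock(&cil->xc_push_lock);
}

The second check is the one the quoted commit adds: heavy committers now
wait for the push worker instead of growing the checkpoint without
bound.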
It's no surprise that this makes the fio workload run slower - without
the patch the workload is likely violating the journal checkpoint size
limits, and if we crash at that point log recovery will fail and lots
of data will be lost... As I always say: correctness first, performance
second.

If we need to improve or restore performance on this workload, we need
to address the behaviour of unwritten extent conversion and reduce the
overhead of performing millions of tiny, independent conversion
transactions like this. A good start would be to bound the maximum
concurrency of workqueue-based DIO completion to stop it from being
able to fully exhaust log reservations with blocked conversions (e.g.
one completion worker thread per inode, not one completion thread per
IO).

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx