IOWs, having too much parallelism in writeback for the underlying storage and/or filesystem can be far more harmful to system performance under load than having too little parallelism to drive the filesystem/hardware to its maximum performance.
With the increasing speed of devices, we would like to improve the performance of buffered IO as well. This will help applications (DB, AI/ML) that use buffered IO. If more parallelism causes side effects, we can reduce it using some factor like:
1) Writeback context per NUMA node.
2) Fixed number of writeback contexts, say min(10, numcpu).
3) NUMCPU/N number of writeback contexts.
4) Writeback context based on FS geometry, like per AG for XFS, as per your suggestion.
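To make the options above concrete, here is a rough sketch of how each one would size the number of writeback contexts. This is only an illustration, not the patchset's code: the function name, the policy strings, and the default values for the divisor N and the AG count are all hypothetical.

```python
import os


def writeback_context_count(policy, num_cpus, numa_nodes=1, divisor=4, ag_count=4):
    """Return how many writeback contexts a given sizing policy would create.

    All names and defaults here are hypothetical; they only illustrate the
    arithmetic behind the options listed in the mail.
    """
    if policy == "per_cpu":
        # Baseline being discussed: one context per CPU.
        return num_cpus
    if policy == "per_numa_node":
        # Option 1: one context per NUMA node.
        return numa_nodes
    if policy == "fixed_cap":
        # Option 2: a fixed upper bound, min(10, numcpu).
        return min(10, num_cpus)
    if policy == "cpus_div_n":
        # Option 3: NUMCPU/N, never dropping below a single context.
        return max(1, num_cpus // divisor)
    if policy == "per_ag":
        # Option 4: follow filesystem geometry, e.g. one context per XFS AG.
        return ag_count
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    for policy in ("per_cpu", "per_numa_node", "fixed_cap", "cpus_div_n", "per_ag"):
        print(policy, writeback_context_count(policy, cpus))
```

The NUMA node count and AG count are passed in as plain numbers here; a real implementation would take them from the system topology and the filesystem geometry.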
What kernel is that from?
6.12.0-rc4+ commit d165768847839f8d1ae5f8081ecc018a190d50e8
i.e. with enough RAM, this random write workload using buffered IO is pretty much guaranteed to outperform direct IO regardless of the underlying writeback concurrency.
We tested while making sure enough RAM is available for both buffered and direct IO. On a system with 32GB RAM we issued 24GB of IO through 24 jobs on a PMEM device.

    fio --directory=/mnt --name=test --bs=4k --iodepth=1024 --rw=randwrite \
        --ioengine=io_uring --time_based=1 --runtime=120 --numjobs=24 --size=1G \
        --direct=1 --eta-interval=1 --eta-newline=1 --group_reporting

The results show direct IO exceeding buffered IO by a big margin (%improvement is relative to buffered):

    BW (MiB/s)          buffered   dontcache   %improvement   direct   %improvement
    randwrite (bs=4k)   3393       5397        59.06%         9315     174.53%
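For reference, the %improvement columns can be recomputed from the bandwidth numbers, taking buffered IO as the baseline. A minimal sketch (the helper name is ours, not part of fio or any tool used in the test):

```python
def pct_improvement(baseline, value):
    """Percentage gain of `value` over `baseline` (both in MiB/s)."""
    return (value / baseline - 1.0) * 100.0


# Bandwidth figures from the randwrite (bs=4k) row of the table.
buffered = 3393
dontcache = 5397
direct = 9315

if __name__ == "__main__":
    print(f"dontcache vs buffered: {pct_improvement(buffered, dontcache):.2f}%")
    print(f"direct vs buffered:    {pct_improvement(buffered, direct):.2f}%")
```

This agrees with the reported percentages to within rounding in the last digit.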
IMO, what we need writeback to be optimised for is asynchronous IO dispatch through each filesystem; our writeback IOPS problems in XFS largely stem from the per-IO CPU overhead of block allocation in the filesystem (i.e. delayed allocation).
This is a good idea, but it means we will not be able to parallelize within an AG. I will spend some time building a POC with per-AG writeback contexts, and compare it with per-CPU writeback in terms of performance and extent fragmentation. Other filesystems using delayed allocation will also need a similar scheme.