Re: [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback

> IOWs, having too much parallelism in writeback for the underlying
> storage and/or filesystem can be far more harmful to system
> performance under load than having too little parallelism to drive
> the filesystem/hardware to its maximum performance.

With device speeds increasing, we would like to improve the performance of
buffered IO as well; this will help applications (databases, AI/ML) that use
buffered I/O. If too much parallelism causes side effects, we can bound it
using a factor such as (see the sketch after this list):
1) A writeback context per NUMA node.
2) A fixed number of writeback contexts, say min(10, numcpu).
3) NUMCPU/N writeback contexts.
4) Writeback contexts based on FS geometry, e.g. per AG for XFS, as per your
   suggestion.
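
To make options 2 and 3 concrete, here is a minimal userspace sketch (not
kernel code) of how the context count could be derived; the cap of 10 and the
divisor N=4 are purely illustrative, and the per-NUMA-node and per-AG variants
would instead take their counts from the node topology (e.g. libnuma) and the
filesystem geometry:

/*
 * Illustrative sizing only: a real implementation would derive the
 * context count from NUMA topology or filesystem geometry instead.
 */
#include <stdio.h>
#include <unistd.h>

#define WB_CTX_CAP	10	/* option 2: fixed upper bound */
#define WB_CTX_DIV	4	/* option 3: NUMCPU/N with N = 4 */

int main(void)
{
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

	/* option 2: min(10, numcpu) */
	long fixed = ncpus < WB_CTX_CAP ? ncpus : WB_CTX_CAP;

	/* option 3: NUMCPU/N, but never fewer than one context */
	long scaled = ncpus / WB_CTX_DIV > 0 ? ncpus / WB_CTX_DIV : 1;

	printf("cpus=%ld min(%d,cpus)=%ld cpus/%d=%ld\n",
	       ncpus, WB_CTX_CAP, fixed, WB_CTX_DIV, scaled);
	return 0;
}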

> What kernel is that from?


6.12.0-rc4+ commit d165768847839f8d1ae5f8081ecc018a190d50e8

> i.e. with enough RAM, this random write workload using buffered IO
> is pretty much guaranteed to outperform direct IO regardless of the
> underlying writeback concurrency.

We tested while making sure RAM was available for both buffered and direct IO:
on a system with 32 GB of RAM we issued 24 GB of IO through 24 jobs on a PMEM
device.

fio --directory=/mnt --name=test --bs=4k --iodepth=1024 --rw=randwrite \
--ioengine=io_uring --time_based=1 --runtime=120 --numjobs=24 --size=1G \
--direct=1 --eta-interval=1 --eta-newline=1 --group_reporting

The results below show direct IO exceeding buffered IO by a big margin.

BW (MiB/s)         buffered  dontcache  %improvement  direct  %improvement
randwrite (bs=4k)      3393       5397        59.06%    9315       174.53%

> IMO, what we need writeback to be optimised for is asynchronous IO
> dispatch through each filesystem; our writeback IOPS problems in
> XFS largely stem from the per-IO CPU overhead of block allocation in
> the filesystem (i.e. delayed allocation).

This is a good idea, but it means we will not be able to parallelize within
an AG. I will spend some time building a POC with a per-AG writeback context
and compare it against per-CPU writeback on both performance and extent
fragmentation. Other filesystems that use delayed allocation will also need a
similar scheme.
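
For reference, below is a userspace sketch of the per-AG routing idea; the
wb_ctx structure, the agno_shift parameter and the modulo mapping are
hypothetical stand-ins for what XFS would derive from its real geometry
(XFS_INO_TO_AGNO()) and writeback infrastructure:

/*
 * Illustration only: route a dirty inode to a writeback context by its
 * allocation group number, so inodes in different AGs can be written
 * back in parallel while each AG stays single-threaded.
 */
#include <stdint.h>
#include <stdio.h>

struct wb_ctx {
	unsigned int id;		/* which writeback worker/context */
	unsigned long nr_inodes;	/* dirty inodes queued to it */
};

/* Pick a context: AG number modulo the number of contexts. */
static struct wb_ctx *wb_ctx_for_inode(struct wb_ctx *ctxs, unsigned int nr_ctx,
					uint64_t ino, unsigned int agno_shift)
{
	uint64_t agno = ino >> agno_shift;	/* stand-in for XFS_INO_TO_AGNO() */

	return &ctxs[agno % nr_ctx];
}

int main(void)
{
	struct wb_ctx ctxs[4] = {
		{ .id = 0 }, { .id = 1 }, { .id = 2 }, { .id = 3 },
	};
	/* Assume 2^32 inode numbers per AG purely for the example. */
	unsigned int agno_shift = 32;
	uint64_t inos[] = { 0x1000, 0x100000000ULL, 0x300000042ULL };

	for (unsigned int i = 0; i < sizeof(inos) / sizeof(inos[0]); i++) {
		struct wb_ctx *ctx = wb_ctx_for_inode(ctxs, 4, inos[i],
						      agno_shift);

		ctx->nr_inodes++;
		printf("ino 0x%llx -> wb ctx %u\n",
		       (unsigned long long)inos[i], ctx->id);
	}
	return 0;
}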





