On Fri, Jan 31, 2025 at 03:02:09PM +0530, Kundan Kumar wrote:
> > IOWs, having too much parallelism in writeback for the underlying
> > storage and/or filesystem can be far more harmful to system
> > performance under load than having too little parallelism to drive
> > the filesystem/hardware to its maximum performance.
>
> With the increasing speed of devices we would like to improve the
> performance of buffered IO as well. This will help applications
> (DB, AI/ML) that use buffered I/O. If more parallelism is causing
> side effects, we can reduce it using some factor like:
> 1) writeback context per NUMA node.

I doubt that will create enough concurrency for a typical small
server or desktop machine that only has a single NUMA node but has
a couple of fast NVMe SSDs in it.

> 2) Fixed number of writeback contexts, say min(10, numcpu).
> 3) NUMCPU/N number of writeback contexts.

These don't take into account the concurrency available from the
underlying filesystem or storage.

That's the point I was making - CPU count has -zero- relationship
to the concurrency the filesystem and/or storage provide the
system. It is fundamentally incorrect to base decisions about IO
concurrency on the number of CPU cores in the system. Same goes for
using NUMA nodes, a fixed minimum, etc.

> 4) Writeback context based on FS geometry, like per AG for XFS, as
> per your suggestion.

Leave the concurrency and queue affinity selection up to the
filesystem. If the filesystem doesn't specify anything, then we
remain with a single queue and writeback thread.

> > i.e. with enough RAM, this random write workload using buffered IO
> > is pretty much guaranteed to outperform direct IO regardless of the
> > underlying writeback concurrency.
>
> We tested making sure RAM was available for both buffered and direct
> IO. On a system with 32GB RAM we issued 24GB of IO through 24 jobs
> on a PMEM device.
>
> fio --directory=/mnt --name=test --bs=4k --iodepth=1024 --rw=randwrite \
>     --ioengine=io_uring --time_based=1 --runtime=120 --numjobs=24 --size=1G \
>     --direct=1 --eta-interval=1 --eta-newline=1 --group_reporting
>
> The results show direct IO exceeding buffered IO by a big margin.
>
> BW (MiB/s)         buffered  dontcache  %improvement  direct  %improvement
> randwrite (bs=4k)  3393      5397       59.06%        9315    174.53%

DIO is faster on PMEM because it only copies the data once (i.e.
user to PMEM) whilst buffered IO copies the data twice (user to
page cache to PMEM). The IO has exactly the same CPU overhead
regardless of in-memory page cache aggregation for buffered IO
because IO CPU usage is dominated by the data copy overhead.
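To make the copy-once vs copy-twice point concrete, here is a
minimal userspace sketch of the two write paths being compared; the
/mnt file names are placeholders and the target filesystem/device
must support O_DIRECT:

/*
 * Sketch of a buffered write vs an O_DIRECT write of the same 4k
 * buffer.  The buffered path copies user -> page cache -> device;
 * the direct path copies user -> device once, but needs an aligned
 * buffer.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	void *buf;

	if (posix_memalign(&buf, 4096, 4096))	/* O_DIRECT needs alignment */
		return 1;
	memset(buf, 0xab, 4096);

	/* Buffered: data is copied into the page cache; writeback
	 * copies it out to the device later (two copies in total). */
	int bfd = open("/mnt/buffered.dat", O_CREAT | O_WRONLY, 0644);
	if (bfd < 0 || write(bfd, buf, 4096) != 4096 || fsync(bfd))
		perror("buffered write");
	close(bfd);

	/* Direct: the page cache is bypassed; data goes straight from
	 * the user buffer to the device (one copy). */
	int dfd = open("/mnt/direct.dat", O_CREAT | O_WRONLY | O_DIRECT, 0644);
	if (dfd < 0 || write(dfd, buf, 4096) != 4096)
		perror("direct write");
	close(dfd);

	free(buf);
	return 0;
}

The direct write moves data straight from the user buffer to the
device, while the buffered write stages it in the page cache for
writeback to copy out again later; on PMEM that second copy is pure
CPU and memory bandwidth cost, which is why buffered IO trails in
the numbers above.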
Throughput benchmarks on PMEM are -always- CPU and memory bandwidth
bound doing synchronous data copying. PMEM benchmarks may respond
to CPU count based optimisations, but that should not be taken as
"infrastructure needs to be CPU-count optimised".

This is why it is fundamentally wrong to use PMEM for IO
infrastructure optimisation: the behaviour it displays is
completely different to pretty much all other types of IO device we
support. PMEM is a niche, and it performs very poorly compared to a
handful of NVMe SSDs being driven by the same CPUs...

> > IMO, what we need writeback to be optimised for is asynchronous IO
> > dispatch through each filesystem; our writeback IOPS problems in
> > XFS largely stem from the per-IO CPU overhead of block allocation
> > in the filesystems (i.e. delayed allocation).
>
> This is a good idea, but it means we will not be able to parallelize
> within an AG.

The XFS allocation concurrency model scales out across AGs, not
within AGs. If we need writeback concurrency within an AG (e.g. for
pure overwrite concurrency) then we want async dispatch worker
threads per writeback queue, not multiple writeback queues per
AG....

> I will spend some time to build a POC with a per-AG writeback
> context, and compare it with per-cpu writeback performance and
> extent fragmentation. Other filesystems using delayed allocation
> will also need a similar scheme.

The two are not mutually exclusive like you seem to think they are.

If you want to optimise your filesystem geometry for CPU count
based concurrency, then on XFS all you need to do is this:

# mkfs.xfs -d concurrency=nr_cpus /dev/pmem0

mkfs will then create a geometry that has sufficient concurrency
for the number of CPUs in the host system.

Hence we end up with -all- in-memory and on-disk filesystem
structures on that PMEM device being CPU count optimised, not just
writeback. Writeback is just a small part of a much larger system,
and we already optimise filesystem and IO concurrency via
filesystem geometry. Writeback needs to work within those same
controls we already provide users with.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx