On Tue, Feb 04, 2025 at 01:50:08PM +1100, Dave Chinner wrote:
> I doubt that will create enough concurrency for a typical small
> server or desktop machine that only has a single NUMA node but has a
> couple of fast nvme SSDs in it.
>
> > 2) Fixed number of writeback contexts, say min(10, numcpu).
> > 3) NUMCPU/N number of writeback contexts.
>
> These don't take into account the concurrency available from
> the underlying filesystem or storage.
>
> That's the point I was making - CPU count has -zero- relationship to
> the concurrency the filesystem and/or storage provide the system. It
> is fundamentally incorrect to base decisions about IO concurrency on
> the number of CPU cores in the system.

Yes.  But as mentioned in my initial reply, there is a use case for
more writeback threads than fs writeback contexts: when the writeback
threads do CPU-intensive work like compression.  Being able to do that
from the normal writeback threads instead of forking out to fs-level
threads would simplify the btrfs code a lot.  Not really interesting
for XFS right now, of course.

Or in other words: fs / device geometry really should be the main
driver, but if a file system supports compression (or really expensive
data checksums), being able to scale up the number of threads per
context might still make sense.  But that's really the advanced part;
we'll need to get the fs-geometry-aligned approach to work first.

> > > XFS largely stem from the per-IO cpu overhead of block allocation in
> > > the filesystems (i.e. delayed allocation).
> >
> > This is a good idea, but it means we will not be able to paralellize within
> > an AG.
>
> The XFS allocation concurrency model scales out across AGs, not
> within AGs. If we need writeback concurrency within an AG (e.g. for
> pure overwrite concurrency) then we want async dispatch worker
> threads per writeback queue, not multiple writeback queues per
> AG....

Unless the computational work mentioned above is involved, there would
be something really wrong if we're saturating a CPU per AG.
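
To make the "geometry decides the context count, CPU count only caps
the threads per context for CPU-heavy writeback" split concrete, here
is a rough userspace sketch.  None of this matches existing kernel
code; all the names (wb_geometry, wb_plan, wb_plan_for) are made up,
it's just the policy I have in mind:

/*
 * Sketch only: context count comes from fs geometry (e.g. AG count),
 * and the per-context worker count is scaled up only when writeback
 * itself burns CPU (compression, expensive data checksums).
 */
#include <stdbool.h>
#include <stdio.h>

struct wb_geometry {
	unsigned int	fs_concurrency;		/* e.g. XFS AG count */
	bool		cpu_heavy_writeback;	/* compression / costly csums */
	unsigned int	online_cpus;
};

struct wb_plan {
	unsigned int	nr_contexts;		/* one per fs allocation domain */
	unsigned int	threads_per_context;	/* > 1 only for CPU-heavy fs */
};

static struct wb_plan wb_plan_for(const struct wb_geometry *geo)
{
	struct wb_plan plan = {
		/* geometry, not CPU count, decides the context count */
		.nr_contexts = geo->fs_concurrency ? geo->fs_concurrency : 1,
		.threads_per_context = 1,
	};

	/*
	 * Only when writeback is CPU bound does it make sense to put
	 * more than one thread behind a context, and only then does
	 * the CPU count become a reasonable cap.
	 */
	if (geo->cpu_heavy_writeback && geo->online_cpus > plan.nr_contexts)
		plan.threads_per_context = geo->online_cpus / plan.nr_contexts;

	return plan;
}

int main(void)
{
	struct wb_geometry xfs   = { .fs_concurrency = 4, .online_cpus = 32 };
	struct wb_geometry btrfs = { .fs_concurrency = 2,
				     .cpu_heavy_writeback = true,
				     .online_cpus = 32 };
	struct wb_plan p;

	p = wb_plan_for(&xfs);
	printf("xfs-like:   %u contexts x %u threads\n",
	       p.nr_contexts, p.threads_per_context);

	p = wb_plan_for(&btrfs);
	printf("btrfs-like: %u contexts x %u threads\n",
	       p.nr_contexts, p.threads_per_context);
	return 0;
}

The point being that the second knob stays at 1 for a filesystem like
XFS and only opens up for the compression case, so the common path is
still purely geometry driven.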