Re: [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback

On Tue, Feb 04, 2025 at 06:06:42AM +0100, Christoph Hellwig wrote:
> On Tue, Feb 04, 2025 at 01:50:08PM +1100, Dave Chinner wrote:
> > I doubt that will create enough concurrency for a typical small
> > server or desktop machine that only has a single NUMA node but has a
> > couple of fast nvme SSDs in it.
> > 
> > > 2) Fixed number of writeback contexts, say min(10, numcpu).
> > > 3) NUMCPU/N number of writeback contexts.
> > 
> > These don't take into account the concurrency available from
> > the underlying filesystem or storage.
> > 
> > That's the point I was making - CPU count has -zero- relationship to
> > the concurrency the filesystem and/or storage provide the system. It
> > is fundamentally incorrect to base decisions about IO concurrency on
> > the number of CPU cores in the system.
> 
> Yes.  But as mentioned in my initial reply there is a use case for more
> WB threads than fs writeback contexts, which is when the writeback

I understand that - there's more than one reason we may want
multiple IO dispatch threads per writeback context.

> threads do CPU intensive work like compression.  Being able to do that
> from normal writeback threads vs forking out to fs-level threads
> would really simplify the btrfs code a lot.  Not really interesting for
> XFS right now of course.
> 
> Or in other words: fs / device geometry really should be the main
> driver, but if a file system supports compression (or really expensive
> data checksums) being able to scale up the number of threads per
> context might still make sense.  But that's really the advanced part,
> we'll need to get the fs-geometry-aligned writeback working first.
> 
> > > > XFS largely stem from the per-IO cpu overhead of block allocation in
> > > > the filesystems (i.e. delayed allocation).
> > > 
> > > This is a good idea, but it means we will not be able to parallelize within
> > > an AG.
> > 
> > The XFS allocation concurrency model scales out across AGs, not
> > within AGs. If we need writeback concurrency within an AG (e.g. for
> > pure overwrite concurrency) then we want async dispatch worker
> > threads per writeback queue, not multiple writeback queues per
> > AG....
> 
> Unless the computational work mentioned above is involved, there would be
> something really wrong if we're saturating a CPU per AG.

... and the common case I'm thinking of here is stalling writeback
on the AGF lock, i.e. another non-writeback task holding the AGF
lock (inode cluster allocation, directory block allocation, freeing
space in that AG, etc.) will stall any writeback that requires
allocation in that AG if we don't have multiple dispatch threads per
writeback context available.
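
To make that concrete, here's a minimal, purely hypothetical sketch of
what multiple dispatch workers per writeback context could look like.
The wb_ctx, wb_ioend, wb_dispatch_fn and wb_ctx_init names are made up
for illustration, and a plain mutex stands in for the real AGF buffer
lock; the only real kernel interfaces used are the generic workqueue
and mutex APIs - nothing below exists in the tree today.

/*
 * Hypothetical sketch only: wb_ctx, wb_ioend, wb_dispatch_fn and
 * wb_ctx_init are made-up names, and the mutex stands in for the
 * real AGF buffer lock.  Only the generic workqueue/mutex/slab APIs
 * are real kernel interfaces.
 */
#include <linux/workqueue.h>
#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/errno.h>

/* One writeback context, e.g. per AG, with its own dispatch workqueue. */
struct wb_ctx {
	struct workqueue_struct	*dispatch_wq;	/* max_active > 1 */
	struct mutex		*agf_lock;	/* from the per-AG structure, not shown */
};

/* One unit of dispatch work: a chunk of dirty data needing allocation. */
struct wb_ioend {
	struct work_struct	work;
	struct wb_ctx		*ctx;
};

static void wb_dispatch_fn(struct work_struct *work)
{
	struct wb_ioend *io = container_of(work, struct wb_ioend, work);

	/*
	 * If another task (inode cluster allocation, directory block
	 * allocation, freeing space, ...) holds the AG lock, only this
	 * work item blocks here; other dispatch items queued on the
	 * same context keep running because max_active > 1.
	 */
	mutex_lock(io->ctx->agf_lock);
	/* ... allocate blocks and submit the IO ... */
	mutex_unlock(io->ctx->agf_lock);
	kfree(io);
}

static int wb_ctx_init(struct wb_ctx *ctx, int nr_dispatch, int agno)
{
	ctx->dispatch_wq = alloc_workqueue("wb-ag%d",
			WQ_UNBOUND | WQ_MEM_RECLAIM, nr_dispatch, agno);
	return ctx->dispatch_wq ? 0 : -ENOMEM;
}

Dispatch would then just be INIT_WORK(&io->work, wb_dispatch_fn) and
queue_work(ctx->dispatch_wq, &io->work) per chunk of writeback that
needs allocation in that AG.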

That, IMO, is a future optimisation; just moving to a writeback
context per AG would solve a lot of the current "single thread CPU
bound" writeback throughput limitations we have right now,
especially when it comes to writing lots of small files across
many directories in a very short period of time (e.g. extracting
tarballs, rsync, recursive dir copies).
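
Again purely as a sketch of the sizing policy - nothing below exists
today, and it reuses the made-up wb_ctx/wb_ctx_init names from the
sketch above: the writeback context count comes from the filesystem's
allocation geometry (one context per AG), not from the CPU count, with
a single dispatch worker per context to start with.

static struct wb_ctx *wb_alloc_per_ag_contexts(unsigned int agcount)
{
	struct wb_ctx *ctxs;
	unsigned int agno;

	/* one writeback context per AG, not per CPU */
	ctxs = kcalloc(agcount, sizeof(*ctxs), GFP_KERNEL);
	if (!ctxs)
		return NULL;

	for (agno = 0; agno < agcount; agno++) {
		/* start with a single dispatch worker per context */
		if (wb_ctx_init(&ctxs[agno], 1, agno))
			goto out_free;
	}
	return ctxs;

out_free:
	while (agno--)
		destroy_workqueue(ctxs[agno].dispatch_wq);
	kfree(ctxs);
	return NULL;
}

For XFS the agcount input would come from sb_agcount; other
filesystems would report however many independent allocation domains
they have.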

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
