On Wed, Jan 29, 2025 at 03:56:27PM +0530, Kundan Kumar wrote:
> Greetings everyone,
>
> Anuj and I would like to present our proposal for implementing parallel
> writeback at LSF/MM. This approach could address the writeback performance
> issue discussed by Luis last year [1].
>
> Currently, the pagecache writeback is executed in a single-threaded manner.
> Inodes are added to the dirty list, and the delayed writeback is scheduled.
> The single-threaded writeback then iterates through the dirty list and performs
> the writeback operations.
>
> We have developed a prototype implementation of parallel writeback, where we
> have made the writeback process per-CPU-based. The b_io, b_dirty, b_dirty_time,
> and b_more_io lists have also been modified to be per-CPU. When an inode needs
> to be added to the b_dirty list, we select the next CPU (in a round-robin
> fashion) and schedule the per-CPU writeback work on the selected CPU.

My own experiments and experience tell me that per-cpu is not the
right architecture for expanding writeback concurrency. The actual
writeback parallelism that can be achieved is determined by the
underlying filesystem architecture and geometry, not the CPU layout
of the machine.

I think some background info on writeback concurrency is also
necessary before we go any further.

If we go back many years to before we had the current write
throttling architecture, we go back to an age where
balance_dirty_pages() throttled by submitting foreground IO itself.
It was throttled by the block layer submission queue running out of
slots rather than any specific performance measurement. Under memory
pressure, this led to highly concurrent dispatch of dirty pages, but
very poor selection of dirty inodes/pages to dispatch.

The result was that this foreground submission would often dispatch
single page IOs, and issue tens of them concurrently across all user
processes that were trying to write data. IOWs, throttling caused
"IO breakdown" where the IO pattern ended up being small random IO
exactly when we needed to perform bulk cleaning of dirty pages. We
needed bulk page cleaning throughput, not random dirty page
shootdown.

When we moved to the "writeback is only done by the single BDI
flusher thread" architecture, we significantly improved system
behaviour under heavy load because writeback now had a tight focus
on issuing IO from a single inode at a time as efficiently as
possible. i.e. when we had large sequential dirty file regions, the
writeback IO was all large sequential IO, and it didn't get chopped
up by another hundred processes issuing IO from random other inodes
at the same time.

IOWs, having too much parallelism in writeback for the underlying
storage and/or filesystem can be far more harmful to system
performance under load than having too little parallelism to drive
the filesystem/hardware to its maximum performance.

This is an underlying constraint that everyone needs to understand
before discussing writeback concurrency changes: excessive writeback
concurrency can result in worse outcomes than having no writeback
concurrency under sustained heavy production loads....

> With the per-CPU threads handling the writeback, we have observed a significant
> increase in IOPS. Here is a test and comparison between the older writeback and
> the newer parallel writeback, using XFS filesystem:
> https://github.com/kundanthebest/parallel_writeback/blob/main/README.md

What kernel is that from?
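
(As an aside, my reading of the round-robin dispatch described at the
top of the quoted proposal is roughly the sketch below. Every struct
and function name here is my own invention for illustration - this is
not the actual patch code:)

	#include <linux/fs.h>
	#include <linux/cpumask.h>
	#include <linux/percpu.h>
	#include <linux/workqueue.h>
	#include <linux/writeback.h>
	#include <linux/backing-dev.h>

	/* hypothetical per-CPU writeback state (lock/list init omitted) */
	struct wb_cpu_state {
		spinlock_t		list_lock;
		struct list_head	b_dirty;
		struct delayed_work	dwork;
	};
	static DEFINE_PER_CPU(struct wb_cpu_state, wb_cpu);
	static int next_wb_cpu;

	static void wb_queue_dirty_inode(struct inode *inode)
	{
		struct wb_cpu_state *wb;
		int cpu;

		/* pick the next online CPU in round-robin order (races ignored) */
		cpu = cpumask_next(next_wb_cpu, cpu_online_mask);
		if (cpu >= nr_cpu_ids)
			cpu = cpumask_first(cpu_online_mask);
		next_wb_cpu = cpu;

		/* add the inode to that CPU's dirty list... */
		wb = &per_cpu(wb_cpu, cpu);
		spin_lock(&wb->list_lock);
		list_add_tail(&inode->i_io_list, &wb->b_dirty);
		spin_unlock(&wb->list_lock);

		/*
		 * ...and kick a flusher work for that CPU after the usual
		 * delay (hand-waving the workqueue details here).
		 */
		queue_delayed_work_on(cpu, bdi_wq, &wb->dwork,
				msecs_to_jiffies(dirty_writeback_interval * 10));
	}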
> In our tests, we have found that with this implementation the buffered I/O
> IOPS surpasses the DIRECT I/O. We have done very limited testing with NVMe
> (Optane) and PMEM.

Buffered IO often goes faster than direct IO for these workloads
because the buffered IO is aggregated in memory into much larger,
less random sequential IOs at writeback time. Writing the data in
fewer, larger, better ordered IOs is almost always faster (at both
the filesystem and storage hardware layers) than a pure 4kB random
IO submission pattern. i.e. with enough RAM, this random write
workload using buffered IO is pretty much guaranteed to outperform
direct IO regardless of the underlying writeback concurrency.

> There are a few items that we would like to discuss:
>
> During the implementation, we noticed several ways in which the writeback IOs
> are throttled:
> A) balance_dirty_pages
> B) writeback_chunk_size
> C) wb_over_bg_thresh
> Additionally, there are delayed writeback executions in the form of
> dirty_writeback_centisecs and dirty_expire_centisecs.
>
> With the introduction of per-CPU writeback, we need to determine the
> appropriate way to adjust the throttling and delayed writeback settings.

Yup, that's kinda my point - residency of the dirty data set in the
page cache influences throughput in a workload like this more than
writeback parallelism does.

> Lock contention:
> We have observed that the writeback list_lock and the inode->i_lock are the
> locks that are most likely to cause contention when we make the thread per-CPU.

You clearly aren't driving the system hard enough with small IOs. :)

Once you load up the system with lots of small files (e.g. with
fsmark to create millions of 4kB files concurrently) and use
flush-on-close writeback semantics to drive writeback IO submission
concurrency, the profiles are typically dominated by folio
allocation and freeing in the page cache and by XFS journal free
space management (i.e. transaction reservation management around
timestamp updates, block allocation and unwritten extent
conversion).

That is, writeback parallelism in XFS is determined by the
underlying allocation parallelism of the filesystem and the journal
size/throughput, because it uses delayed allocation (same goes for
ext4 and btrfs). In the case of XFS, allocation parallelism is
determined by the filesystem geometry - the number of allocation
groups created at mkfs time. So if you have 4 allocation groups
(AGs) in XFS (the typical default for small filesystems), then you
can only be doing 4 extent allocation operations at once. Hence it
doesn't matter how many concurrent writeback contexts the system
has, the filesystem will limit concurrency in many workloads to just
4. i.e. the filesystem geometry determines writeback concurrency,
not the number of CPUs in the system.

Not every writeback will require allocation in XFS (speculative
delalloc beyond EOF greatly reduces this overhead), but when you
have hundreds of thousands of small files that each require
allocation (e.g. cloning or untarring a large source tree), then
writeback has a 1:1 ratio between per-inode writeback and extent
allocation. This is the case where filesystem writeback concurrency
really matters...

> Our next task will be to convert the writeback list_lock to a per-CPU lock.
> Another potential improvement could be to assign a specific CPU to an inode.

As I mentioned above, per-cpu writeback doesn't seem like the right
architecture to me.
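
To make that geometry point concrete: the AG an inode lives in falls
straight out of its inode number, so the ceiling on concurrent extent
allocation during writeback is whatever agcount was at mkfs time
(e.g. mkfs.xfs -d agcount=4), not the number of CPUs queueing IO.
Roughly (illustration only, not code from any real patch):

	/*
	 * XFS_INO_TO_AGNO() extracts the AG from the inode number, so
	 * data extent allocations for this inode during writeback
	 * serialise on that one AG, and at most mp->m_sb.sb_agcount
	 * allocations can be running at once across the filesystem.
	 */
	static xfs_agnumber_t
	xfs_inode_agno(struct xfs_mount *mp, struct xfs_inode *ip)
	{
		return XFS_INO_TO_AGNO(mp, ip->i_ino);	/* 0 .. sb_agcount - 1 */
	}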
IMO, what we need writeback to be optimised for is asynchronous IO
dispatch through each filesystem; our writeback IOPS problems in XFS
largely stem from the per-IO CPU overhead of block allocation in the
filesystem (i.e. delayed allocation).

Because of the way we do allocation locality in XFS, we always try
to allocate data extents in the same allocation group the inode is
located in. Hence if we have a writeback context per AG, and we
always know what AG an inode is located in, we always know which
writeback context it should belong to.

Hence what a filesystem like XFS needs is the ability to say to the
writeback infrastructure "I need N writeback contexts" and then have
the ability to control which writeback context an inode is attached
to. A writeback context would be an entirely self-contained object
with dirty lists, locks to protect the lists, a (set of) worker
threads that does the writeback dispatch, etc.

Then the VFS will naturally write back all the inodes in a given AG
independently of the inodes in any other AG, and there will be
almost no serialisation on allocation or IO submission between
writeback contexts. And there won't be any excessive writeback
concurrency at the filesystem layer, so contention problems during
writeback go away, too.

IOWs, for XFS we really don't want CPU-aligned writeback lists, we
want filesystem-geometry-aligned writeback contexts...
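
As a strawman, the shape of the interface I'm talking about might
look something like the sketch below. None of these structures or
hooks exist today - the names are invented purely to show that the
context is fully self-contained and that the filesystem decides how
many contexts there are and which one an inode lands in:

	/* strawman only - not an existing kernel API */
	struct writeback_ctx {
		spinlock_t		list_lock;	/* protects the lists below */
		struct list_head	b_dirty;
		struct list_head	b_io;
		struct list_head	b_more_io;
		struct delayed_work	dwork;		/* this context's flusher work */
	};

	/*
	 * At mount time the filesystem asks the VFS for one context per
	 * AG, e.g.:
	 *
	 *	sb->s_wb_ctx = alloc_writeback_ctxs(sb, mp->m_sb.sb_agcount);
	 *
	 * and supplies a hook so the VFS can route each dirty inode to
	 * the context for the AG it lives in:
	 */
	static struct writeback_ctx *
	xfs_inode_to_wb_ctx(struct super_block *sb, struct inode *inode)
	{
		struct xfs_mount	*mp = XFS_M(sb);
		xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, XFS_I(inode)->i_ino);

		return &sb->s_wb_ctx[agno];
	}

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx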