On Wed, Jan 29, 2025 at 03:56:27PM +0530, Kundan Kumar wrote:
> Greetings everyone,
>
> Anuj and I would like to present our proposal for implementing parallel
> writeback at LSF/MM. This approach could address the writeback performance
> issue discussed by Luis last year [1].
>
> Currently, the pagecache writeback is executed in a single-threaded manner.
> Inodes are added to the dirty list, and the delayed writeback is scheduled.
> The single-threaded writeback then iterates through the dirty list and performs
> the writeback operations.
>
> We have developed a prototype implementation of parallel writeback, where we
> have made the writeback process per-CPU-based. The b_io, b_dirty, b_dirty_time,
> and b_more_io lists have also been modified to be per-CPU. When an inode needs
> to be added to the b_dirty list, we select the next CPU (in a round-robin
> fashion) and schedule the per-CPU writeback work on the selected CPU.

My own experiments and experience tell me that per-cpu is not the
right architecture for expanding writeback concurrency. The actual
writeback parallelism that can be achieved is determined by the
underlying filesystem architecture and geometry, not the CPU layout
of the machine.

I think some background info on writeback concurrency is also
necessary before we go any further.

If we go back many years to before we had the current write
throttling architecture, we go back to an age where
balance_dirty_pages() throttled by submitting foreground IO itself.
It was throttled by the block layer submission queue running out of
slots rather than any specific performance measurement. Under memory
pressure, this led to highly concurrent dispatch of dirty pages, but
very poor selection of dirty inodes/pages to dispatch.

The result was that this foreground submission would often dispatch
single page IOs, and issue tens of them concurrently across all user
processes that were trying to write data. IOWs, throttling caused
"IO breakdown" where the IO pattern ended up being small random IO
exactly when we needed to perform bulk cleaning of dirty pages. We
needed bulk page cleaning throughput, not random dirty page
shootdown.

When we moved to the "writeback is only done by the single BDI
flusher thread" architecture, we significantly improved system
behaviour under heavy load because writeback now had a tight focus
on issuing IO from a single inode at a time as efficiently as
possible. i.e. when we had large sequential dirty file regions, the
writeback IO was all large sequential IO, and it didn't get chopped
up by another hundred processes issuing IO from random other inodes
at the same time.

IOWs, having too much parallelism in writeback for the underlying
storage and/or filesystem can be far more harmful to system
performance under load than having too little parallelism to drive
the filesystem/hardware to its maximum performance.

This is an underlying constraint that everyone needs to understand
before discussing writeback concurrency changes: excessive writeback
concurrency can result in worse outcomes than having no writeback
concurrency under sustained heavy production loads....

> With the per-CPU threads handling the writeback, we have observed a significant
> increase in IOPS. Here is a test and comparison between the older writeback and
> the newer parallel writeback, using XFS filesystem:
> https://github.com/kundanthebest/parallel_writeback/blob/main/README.md

What kernel is that from?
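
(As an aside, my reading of the round-robin dispatch described at the
top of the quoted proposal is roughly the sketch below. Every struct
and function name here is my own invention for illustration - this is
not the actual patch code:)

	#include <linux/fs.h>
	#include <linux/cpumask.h>
	#include <linux/percpu.h>
	#include <linux/workqueue.h>
	#include <linux/writeback.h>
	#include <linux/backing-dev.h>

	/* hypothetical per-CPU writeback state (lock/list init omitted) */
	struct wb_cpu_state {
		spinlock_t		list_lock;
		struct list_head	b_dirty;
		struct delayed_work	dwork;
	};
	static DEFINE_PER_CPU(struct wb_cpu_state, wb_cpu);
	static int next_wb_cpu;

	static void wb_queue_dirty_inode(struct inode *inode)
	{
		struct wb_cpu_state *wb;
		int cpu;

		/* pick the next online CPU in round-robin order (races ignored) */
		cpu = cpumask_next(next_wb_cpu, cpu_online_mask);
		if (cpu >= nr_cpu_ids)
			cpu = cpumask_first(cpu_online_mask);
		next_wb_cpu = cpu;

		/* add the inode to that CPU's dirty list... */
		wb = &per_cpu(wb_cpu, cpu);
		spin_lock(&wb->list_lock);
		list_add_tail(&inode->i_io_list, &wb->b_dirty);
		spin_unlock(&wb->list_lock);

		/*
		 * ...and kick a flusher work for that CPU after the usual
		 * delay (hand-waving the workqueue details here).
		 */
		queue_delayed_work_on(cpu, bdi_wq, &wb->dwork,
				msecs_to_jiffies(dirty_writeback_interval * 10));
	}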
> In our tests, we have found that with this implementation the buffered I/O
> IOPS surpasses the DIRECT I/O. We have done very limited testing with NVMe
> (Optane) and PMEM.

Buffered IO often goes faster than direct IO for these workloads
because the buffered IO is aggregated in memory into much larger,
less random sequential IOs at writeback time. Writing the data in
fewer, larger, better ordered IOs is almost always faster (at both
the filesystem and storage hardware layers) than a pure 4kB random
IO submission pattern. i.e. with enough RAM, this random write
workload using buffered IO is pretty much guaranteed to outperform
direct IO regardless of the underlying writeback concurrency.

> There are a few items that we would like to discuss:
>
> During the implementation, we noticed several ways in which the writeback IOs
> are throttled:
> A) balance_dirty_pages
> B) writeback_chunk_size
> C) wb_over_bg_thresh
> Additionally, there are delayed writeback executions in the form of
> dirty_writeback_centisecs and dirty_expire_centisecs.
>
> With the introduction of per-CPU writeback, we need to determine the
> appropriate way to adjust the throttling and delayed writeback settings.

Yup, that's kinda my point - residency of the dirty data set in the
page cache influences throughput in a workload like this more than
writeback parallelism does.

> Lock contention:
> We have observed that the writeback list_lock and the inode->i_lock are the
> locks that are most likely to cause contention when we make the thread per-CPU.

You clearly aren't driving the system hard enough with small IOs. :)

Once you load up the system with lots of small files (e.g. with
fsmark to create millions of 4kB files concurrently) and use
flush-on-close writeback semantics to drive writeback IO submission
concurrency, the profiles are typically dominated by folio
allocation and freeing in the page cache and by XFS journal free
space management (i.e. transaction reservation management around
timestamp updates, block allocation and unwritten extent
conversion).

That is, writeback parallelism in XFS is determined by the
underlying allocation parallelism of the filesystem and the journal
size/throughput, because it uses delayed allocation (same goes for
ext4 and btrfs). In the case of XFS, allocation parallelism is
determined by the filesystem geometry - the number of allocation
groups created at mkfs time. So if you have 4 allocation groups
(AGs) in XFS (the typical default for small filesystems), then you
can only be doing 4 extent allocation operations at once. Hence it
doesn't matter how many concurrent writeback contexts the system
has, the filesystem will limit concurrency in many workloads to just
4. i.e. the filesystem geometry determines writeback concurrency,
not the number of CPUs in the system.

Not every writeback will require allocation in XFS (speculative
delalloc beyond EOF greatly reduces this overhead), but when you
have hundreds of thousands of small files that each require
allocation (e.g. cloning or untarring a large source tree), then
writeback has a 1:1 ratio between per-inode writeback and extent
allocation. This is the case where filesystem writeback concurrency
really matters...

> Our next task will be to convert the writeback list_lock to a per-CPU lock.
> Another potential improvement could be to assign a specific CPU to an inode.

As I mentioned above, per-cpu writeback doesn't seem like the right
architecture to me.
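
To make that geometry point concrete: the AG an inode lives in falls
straight out of its inode number, so the ceiling on concurrent extent
allocation during writeback is whatever agcount was at mkfs time
(e.g. mkfs.xfs -d agcount=4), not the number of CPUs queueing IO.
Roughly (illustration only, not code from any real patch):

	/*
	 * XFS_INO_TO_AGNO() extracts the AG from the inode number, so
	 * data extent allocations for this inode during writeback
	 * serialise on that one AG, and at most mp->m_sb.sb_agcount
	 * allocations can be running at once across the filesystem.
	 */
	static xfs_agnumber_t
	xfs_inode_agno(struct xfs_mount *mp, struct xfs_inode *ip)
	{
		return XFS_INO_TO_AGNO(mp, ip->i_ino);	/* 0 .. sb_agcount - 1 */
	}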
IMO, what we need writeback to be optimised for is asynchronous IO
dispatch through each filesystem; our writeback IOPS problems in XFS
largely stem from the per-IO CPU overhead of block allocation in the
filesystem (i.e. delayed allocation).

Because of the way we do allocation locality in XFS, we always try
to allocate data extents in the same allocation group the inode is
located in. Hence if we have a writeback context per AG, and we
always know what AG an inode is located in, we always know which
writeback context it should belong to.

Hence what a filesystem like XFS needs is the ability to say to the
writeback infrastructure "I need N writeback contexts" and then have
the ability to control which writeback context an inode is attached
to. A writeback context would be an entirely self-contained object
with dirty lists, locks to protect the lists, a (set of) worker
threads that does the writeback dispatch, etc.

Then the VFS will naturally write back all the inodes in a given AG
independently of the inodes in any other AG, and there will be
almost no serialisation on allocation or IO submission between
writeback contexts. And there won't be any excessive writeback
concurrency at the filesystem layer, so contention problems during
writeback go away, too.

IOWs, for XFS we really don't want CPU-aligned writeback lists, we
want filesystem-geometry-aligned writeback contexts...
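
As a strawman, the shape of the interface I'm talking about might
look something like the sketch below. None of these structures or
hooks exist today - the names are invented purely to show that the
context is fully self-contained and that the filesystem decides how
many contexts there are and which one an inode lands in:

	/* strawman only - not an existing kernel API */
	struct writeback_ctx {
		spinlock_t		list_lock;	/* protects the lists below */
		struct list_head	b_dirty;
		struct list_head	b_io;
		struct list_head	b_more_io;
		struct delayed_work	dwork;		/* this context's flusher work */
	};

	/*
	 * At mount time the filesystem asks the VFS for one context per
	 * AG, e.g.:
	 *
	 *	sb->s_wb_ctx = alloc_writeback_ctxs(sb, mp->m_sb.sb_agcount);
	 *
	 * and supplies a hook so the VFS can route each dirty inode to
	 * the context for the AG it lives in:
	 */
	static struct writeback_ctx *
	xfs_inode_to_wb_ctx(struct super_block *sb, struct inode *inode)
	{
		struct xfs_mount	*mp = XFS_M(sb);
		xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, XFS_I(inode)->i_ino);

		return &sb->s_wb_ctx[agno];
	}

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx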