Greetings everyone,

Anuj and I would like to present our proposal for implementing parallel
writeback at LSF/MM. This approach could address the writeback
performance issue discussed by Luis last year [1].

Currently, the pagecache writeback is executed in a single-threaded
manner. Inodes are added to the dirty list, and the delayed writeback is
scheduled. The single-threaded writeback then iterates through the dirty
list and performs the writeback operations.

We have developed a prototype implementation of parallel writeback, where
we have made the writeback process per-CPU-based. The b_io, b_dirty,
b_dirty_time, and b_more_io lists have also been modified to be per-CPU.
When an inode needs to be added to the b_dirty list, we select the next
CPU (in a round-robin fashion) and schedule the per-CPU writeback work on
the selected CPU.

With the per-CPU threads handling the writeback, we have observed a
significant increase in IOPS. Here is a test and comparison between the
older writeback and the newer parallel writeback, using the XFS
filesystem:
https://github.com/kundanthebest/parallel_writeback/blob/main/README.md

In our tests, we have found that with this implementation the buffered
I/O IOPS surpasses that of direct I/O. We have done very limited testing
with NVMe (Optane) and PMEM.

There are a few items that we would like to discuss:

During the implementation, we noticed several ways in which the writeback
IOs are throttled:
A) balance_dirty_pages
B) writeback_chunk_size
C) wb_over_bg_thresh
Additionally, there are delayed writeback executions in the form of
dirty_writeback_centisecs and dirty_expire_centisecs.

With the introduction of per-CPU writeback, we need to determine the
appropriate way to adjust the throttling and delayed writeback settings.

Lock contention:
We have observed that the writeback list_lock and the inode->i_lock are
the locks that are most likely to cause contention when we make the
writeback threads per-CPU.
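
To make the contention point concrete, the scheme described above
(per-CPU dirty lists, round-robin CPU selection, one shared list lock)
can be sketched in userspace C roughly as follows. This is an
illustration only, not the prototype code: demo_inode, demo_wb_ctx,
demo_pick_cpu(), demo_mark_dirty() and the fixed NR_CPUS are all
hypothetical names, and a spinning atomic_flag merely stands in for the
writeback list_lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_CPUS 4   /* assumption: fixed CPU count for the sketch */

/* Hypothetical stand-in for struct inode; only what the sketch needs. */
struct demo_inode {
    unsigned long ino;
    struct demo_inode *next;   /* link on a per-CPU dirty list */
};

/* Hypothetical per-CPU writeback context: one dirty list per CPU. */
struct demo_wb_ctx {
    struct demo_inode *b_dirty;
    unsigned int nr_dirty;
};

static struct demo_wb_ctx wb_ctx[NR_CPUS];
static atomic_uint next_cpu;                      /* round-robin cursor */
static atomic_flag list_lock = ATOMIC_FLAG_INIT;  /* single shared lock */

/* Pick the next CPU in round-robin order. */
static unsigned int demo_pick_cpu(void)
{
    return atomic_fetch_add(&next_cpu, 1) % NR_CPUS;
}

/* Queue a dirty inode on the chosen CPU's list; returns that CPU. */
static unsigned int demo_mark_dirty(struct demo_inode *inode)
{
    unsigned int cpu = demo_pick_cpu();

    assert(cpu < NR_CPUS);

    /* Every per-CPU list is guarded by the same lock: the contention point. */
    while (atomic_flag_test_and_set(&list_lock))
        ;   /* spin until the shared lock is free */
    inode->next = wb_ctx[cpu].b_dirty;
    wb_ctx[cpu].b_dirty = inode;
    wb_ctx[cpu].nr_dirty++;
    atomic_flag_clear(&list_lock);

    /* A real implementation would schedule per-CPU writeback work here. */
    return cpu;
}
```

In this sketch the dirty lists are already split per CPU, but every
insertion still serializes on the one shared lock, which is the kind of
contention described above.
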
Our next task will be to convert the writeback list_lock to a per-CPU
lock. Another potential improvement could be to assign a specific CPU to
an inode.

We are preparing an RFC with these changes and plan to submit it soon.

[1] https://lore.kernel.org/linux-mm/Zd5lORiOCUsARPWq@xxxxxxxxxxxxxxxxxxx/T/

--
2.25.1