[LSF/MM/BPF TOPIC] Parallelizing filesystem writeback

Kundan Kumar <kundan.kumar@xxxxxxxxxxx> · Wed, 29 Jan 2025 15:56:27 +0530

Greetings everyone,

Anuj and I would like to present our proposal for implementing parallel
writeback at LSF/MM. This approach could address the writeback performance
issue discussed by Luis last year [1].

Currently, page cache writeback is single-threaded. Inodes are added to the
dirty list and delayed writeback is scheduled. A single writeback thread then
walks the dirty list and writes the inodes back.

We have developed a prototype implementation of parallel writeback, where we
have made the writeback process per-CPU-based. The b_io, b_dirty, b_dirty_time,
and b_more_io lists have also been modified to be per-CPU. When an inode needs
to be added to the b_dirty list, we select the next CPU (in a round-robin
fashion) and schedule the per-CPU writeback work on the selected CPU.
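
To make the round-robin selection concrete, here is a minimal userspace model
of it. This is only a sketch and not the prototype code: NR_WB_CTX, the demo_*
names and both structs are hypothetical, a plain array stands in for real
per-CPU data, and list locking is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_WB_CTX 4	/* hypothetical: stands in for the number of CPUs */

/* Hypothetical stand-in for an inode that can sit on a dirty list. */
struct demo_inode {
	unsigned long ino;
	struct demo_inode *next;
	int wb_ctx;		/* context the inode was queued on */
};

/* Hypothetical per-CPU writeback context; models a per-CPU b_dirty list. */
struct demo_wb_ctx {
	struct demo_inode *b_dirty;
	unsigned int nr_dirty;
	unsigned int work_scheduled;	/* models "writeback work is queued" */
};

static struct demo_wb_ctx wb_ctxs[NR_WB_CTX];
static atomic_uint next_ctx;	/* shared round-robin cursor, starts at 0 */

/* Pick the next context in round-robin order. */
static int demo_pick_ctx(void)
{
	return (int)(atomic_fetch_add(&next_ctx, 1) % NR_WB_CTX);
}

/*
 * Queue a dirty inode on the selected context and flag that context's
 * writeback work as scheduled. Returns the index of the chosen context.
 * Not thread-safe: the list update itself is unlocked in this model.
 */
static int demo_mark_inode_dirty(struct demo_inode *inode)
{
	int idx = demo_pick_ctx();
	struct demo_wb_ctx *ctx = &wb_ctxs[idx];

	inode->next = ctx->b_dirty;
	ctx->b_dirty = inode;
	ctx->nr_dirty++;
	ctx->work_scheduled = 1;
	inode->wb_ctx = idx;
	return idx;
}
```

In this model, consecutive calls to demo_mark_inode_dirty() land on contexts
0, 1, 2, 3, 0, ... so dirty inodes spread evenly regardless of which CPU
dirtied them.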

With per-CPU threads handling the writeback, we have observed a significant
increase in IOPS. Here is a test and a comparison between the existing
single-threaded writeback and the parallel writeback prototype, using the XFS
filesystem:
https://github.com/kundanthebest/parallel_writeback/blob/main/README.md

In our tests, we have found that with this implementation buffered I/O IOPS
exceed those of direct I/O. Our testing so far has been very limited, on NVMe
(Optane) and PMEM.

There are a few items that we would like to discuss:

Throttling and delayed writeback:
During the implementation, we noticed several ways in which writeback I/O is
throttled:
A) balance_dirty_pages
B) writeback_chunk_size
C) wb_over_bg_thresh
Additionally, there are delayed writeback executions in the form of
dirty_writeback_centisecs and dirty_expire_centisecs.
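
As a reminder of what the two delay knobs mean, the sketch below models them
in userspace. It assumes the documented defaults (500 and 3000 centiseconds)
and, as a simplification, periodic wakeups at exact multiples of the interval.
The demo_* names are hypothetical and this is not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>

/* Documented defaults of the two vm sysctls, in centiseconds. */
#define DEMO_DIRTY_WRITEBACK_CS	500	/* flusher wakeup interval: 5 s */
#define DEMO_DIRTY_EXPIRE_CS	3000	/* age before data counts as old: 30 s */

/*
 * Has an inode dirtied at dirtied_cs been dirty for at least expire_cs?
 * Precondition: now_cs >= dirtied_cs.
 */
static bool demo_inode_expired(unsigned long dirtied_cs, unsigned long now_cs,
			       unsigned int expire_cs)
{
	return now_cs - dirtied_cs >= expire_cs;
}

/*
 * Compute the first periodic wakeup at which the inode is old enough to be
 * written back, assuming wakeups at 0, interval_cs, 2 * interval_cs, ...
 * Returns false if interval_cs is 0, which disables periodic writeback.
 */
static bool demo_first_eligible_wakeup(unsigned long dirtied_cs,
				       unsigned int interval_cs,
				       unsigned int expire_cs,
				       unsigned long *wakeup_cs)
{
	unsigned long eligible_at = dirtied_cs + expire_cs;

	if (interval_cs == 0)
		return false;

	/* Round up to the next multiple of the wakeup interval. */
	*wakeup_cs = ((eligible_at + interval_cs - 1) / interval_cs) * interval_cs;
	return true;
}
```

With the defaults, an inode dirtied at t = 120 cs becomes eligible at 3120 cs
and is picked up by the wakeup at 3500 cs in this model. With per-CPU
writeback, each context would presumably run its own timer, which is part of
what needs to be decided.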

With the introduction of per-CPU writeback, we need to determine the
appropriate way to adjust the throttling and delayed writeback settings.

Lock contention:
We have observed that the writeback list_lock and inode->i_lock are the locks
most likely to see contention once the writeback threads are per-CPU. Our next
task will be to convert the writeback list_lock to a per-CPU lock. Another
potential improvement could be to assign a specific CPU to each inode, so that
an inode is always handled by the same writeback thread.
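
The two ideas above can be sketched together in userspace as follows. This is
an illustration under assumptions, not the planned patch: the sketch_* names
and NR_WB_CTX are hypothetical, the inode-to-context mapping is a plain modulo
on the inode number, and a C11 atomic_flag spinlock stands in for a per-CPU
list_lock.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_WB_CTX 4	/* hypothetical: stands in for the number of CPUs */

/* Hypothetical per-CPU context with its own lock instead of a shared one. */
struct sketch_wb_ctx {
	atomic_flag list_lock;	/* models a per-CPU list_lock */
	unsigned int nr_dirty;	/* protected by list_lock */
};

static struct sketch_wb_ctx sketch_ctxs[NR_WB_CTX];

/* Put every context into a known unlocked, empty state. */
static void sketch_init(void)
{
	for (int i = 0; i < NR_WB_CTX; i++) {
		atomic_flag_clear(&sketch_ctxs[i].list_lock);
		sketch_ctxs[i].nr_dirty = 0;
	}
}

/* Sticky mapping: the same inode number always maps to the same context. */
static unsigned int sketch_ctx_for_inode(unsigned long ino)
{
	return (unsigned int)(ino % NR_WB_CTX);
}

/*
 * Account a dirty inode on its assigned context. Only that context's lock
 * is taken, so inodes mapped to different contexts do not contend.
 */
static unsigned int sketch_queue_dirty(unsigned long ino)
{
	unsigned int idx = sketch_ctx_for_inode(ino);
	struct sketch_wb_ctx *ctx = &sketch_ctxs[idx];

	while (atomic_flag_test_and_set_explicit(&ctx->list_lock,
						 memory_order_acquire))
		;	/* spin until this context's lock is free */
	ctx->nr_dirty++;
	atomic_flag_clear_explicit(&ctx->list_lock, memory_order_release);

	return idx;
}
```

A fixed mapping trades the even spread of round-robin for locality: repeated
dirtying of one inode always hits the same context, but a few hot inodes that
map to the same context could leave the others idle.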

We are preparing an RFC patch series with these changes and plan to post it
soon.

[1] https://lore.kernel.org/linux-mm/Zd5lORiOCUsARPWq@xxxxxxxxxxxxxxxxxxx/T/

-- 
2.25.1