> Then selecting inodes for writeback becomes a list_lru_walk()
> variant depending on what needs to be written back (e.g. physical
> node, memcg, both, everything that is dirty everywhere, etc).
We considered using list_lru to track inodes within a writeback context.
This can be implemented as:

struct bdi_writeback {
	struct list_lru b_dirty_inodes_lru;	/* instead of a single b_dirty list */
	struct list_lru b_io_dirty_inodes_lru;
	...
};
By doing this, we would obtain a sharded list of inodes per NUMA node.
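For illustration, this is roughly what the list_lru_walk() selection from the quoted suggestion could look like on top of such a structure. It is only a sketch: the list_lru_walk_cb signature shown here (with the spinlock argument) is the older form and has changed in recent kernels, so treat it as pseudocode rather than the exact current API.

static enum lru_status wb_isolate_dirty_inode(struct list_head *item,
					      struct list_lru_one *list,
					      spinlock_t *lock, void *cb_arg)
{
	struct list_head *dispatch = cb_arg;

	/* Move this inode's list entry to a private dispatch list. */
	list_lru_isolate_move(list, item, dispatch);
	return LRU_REMOVED;
}

/* Write back dirty inodes belonging to one NUMA node of this wb. */
static void wb_writeback_node(struct bdi_writeback *wb, int nid,
			      unsigned long nr_to_walk)
{
	LIST_HEAD(dispatch);

	list_lru_walk_node(&wb->b_dirty_inodes_lru, nid,
			   wb_isolate_dirty_inode, &dispatch, &nr_to_walk);
	/* ... queue the inodes on 'dispatch' for writeback ... */
}

Walking everything dirty would then be a plain list_lru_walk(), and a memcg-aware variant would use list_lru_walk_one() with the target memcg.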
I think you've misunderstood Dave's suggestion here. list_lru was given as
an example of a structure for inspiration. We cannot take it directly as is
for writeback purposes because we don't want to be sharding based on NUMA
nodes but rather based on some other (likely FS driven) criteria.
Makes sense. Thanks for the clarification.
I was thinking about how to best parallelize the writeback and I think
there are two quite different demands for which we probably want two
different levels of parallelism.
One case is the situation when the filesystem, for example, has multiple underlying devices (like btrfs or bcachefs), or when for other reasons writeback to different parts is fairly independent (like for different XFS AGs). Here we want parallelism at a rather high level, I think, including separate dirty throttling, tracking of writeback bandwidth, etc. It is *almost* like separate bdis (struct backing_dev_info), but I think it would be technically and also conceptually somewhat easier to do the multiplexing by factoring out:
	struct bdi_writeback wb;		/* the root writeback info for this bdi */
	struct list_head wb_list;		/* list of all wbs */
#ifdef CONFIG_CGROUP_WRITEBACK
	struct radix_tree_root cgwb_tree;	/* radix tree of active cgroup wbs */
	struct rw_semaphore wb_switch_rwsem;	/* no cgwb switch while syncing */
#endif
	wait_queue_head_t wb_waitq;
into a new structure (looking for a good name - bdi_writeback_context???) that can get multiplied (the filesystem can create its own bdi on mount and configure the number of bdi_writeback_contexts it wants). We also need to add a hook sb->s_ops->get_inode_wb_context(), called from __inode_attach_wb(), which will return the appropriate bdi_writeback_context (or perhaps just its index?) for an inode. This is how the filesystem will direct the writeback code to where the inode should go. This is kind of what Kundan did in the last revision of his patches, but I hope this approach should somewhat limit the changes necessary to the writeback infrastructure.
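To make the above a bit more concrete, here is a rough sketch of the direction. Nothing below exists today: the bdi_writeback_context name and the get_inode_wb_context() hook are taken from the paragraph above, while the exact field layout, the nr_wb_ctx / wb_ctx fields, and the myfs_*() helper are made up for illustration.

struct bdi_writeback_context {
	struct bdi_writeback wb;	/* root writeback info for this context */
	struct list_head wb_list;	/* list of all wbs in this context */
#ifdef CONFIG_CGROUP_WRITEBACK
	struct radix_tree_root cgwb_tree;
	struct rw_semaphore wb_switch_rwsem;
#endif
	wait_queue_head_t wb_waitq;
};

struct backing_dev_info {
	...
	int nr_wb_ctx;				/* chosen by the filesystem at mount */
	struct bdi_writeback_context *wb_ctx;	/* array of nr_wb_ctx contexts */
	...
};

/* New hook in struct super_operations, called from __inode_attach_wb(): */
struct bdi_writeback_context *(*get_inode_wb_context)(struct inode *inode);

/*
 * Example: an XFS-like filesystem with one writeback context per
 * allocation group (myfs_inode_to_ag() is a hypothetical helper).
 */
static struct bdi_writeback_context *
myfs_get_inode_wb_context(struct inode *inode)
{
	struct backing_dev_info *bdi = inode_to_bdi(inode);

	return &bdi->wb_ctx[myfs_inode_to_ag(inode) % bdi->nr_wb_ctx];
}

Returning the context (or just its index) from a superblock hook keeps the sharding criterion entirely inside the filesystem, so the generic writeback code only needs to iterate over the wb_ctx array.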
This looks much better than the data structures we had in the previous
version. I will prepare a new version based on this feedback.
Patch 2 in his series is really unreviewably large...
I agree. Sorry, will try to streamline the patches in a better fashion
in the next iteration.