On Thu 20-02-25 19:49:22, Kundan Kumar wrote:
> > Well, that's currently selected by __inode_attach_wb() based on
> > whether there is a memcg attached to the folio/task being dirtied or
> > not. If there isn't a cgroup based writeback task, then it uses the
> > bdi->wb as the wb context.
>
> We have created a proof of concept for per-AG context-based writeback, as
> described in [1]. The AG is mapped to a writeback context (wb_ctx). Using
> the filesystem handler, __mark_inode_dirty() selects the writeback context
> corresponding to the inode.
>
> We attempted to handle memcg and bdi based writeback in a similar manner.
> This approach aims to maintain the original writeback semantics while
> providing parallelism. This helps in pushing more data early to the
> device, trying to ease the write pressure faster.
>
> [1] https://lore.kernel.org/all/20250212103634.448437-1-kundan.kumar@xxxxxxxxxxx/

Yeah, I've seen the patches. Sorry for not getting to you earlier.

> > Then selecting inodes for writeback becomes a list_lru_walk() variant
> > depending on what needs to be written back (e.g. physical node, memcg,
> > both, everything that is dirty everywhere, etc).
>
> We considered using list_lru to track inodes within a writeback context.
> This can be implemented as:
>
> struct bdi_writeback {
>	struct list_lru b_dirty_inodes_lru; // instead of a single b_dirty list
>	struct list_lru b_io_dirty_inodes_lru;
>	...
>	...
> };
>
> By doing this, we would obtain a sharded list of inodes per NUMA node.

I think you've misunderstood Dave's suggestion here. list_lru was given as
an example of a structure for inspiration. We cannot take it directly as is
for writeback purposes because we don't want to be sharding based on NUMA
nodes but rather based on some other (likely FS-driven) criteria.

> However, we would also need per-NUMA writeback contexts. Otherwise, even
> if inodes are NUMA-sharded, a single writeback context would still
> process them sequentially, limiting parallelism. But there's a concern:
> NUMA-based writeback contexts are not aligned with filesystem geometry,
> which could negatively impact delayed allocation and writeback
> efficiency, as you pointed out in your previous reply [2].
>
> Would it be better to let the filesystem dictate the number of writeback
> threads, rather than enforcing a per-NUMA model?

I was thinking about how to best parallelize the writeback and I think
there are two quite different demands for which we probably want two
different levels of parallelism.

One case is the situation when the filesystem for example has multiple
underlying devices (like btrfs or bcachefs) or for other reasons writeback
to different parts is fairly independent (like for different XFS AGs).
Here we want parallelism at a rather high level I think, including
separate dirty throttling, tracking of writeback bandwidth, etc. It is
*almost* like separate bdis (struct backing_dev_info) but I think it would
be technically and also conceptually somewhat easier to do the
multiplexing by factoring out:

	struct bdi_writeback wb;  /* the root writeback info for this bdi */
	struct list_head wb_list; /* list of all wbs */
#ifdef CONFIG_CGROUP_WRITEBACK
	struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
	struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */
#endif
	wait_queue_head_t wb_waitq;

into a new structure (looking for a good name - bdi_writeback_context???)
that can get multiplied (the filesystem can create its own bdi on mount
and configure the number of bdi_writeback_contexts it wants). We also need
to add a hook sb->s_op->get_inode_wb_context() called from
__inode_attach_wb() which will return the appropriate bdi_writeback_context
(or perhaps just its index?) for an inode. This will be used by the
filesystem to direct the writeback code where the inode should go. This is
kind of what Kundan did in the last revision of his patches but I hope
this approach should somewhat limit the changes necessary to the writeback
infrastructure - patch 2 in his series is really unreviewably large...
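To make this a bit more concrete, roughly something like the following
(just a sketch to illustrate the idea - the structure name, the exact set
of members, how the contexts hang off struct backing_dev_info, and the
hook prototype are all placeholders, nothing worked out yet):

/* Sketch only: the fields factored out of struct backing_dev_info */
struct bdi_writeback_context {
	struct bdi_writeback wb;	/* root writeback info for this context */
	struct list_head wb_list;	/* list of all wbs in this context */
#ifdef CONFIG_CGROUP_WRITEBACK
	struct radix_tree_root cgwb_tree;    /* radix tree of active cgroup wbs */
	struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */
#endif
	wait_queue_head_t wb_waitq;
};

struct backing_dev_info {
	...
	int nr_wb_ctx;				/* set by the fs when creating the bdi */
	struct bdi_writeback_context *wb_ctx;	/* array of nr_wb_ctx contexts */
	...
};

/* possible shape of the hook - the prototype is just a guess */
struct super_operations {
	...
	struct bdi_writeback_context *(*get_inode_wb_context)(struct inode *inode);
	...
};

and __inode_attach_wb() would then do something like:

	struct bdi_writeback_context *wb_ctx = &bdi->wb_ctx[0];

	if (inode->i_sb->s_op->get_inode_wb_context)
		wb_ctx = inode->i_sb->s_op->get_inode_wb_context(inode);

Filesystems that don't provide the hook would presumably just keep a
single context, i.e. the current behavior.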
Then another case is a situation where either the amount of CPU work
needed for IO submission is rather high (cases like Christoph mentioned
where the filesystem needs to do checksumming on submission or similar) or
simply the device is rather fast for a single submission thread and the FS
doesn't have a sensible way to partition inodes (e.g. for ext4 there's no
meaningful way of partitioning inodes into independent groups - ext4
allocation groups are small, inodes often span multiple groups, and the
sets of groups used by different inodes randomly overlap). In this case I
think we want a single dirty throttling instance, a single writeback
throughput estimate, a single set of dirty inode lists, etc. The level
where the parallelism needs to happen is fairly low - I'd say duplicate:

	struct delayed_work dwork;	/* work item used for writeback */

in struct bdi_writeback. Again, the number of dworks should be
configurable when creating the bdi for the filesystem.

								Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR