On Thu, Mar 13, 2025 at 09:22:00PM +0100, Jan Kara wrote:
> On Thu 20-02-25 19:49:22, Kundan Kumar wrote:
> > > Well, that's currently selected by __inode_attach_wb() based on
> > > whether there is a memcg attached to the folio/task being dirtied or
> > > not. If there isn't a cgroup based writeback task, then it uses the
> > > bdi->wb as the wb context.
> >
> > We have created a proof of concept for per-AG context-based writeback, as
> > described in [1]. The AG is mapped to a writeback context (wb_ctx). Using
> > the filesystem handler, __mark_inode_dirty() selects the writeback context
> > corresponding to the inode.
> >
> > We attempted to handle memcg and bdi based writeback in a similar manner.
> > This approach aims to maintain the original writeback semantics while
> > providing parallelism. This helps in pushing more data early to the
> > device, trying to ease the write pressure faster.
> >
> > [1] https://lore.kernel.org/all/20250212103634.448437-1-kundan.kumar@xxxxxxxxxxx/
>
> Yeah, I've seen the patches. Sorry for not getting to you earlier.
>
> > > Then selecting inodes for writeback becomes a list_lru_walk()
> > > variant depending on what needs to be written back (e.g. physical
> > > node, memcg, both, everything that is dirty everywhere, etc).
> >
> > We considered using list_lru to track inodes within a writeback context.
> > This can be implemented as:
> >
> > struct bdi_writeback {
> > 	struct list_lru b_dirty_inodes_lru;	// instead of a single b_dirty list
> > 	struct list_lru b_io_dirty_inodes_lru;
> > 	...
> > 	...
> > };
> >
> > By doing this, we would obtain a sharded list of inodes per NUMA node.
>
> I think you've misunderstood Dave's suggestion here. list_lru was given as
> an example of a structure for inspiration. We cannot take it directly as is
> for writeback purposes because we don't want to be sharding based on NUMA
> nodes but rather based on some other (likely FS driven) criteria.

Well, you might say that, but.....

... I was actually thinking of taking the list_lru and abstracting its
N-way parallelism away from the NUMA infrastructure.

The NUMA awareness of the list_lru is largely in external APIs. The only
implicit NUMA awareness is in the list_lru_add() function, where it
converts the object being added to the list to a node ID based on where
it is physically located in memory.

The only other thing that is NUMA specific is that the list is set up
with N-way concurrency where N = the number of NUMA nodes in the machine.
So, really, it is just a thin veneer of NUMA wrapped around the inherent
concurrency built into the structure.

IOWs, when we create a list_lru for a NUMA aware shrinker, we simply use
the number of nodes as the N-way parallelism for the list, and the
existing per-node infrastructure simply feeds the right NUMA node ID as
the "list index" for it to function as is.

In the case of writeback parallelism, we could create a list_lru with the
number of AGs as the N-way parallelism for the list, and then have the
concurrent BDI writeback contexts (1 per AG) simply provide the AG number
as the "list index" for writeback....
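To make that a bit more concrete, this is roughly the shape of API I have
in mind. To be clear, this is only a sketch: list_lru_init_sharded(),
list_lru_add_index(), list_lru_walk_index() and writeback_one_inode() do
not exist today, they are made-up names for the indexed variants we would
need, and wb_ctx_lists/wbl/agno/wbc/nr_to_write are placeholders:

	/*
	 * Sketch only: the *_sharded/*_index helpers are hypothetical
	 * indexed variants of the existing list_lru API.
	 */
	struct wb_ctx_lists {
		struct list_lru	b_dirty_inodes;	/* N-way sharded dirty inode list */
	};

	/* N-way concurrency comes from FS geometry, not nr_node_ids. */
	error = list_lru_init_sharded(&wbl->b_dirty_inodes,
				      mp->m_sb.sb_agcount);

	/* Dirtying an inode: the FS supplies the shard index explicitly. */
	list_lru_add_index(&wbl->b_dirty_inodes, &inode->i_io_list, agno);

	/* Writeback: each per-AG wb context walks only its own shard. */
	nr_written = list_lru_walk_index(&wbl->b_dirty_inodes, agno,
					 writeback_one_inode, wbc,
					 nr_to_write);

i.e. the only conceptual change from how list_lru is used today is that
the shard index comes from the filesystem rather than being derived from
where the object sits in memory.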
> > However, we would also need per-NUMA writeback contexts. Otherwise,
> > even if inodes are NUMA-sharded, a single writeback context would still
> > process them sequentially, limiting parallelism. But there's a concern:
> > NUMA-based writeback contexts are not aligned with filesystem geometry,
> > which could negatively impact delayed allocation and writeback efficiency,
> > as you pointed out in your previous reply [2].
> >
> > Would it be better to let the filesystem dictate the number of writeback
> > threads, rather than enforcing a per-NUMA model?
>
> I was thinking about how to best parallelize the writeback and I think
> there are two quite different demands for which we probably want two
> different levels of parallelism.
>
> One case is the situation when the filesystem for example has multiple
> underlying devices (like btrfs or bcachefs) or for other reasons writeback
> to different parts is fairly independent (like for different XFS AGs). Here
> we want parallelism at a rather high level, I think, including separate
> dirty throttling, tracking of writeback bandwidth etc. It is *almost* like
> separate bdis (struct backing_dev_info) but I think it would be technically
> and also conceptually somewhat easier to do the multiplexing by factoring
> out:
>
> 	struct bdi_writeback wb;  /* the root writeback info for this bdi */
> 	struct list_head wb_list; /* list of all wbs */
> #ifdef CONFIG_CGROUP_WRITEBACK
> 	struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
> 	struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */
> #endif
> 	wait_queue_head_t wb_waitq;
>
> into a new structure (looking for a good name - bdi_writeback_context???)
> that can get multiplied (the filesystem can create its own bdi on mount and
> configure the number of bdi_writeback_contexts it wants). We also need to
> add a hook sb->s_ops->get_inode_wb_context() called from __inode_attach_wb()
> which will return the appropriate bdi_writeback_context (or perhaps just its
> index?) for an inode. This will be used by the filesystem to direct the
> writeback code where the inode should go. This is kind of what Kundan did
> in the last revision of his patches but I hope this approach should
> somewhat limit the changes necessary to the writeback infrastructure - the
> patch 2 in his series is really unreviewably large...

Yes, this is the equivalent of the SHRINKER_NUMA_AWARE flag that triggers
the shrinker infrastructure to track work in a NUMA aware way when the
subsystem shrinker callbacks are using the NUMA aware list_lru to back
it....
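FWIW, as a sketch of the shape of that factoring (none of this is existing
code - the structure layout, the nr_wb_ctx field and the
xfs_get_inode_wb_context() example are made up here for illustration, and
returning an index is just one of the two options you mentioned):

	/* Sketch only: the fields Jan listed, pulled out of backing_dev_info. */
	struct bdi_writeback_context {
		struct bdi_writeback wb;	/* root wb for this context */
		struct list_head wb_list;	/* list of all wbs */
	#ifdef CONFIG_CGROUP_WRITEBACK
		struct radix_tree_root cgwb_tree;
		struct rw_semaphore wb_switch_rwsem;
	#endif
		wait_queue_head_t wb_waitq;
	};

	struct backing_dev_info {
		...
		int nr_wb_ctx;				/* set by the FS at bdi setup */
		struct bdi_writeback_context *wb_ctx;	/* array of contexts */
		...
	};

	/*
	 * Hypothetical XFS implementation of the proposed
	 * sb->s_ops->get_inode_wb_context() hook, returning a context
	 * index (the AG number) rather than a pointer.
	 */
	static int xfs_get_inode_wb_context(struct inode *inode)
	{
		struct xfs_inode *ip = XFS_I(inode);

		return XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino);
	}

__inode_attach_wb() would then use that index to pick which
bdi_writeback_context the inode's dirty tracking lives under, and
everything below that point stays per-context.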
> Then another case is a situation where either the amount of CPU work is
> rather high for IO submission (cases like Christoph mentioned where the
> filesystem needs to do checksumming on submission or similar) or simply the
> device is rather fast for a single submission thread and the FS doesn't
> have a sensible way to partition inodes (e.g. for ext4 there's no
> meaningful way of partitioning inodes into independent groups - ext4
> allocation groups are small, inodes often span multiple groups, and the
> sets of groups used by different inodes randomly overlap). In this case I
> think we want a single dirty throttling instance, single writeback
> throughput estimation, single set of dirty inode lists etc. The level where
> the parallelism needs to happen is fairly low - I'd say duplicate:
>
> 	struct delayed_work dwork;	/* work item used for writeback */
>
> in struct bdi_writeback. Again, the number of dworks should be configurable
> when creating the bdi for the filesystem.

Right - this is what I described early on in the discussion as a need for
both scaling out across bdi contexts as well as scaling up within a single
bdi context.

That is, even within the context of a BDI writeback context per AG in XFS,
we can have need for multiple IO submissions to be in flight at once. e.g.
one IO submission gets blocked doing allocation, but there is a heap of
pure overwrites queued up behind it. We can submit the pure overwrites
without getting blocked on the allocation, but if we don't have IO
submission concurrency within a BDI, we cannot do that at all.

IOWs, these aren't exclusive optimisations to writeback; there are
situations where only one of the two mechanisms can solve the concurrency
problem. i.e. we need both high level and low level concurrency mechanisms
in filesystems like XFS....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx