Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback

Jan Kara <jack@xxxxxxx> · Wed, 19 Mar 2025 16:54:07 +0100

On Tue 18-03-25 17:07:12, Anuj Gupta wrote:
> > I was thinking about how to best parallelize the writeback and I think
> > there are two quite different demands for which we probably want two
> > different levels of parallelism.
> > 
> > One case is the situation when the filesystem for example has multiple
> > underlying devices (like btrfs or bcachefs) or for other reasons writeback
> > to different parts is fairly independent (like for different XFS AGs). Here
> > we want parallelism at rather high level I think including separate
> > dirty throttling, tracking of writeback bandwidth etc.. It is *almost* like
> > separate bdis (struct backing_dev_info) but I think it would be technically
> > and also conceptually somewhat easier to do the multiplexing by factoring
> > out:
> > 
> >         struct bdi_writeback wb;  /* the root writeback info for this bdi */
> >         struct list_head wb_list; /* list of all wbs */
> > #ifdef CONFIG_CGROUP_WRITEBACK
> >         struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
> >         struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */
> > #endif
> >         wait_queue_head_t wb_waitq;
> > 
> > into a new structure (looking for a good name - bdi_writeback_context???)
> > that can get multiplied (filesystem can create its own bdi on mount and
> > configure there number of bdi_writeback_contexts it wants). We also need to
> > add a hook sb->s_ops->get_inode_wb_context() called from __inode_attach_wb()
> > which will return appropriate bdi_writeback_context (or perhaps just it's
> > index?) for an inode. This will be used by the filesystem to direct
> > writeback code where the inode should go. This is kind of what Kundan did
> > in the last revision of his patches but I hope this approach should
> > somewhat limit the changes necessary to writeback infrastructure - the
> > patch 2 in his series is really unreviewably large...
> 
> Thanks Jan.
> 
> I tried to modify the existing infra based on your suggestion [1]. This
> only takes care of the first case that you mentioned. I am yet to start
> to evaluate and implement the second case (when amount of cpu work is
> high). The patch is large, because it requires changing all the places
> that uses bdi's fields that have now been moved to a new structure. I
> will try to streamline it further.
> 
> I have omitted the xfs side plumbing for now.
> Can you please see if this aligns with the scheme that you had in mind?
> If the infra looks fine, I can plumb XFS changes on top of it.
> 
> Note: This patch is only compile tested, haven't run any tests on it.

Sure, this is fine for this early prototyping phase.

> Subject: [PATCH] writeback: add infra for parallel writeback
> 
> Introduce a new bdi_writeback_ctx structure that enables us to have
> multiple writeback contexts for parallel writeback.
> 
> Existing single threaded writeback will use default_ctx.
> 
> Filesystem now have the option to create there own bdi aind populate
> multiple bdi_writeback_ctx in bdi's bdi_wb_ctx_arr (xarray for now, but
> plan to move to use a better structure like list_lru).
> 
> Introduce a new hook get_inode_wb_ctx() in super_operations which takes
> inode as a parameter and returns the bdi_wb_ctx corresponding to the
> inode.
> 
> Modify all the functions/places that operate on bdi's wb, wb_list,
> cgwb_tree, wb_switch_rwsem, wb_waitq as these fields have now been moved
> to bdi_writeback_ctx
> 
> Store a reference to bdi_wb_ctx in bdi_writeback.
> 
> Suggested-by: Jan Kara <jack@xxxxxxx>
> Signed-off-by: Anuj Gupta <anuj20.g@xxxxxxxxxxx>
> Signed-off-by: Kundan Kumar <kundan.kumar@xxxxxxxxxxx>

Thanks for the patch! Overall I think we are going in the right direction
:) Some comments below.

> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 3cd99e2dc6ac..4c47c298f174 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -265,23 +265,27 @@ void __inode_attach_wb(struct inode *inode, struct folio *folio)
>  {
>  	struct backing_dev_info *bdi = inode_to_bdi(inode);
>  	struct bdi_writeback *wb = NULL;
> +	struct bdi_writeback_ctx *bdi_writeback_ctx =
> +						fetch_bdi_writeback_ctx(inode);
						^^^
I'd call this inode_bdi_writeback_ctx(). I think that name tells a bit
more.

>  	if (inode_cgwb_enabled(inode)) {
>  		struct cgroup_subsys_state *memcg_css;
>  
>  		if (folio) {
>  			memcg_css = mem_cgroup_css_from_folio(folio);
> -			wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
> +			wb = wb_get_create(bdi, bdi_writeback_ctx, memcg_css,
> +					   GFP_ATOMIC);
>  		} else {
>  			/* must pin memcg_css, see wb_get_create() */
>  			memcg_css = task_get_css(current, memory_cgrp_id);
> -			wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
> +			wb = wb_get_create(bdi, bdi_writeback_ctx, memcg_css,
> +					   GFP_ATOMIC);
>  			css_put(memcg_css);
>  		}
>  	}
>  
>  	if (!wb)
> -		wb = &bdi->wb;
> +		wb = &bdi_writeback_ctx->wb;

Perhaps as a preparation patch, can you please rename bdi->wb to
bdi->root_wb (just mechanical replacement) and then carry over this name to
bdi_writeback_ctx? Because we have too many wbs here already and as the
structure is getting more complex, explanatory naming becomes more
important. Thanks!

> @@ -1103,7 +1112,17 @@ int cgroup_writeback_by_id(u64 bdi_id, int memcg_id,
>  	 * And find the associated wb.  If the wb isn't there already
>  	 * there's nothing to flush, don't create one.
>  	 */
> -	wb = wb_get_lookup(bdi, memcg_css);
> +	if (bdi->is_parallel) {
> +		struct bdi_writeback_ctx *bdi_wb_ctx;
> +
> +		for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) {
> +			wb = wb_get_lookup(bdi_wb_ctx, memcg_css);
> +			if (wb)
> +				break;
> +		}
> +	} else {
> +		wb = wb_get_lookup(&bdi->default_ctx, memcg_css);
> +	}

Why do you introduce this bdi->is_parallel bool? Why don't we
unconditionally do:

	for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) {
		wb = wb_get_lookup(bdi_wb_ctx, memcg_css);
		if (wb)
			break;
	}

It would seem more obvious and with less code duplication... It might
require a bit of magic inside for_each_bdi_wb_ctx() but I think it pays
off. BTW, you should initialize 'wb' before the loop to NULL. Otherwise I
expect some compilers to complain and also for readers it makes the test
below obviously do the right thing.

>  	if (!wb) {
>  		ret = -ENOENT;
>  		goto out_css_put;
...
> @@ -1232,13 +1251,14 @@ static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
>  
>  static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
>  				  struct wb_writeback_work *base_work,
> -				  bool skip_if_busy)
> +				  bool skip_if_busy,
> +				  struct bdi_writeback_ctx *bdi_wb_ctx)

Nit; logical position for bdi_wb_ctx argument would be just after 'bdi' but
see below - I think we could pass only bdi_wb_ctx here after some changes.

> @@ -2713,11 +2753,12 @@ static void wait_sb_inodes(struct super_block *sb)
>  	mutex_unlock(&sb->s_sync_lock);
>  }
>  
> -static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
> -				     enum wb_reason reason, bool skip_if_busy)
> +static void __writeback_inodes_sb_nr_bdi_wb_ctx(struct super_block *sb,
> +						unsigned long nr, enum wb_reason reason,
> +						bool skip_if_busy,
> +						struct bdi_writeback_ctx *bdi_wb_ctx)
>  {
> -	struct backing_dev_info *bdi = sb->s_bdi;
> -	DEFINE_WB_COMPLETION(done, bdi);
> +	DEFINE_WB_COMPLETION(done, bdi_wb_ctx);
>  	struct wb_writeback_work work = {
>  		.sb			= sb,
>  		.sync_mode		= WB_SYNC_NONE,
> @@ -2727,12 +2768,29 @@ static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
>  		.reason			= reason,
>  	};
>  
> +	bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy, bdi_wb_ctx);
> +	wb_wait_for_completion(&done);
> +}
> +
> +static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
> +				     enum wb_reason reason, bool skip_if_busy)
> +{
> +	struct backing_dev_info *bdi = sb->s_bdi;
> +
>  	if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info)
>  		return;
>  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>  
> -	bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy);
> -	wb_wait_for_completion(&done);
> +	if (bdi->is_parallel) {
> +		struct bdi_writeback_ctx *bdi_wb_ctx;
> +
> +		for_each_bdi_wb_ctx(bdi, bdi_wb_ctx)
> +			__writeback_inodes_sb_nr_bdi_wb_ctx(sb, nr, reason,
> +							    skip_if_busy, bdi_wb_ctx);
> +	} else {
> +		__writeback_inodes_sb_nr_bdi_wb_ctx(sb, nr, reason,
> +						    skip_if_busy, &bdi->default_ctx);
> +	}
>  }

If we drop this 'is_parallel' thing, I think we don't need the new
__writeback_inodes_sb_nr_bdi_wb_ctx() function and can keep everything
in one function.

>  
>  /**
> @@ -2785,17 +2843,10 @@ void try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason)
>  }
>  EXPORT_SYMBOL(try_to_writeback_inodes_sb);
>  
> -/**
> - * sync_inodes_sb	-	sync sb inode pages
> - * @sb: the superblock
> - *
> - * This function writes and waits on any dirty inode belonging to this
> - * super_block.
> - */
> -void sync_inodes_sb(struct super_block *sb)
> +static void sync_inodes_bdi_wb_ctx(struct super_block *sb,
> +		struct backing_dev_info *bdi, struct bdi_writeback_ctx *bdi_wb_ctx)
>  {
> -	struct backing_dev_info *bdi = sb->s_bdi;
> -	DEFINE_WB_COMPLETION(done, bdi);
> +	DEFINE_WB_COMPLETION(done, bdi_wb_ctx);
>  	struct wb_writeback_work work = {
>  		.sb		= sb,
>  		.sync_mode	= WB_SYNC_ALL,
> @@ -2805,6 +2856,22 @@ void sync_inodes_sb(struct super_block *sb)
>  		.reason		= WB_REASON_SYNC,
>  		.for_sync	= 1,
>  	};
> +	bdi_wb_ctx_down_write_wb_switch_rwsem(bdi_wb_ctx);
> +	bdi_split_work_to_wbs(bdi, &work, false, bdi_wb_ctx);
> +	wb_wait_for_completion(&done);
> +	bdi_wb_ctx_up_write_wb_switch_rwsem(bdi_wb_ctx);
> +}
> +
> +/**
> + * sync_inodes_sb	-	sync sb inode pages
> + * @sb: the superblock
> + *
> + * This function writes and waits on any dirty inode belonging to this
> + * super_block.
> + */
> +void sync_inodes_sb(struct super_block *sb)
> +{
> +	struct backing_dev_info *bdi = sb->s_bdi;
>  
>  	/*
>  	 * Can't skip on !bdi_has_dirty() because we should wait for !dirty
> @@ -2815,12 +2882,15 @@ void sync_inodes_sb(struct super_block *sb)
>  		return;
>  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>  
> -	/* protect against inode wb switch, see inode_switch_wbs_work_fn() */
> -	bdi_down_write_wb_switch_rwsem(bdi);
> -	bdi_split_work_to_wbs(bdi, &work, false);
> -	wb_wait_for_completion(&done);
> -	bdi_up_write_wb_switch_rwsem(bdi);
> +	if (bdi->is_parallel) {
> +		struct bdi_writeback_ctx *bdi_wb_ctx;
>  
> +		for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) {
> +			sync_inodes_bdi_wb_ctx(sb, bdi, bdi_wb_ctx);
> +		}
> +	} else {
> +		sync_inodes_bdi_wb_ctx(sb, bdi, &bdi->default_ctx);
> +	}
>  	wait_sb_inodes(sb);
>  }
>  EXPORT_SYMBOL(sync_inodes_sb);

The same comment as above.

> @@ -104,6 +104,7 @@ struct wb_completion {
>   */
>  struct bdi_writeback {
>  	struct backing_dev_info *bdi;	/* our parent bdi */
> +	struct bdi_writeback_ctx *bdi_wb_ctx;
>  
>  	unsigned long state;		/* Always use atomic bitops on this */
>  	unsigned long last_old_flush;	/* last old data flush */
> @@ -160,6 +161,16 @@ struct bdi_writeback {
>  #endif
>  };
>  
> +struct bdi_writeback_ctx {
> +	struct bdi_writeback wb;  /* the root writeback info for this bdi */
> +	struct list_head wb_list; /* list of all wbs */
> +#ifdef CONFIG_CGROUP_WRITEBACK
> +	struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
> +	struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */
> +#endif
> +	wait_queue_head_t wb_waitq;
> +};
> +

As I was seeing the patch, I'd also add:
	struct backing_dev_info *bdi;

pointer to bdi_writeback_ctx. Then in most functions, you can just pass
bdi_wb_ctx into them instead of both bdi *and* bdi_wb_ctx and it makes
interfaces somewhat nicer.

>  struct backing_dev_info {
>  	u64 id;
>  	struct rb_node rb_node; /* keyed by ->id */
> @@ -182,16 +193,13 @@ struct backing_dev_info {
>  	 * blk-wbt.
>  	 */
>  	unsigned long last_bdp_sleep;
> -
> -	struct bdi_writeback wb;  /* the root writeback info for this bdi */
> -	struct list_head wb_list; /* list of all wbs */
> +	struct bdi_writeback_ctx default_ctx;
> +	bool is_parallel;
> +	int nr_wb_ctx;
> +	struct xarray bdi_wb_ctx_arr;

I think xarray here is overkill. I'd just make this a plain array:

	struct bdi_writeback_ctx **bdi_wb_ctx_arr;

which will get allocated with nr_wb_ctx entries during bdi_init(). Also I'd
make default_ctx just be entry at index 0 in this array. I'm undecided
whether it will be clearer to just drop default_ctx altogether or keep it
and set:

	struct bdi_writeback_ctx *default_ctx = bdi_wb_ctx_arr[0];

on init so I'll leave that up to you.

>  #ifdef CONFIG_CGROUP_WRITEBACK
> -	struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
>  	struct mutex cgwb_release_mutex;  /* protect shutdown of wb structs */
> -	struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */
>  #endif
> -	wait_queue_head_t wb_waitq;
> -
>  	struct device *dev;
>  	char dev_name[64];
>  	struct device *owner;

...

> +static inline struct bdi_writeback_ctx *fetch_bdi_writeback_ctx(struct inode *inode)
> +{
> +	struct backing_dev_info *bdi = inode_to_bdi(inode);
> +	struct super_block *sb = inode->i_sb;
> +	struct bdi_writeback_ctx *bdi_wb_ctx;
> +
> +	if (sb->s_op->get_inode_wb_ctx)
> +		bdi_wb_ctx = sb->s_op->get_inode_wb_ctx(inode);
> +	else
> +		bdi_wb_ctx = &bdi->default_ctx;
> +	return bdi_wb_ctx;
> +}

The indirect call (and the handling in there) might get potentially
expensive given all the places where this is called. For now I think we are
fine but eventually we might need to find space in struct inode (I know
that's a hard sell) for u8/u16 to store writeback context index there. But
first let's get this working before we burry ourselves in (potentially
premature) performance optimizations.

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7e29433c5ecc..667f97c68cd1 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2228,6 +2228,7 @@ struct super_operations {
>     	struct inode *(*alloc_inode)(struct super_block *sb);
>  	void (*destroy_inode)(struct inode *);
>  	void (*free_inode)(struct inode *);
> +	struct bdi_writeback_ctx* (*get_inode_wb_ctx)(struct inode *);
                                ^^ wrongly placed space here.
>  
>     	void (*dirty_inode) (struct inode *, int flags);
>  	int (*write_inode) (struct inode *, struct writeback_control *wbc);

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR