On Tue 18-03-25 17:07:12, Anuj Gupta wrote: > > I was thinking about how to best parallelize the writeback and I think > > there are two quite different demands for which we probably want two > > different levels of parallelism. > > > > One case is the situation when the filesystem for example has multiple > > underlying devices (like btrfs or bcachefs) or for other reasons writeback > > to different parts is fairly independent (like for different XFS AGs). Here > > we want parallelism at rather high level I think including separate > > dirty throttling, tracking of writeback bandwidth etc.. It is *almost* like > > separate bdis (struct backing_dev_info) but I think it would be technically > > and also conceptually somewhat easier to do the multiplexing by factoring > > out: > > > > struct bdi_writeback wb; /* the root writeback info for this bdi */ > > struct list_head wb_list; /* list of all wbs */ > > #ifdef CONFIG_CGROUP_WRITEBACK > > struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ > > struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ > > #endif > > wait_queue_head_t wb_waitq; > > > > into a new structure (looking for a good name - bdi_writeback_context???) > > that can get multiplied (filesystem can create its own bdi on mount and > > configure there number of bdi_writeback_contexts it wants). We also need to > > add a hook sb->s_ops->get_inode_wb_context() called from __inode_attach_wb() > > which will return appropriate bdi_writeback_context (or perhaps just it's > > index?) for an inode. This will be used by the filesystem to direct > > writeback code where the inode should go. This is kind of what Kundan did > > in the last revision of his patches but I hope this approach should > > somewhat limit the changes necessary to writeback infrastructure - the > > patch 2 in his series is really unreviewably large... > > Thanks Jan. > > I tried to modify the existing infra based on your suggestion [1]. This > only takes care of the first case that you mentioned. I am yet to start > to evaluate and implement the second case (when amount of cpu work is > high). The patch is large, because it requires changing all the places > that uses bdi's fields that have now been moved to a new structure. I > will try to streamline it further. > > I have omitted the xfs side plumbing for now. > Can you please see if this aligns with the scheme that you had in mind? > If the infra looks fine, I can plumb XFS changes on top of it. > > Note: This patch is only compile tested, haven't run any tests on it. Sure, this is fine for this early prototyping phase. > Subject: [PATCH] writeback: add infra for parallel writeback > > Introduce a new bdi_writeback_ctx structure that enables us to have > multiple writeback contexts for parallel writeback. > > Existing single threaded writeback will use default_ctx. > > Filesystem now have the option to create there own bdi aind populate > multiple bdi_writeback_ctx in bdi's bdi_wb_ctx_arr (xarray for now, but > plan to move to use a better structure like list_lru). > > Introduce a new hook get_inode_wb_ctx() in super_operations which takes > inode as a parameter and returns the bdi_wb_ctx corresponding to the > inode. > > Modify all the functions/places that operate on bdi's wb, wb_list, > cgwb_tree, wb_switch_rwsem, wb_waitq as these fields have now been moved > to bdi_writeback_ctx > > Store a reference to bdi_wb_ctx in bdi_writeback. > > Suggested-by: Jan Kara <jack@xxxxxxx> > Signed-off-by: Anuj Gupta <anuj20.g@xxxxxxxxxxx> > Signed-off-by: Kundan Kumar <kundan.kumar@xxxxxxxxxxx> Thanks for the patch! Overall I think we are going in the right direction :) Some comments below. > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 3cd99e2dc6ac..4c47c298f174 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -265,23 +265,27 @@ void __inode_attach_wb(struct inode *inode, struct folio *folio) > { > struct backing_dev_info *bdi = inode_to_bdi(inode); > struct bdi_writeback *wb = NULL; > + struct bdi_writeback_ctx *bdi_writeback_ctx = > + fetch_bdi_writeback_ctx(inode); ^^^ I'd call this inode_bdi_writeback_ctx(). I think that name tells a bit more. > if (inode_cgwb_enabled(inode)) { > struct cgroup_subsys_state *memcg_css; > > if (folio) { > memcg_css = mem_cgroup_css_from_folio(folio); > - wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); > + wb = wb_get_create(bdi, bdi_writeback_ctx, memcg_css, > + GFP_ATOMIC); > } else { > /* must pin memcg_css, see wb_get_create() */ > memcg_css = task_get_css(current, memory_cgrp_id); > - wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); > + wb = wb_get_create(bdi, bdi_writeback_ctx, memcg_css, > + GFP_ATOMIC); > css_put(memcg_css); > } > } > > if (!wb) > - wb = &bdi->wb; > + wb = &bdi_writeback_ctx->wb; Perhaps as a preparation patch, can you please rename bdi->wb to bdi->root_wb (just mechanical replacement) and then carry over this name to bdi_writeback_ctx? Because we have too many wbs here already and as the structure is getting more complex, explanatory naming becomes more important. Thanks! > @@ -1103,7 +1112,17 @@ int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, > * And find the associated wb. If the wb isn't there already > * there's nothing to flush, don't create one. > */ > - wb = wb_get_lookup(bdi, memcg_css); > + if (bdi->is_parallel) { > + struct bdi_writeback_ctx *bdi_wb_ctx; > + > + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { > + wb = wb_get_lookup(bdi_wb_ctx, memcg_css); > + if (wb) > + break; > + } > + } else { > + wb = wb_get_lookup(&bdi->default_ctx, memcg_css); > + } Why do you introduce this bdi->is_parallel bool? Why don't we unconditionally do: for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { wb = wb_get_lookup(bdi_wb_ctx, memcg_css); if (wb) break; } It would seem more obvious and with less code duplication... It might require a bit of magic inside for_each_bdi_wb_ctx() but I think it pays off. BTW, you should initialize 'wb' before the loop to NULL. Otherwise I expect some compilers to complain and also for readers it makes the test below obviously do the right thing. > if (!wb) { > ret = -ENOENT; > goto out_css_put; ... > @@ -1232,13 +1251,14 @@ static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages) > > static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, > struct wb_writeback_work *base_work, > - bool skip_if_busy) > + bool skip_if_busy, > + struct bdi_writeback_ctx *bdi_wb_ctx) Nit; logical position for bdi_wb_ctx argument would be just after 'bdi' but see below - I think we could pass only bdi_wb_ctx here after some changes. > @@ -2713,11 +2753,12 @@ static void wait_sb_inodes(struct super_block *sb) > mutex_unlock(&sb->s_sync_lock); > } > > -static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, > - enum wb_reason reason, bool skip_if_busy) > +static void __writeback_inodes_sb_nr_bdi_wb_ctx(struct super_block *sb, > + unsigned long nr, enum wb_reason reason, > + bool skip_if_busy, > + struct bdi_writeback_ctx *bdi_wb_ctx) > { > - struct backing_dev_info *bdi = sb->s_bdi; > - DEFINE_WB_COMPLETION(done, bdi); > + DEFINE_WB_COMPLETION(done, bdi_wb_ctx); > struct wb_writeback_work work = { > .sb = sb, > .sync_mode = WB_SYNC_NONE, > @@ -2727,12 +2768,29 @@ static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, > .reason = reason, > }; > > + bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy, bdi_wb_ctx); > + wb_wait_for_completion(&done); > +} > + > +static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, > + enum wb_reason reason, bool skip_if_busy) > +{ > + struct backing_dev_info *bdi = sb->s_bdi; > + > if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info) > return; > WARN_ON(!rwsem_is_locked(&sb->s_umount)); > > - bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy); > - wb_wait_for_completion(&done); > + if (bdi->is_parallel) { > + struct bdi_writeback_ctx *bdi_wb_ctx; > + > + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) > + __writeback_inodes_sb_nr_bdi_wb_ctx(sb, nr, reason, > + skip_if_busy, bdi_wb_ctx); > + } else { > + __writeback_inodes_sb_nr_bdi_wb_ctx(sb, nr, reason, > + skip_if_busy, &bdi->default_ctx); > + } > } If we drop this 'is_parallel' thing, I think we don't need the new __writeback_inodes_sb_nr_bdi_wb_ctx() function and can keep everything in one function. > > /** > @@ -2785,17 +2843,10 @@ void try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason) > } > EXPORT_SYMBOL(try_to_writeback_inodes_sb); > > -/** > - * sync_inodes_sb - sync sb inode pages > - * @sb: the superblock > - * > - * This function writes and waits on any dirty inode belonging to this > - * super_block. > - */ > -void sync_inodes_sb(struct super_block *sb) > +static void sync_inodes_bdi_wb_ctx(struct super_block *sb, > + struct backing_dev_info *bdi, struct bdi_writeback_ctx *bdi_wb_ctx) > { > - struct backing_dev_info *bdi = sb->s_bdi; > - DEFINE_WB_COMPLETION(done, bdi); > + DEFINE_WB_COMPLETION(done, bdi_wb_ctx); > struct wb_writeback_work work = { > .sb = sb, > .sync_mode = WB_SYNC_ALL, > @@ -2805,6 +2856,22 @@ void sync_inodes_sb(struct super_block *sb) > .reason = WB_REASON_SYNC, > .for_sync = 1, > }; > + bdi_wb_ctx_down_write_wb_switch_rwsem(bdi_wb_ctx); > + bdi_split_work_to_wbs(bdi, &work, false, bdi_wb_ctx); > + wb_wait_for_completion(&done); > + bdi_wb_ctx_up_write_wb_switch_rwsem(bdi_wb_ctx); > +} > + > +/** > + * sync_inodes_sb - sync sb inode pages > + * @sb: the superblock > + * > + * This function writes and waits on any dirty inode belonging to this > + * super_block. > + */ > +void sync_inodes_sb(struct super_block *sb) > +{ > + struct backing_dev_info *bdi = sb->s_bdi; > > /* > * Can't skip on !bdi_has_dirty() because we should wait for !dirty > @@ -2815,12 +2882,15 @@ void sync_inodes_sb(struct super_block *sb) > return; > WARN_ON(!rwsem_is_locked(&sb->s_umount)); > > - /* protect against inode wb switch, see inode_switch_wbs_work_fn() */ > - bdi_down_write_wb_switch_rwsem(bdi); > - bdi_split_work_to_wbs(bdi, &work, false); > - wb_wait_for_completion(&done); > - bdi_up_write_wb_switch_rwsem(bdi); > + if (bdi->is_parallel) { > + struct bdi_writeback_ctx *bdi_wb_ctx; > > + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { > + sync_inodes_bdi_wb_ctx(sb, bdi, bdi_wb_ctx); > + } > + } else { > + sync_inodes_bdi_wb_ctx(sb, bdi, &bdi->default_ctx); > + } > wait_sb_inodes(sb); > } > EXPORT_SYMBOL(sync_inodes_sb); The same comment as above. > @@ -104,6 +104,7 @@ struct wb_completion { > */ > struct bdi_writeback { > struct backing_dev_info *bdi; /* our parent bdi */ > + struct bdi_writeback_ctx *bdi_wb_ctx; > > unsigned long state; /* Always use atomic bitops on this */ > unsigned long last_old_flush; /* last old data flush */ > @@ -160,6 +161,16 @@ struct bdi_writeback { > #endif > }; > > +struct bdi_writeback_ctx { > + struct bdi_writeback wb; /* the root writeback info for this bdi */ > + struct list_head wb_list; /* list of all wbs */ > +#ifdef CONFIG_CGROUP_WRITEBACK > + struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ > + struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ > +#endif > + wait_queue_head_t wb_waitq; > +}; > + As I was seeing the patch, I'd also add: struct backing_dev_info *bdi; pointer to bdi_writeback_ctx. Then in most functions, you can just pass bdi_wb_ctx into them instead of both bdi *and* bdi_wb_ctx and it makes interfaces somewhat nicer. > struct backing_dev_info { > u64 id; > struct rb_node rb_node; /* keyed by ->id */ > @@ -182,16 +193,13 @@ struct backing_dev_info { > * blk-wbt. > */ > unsigned long last_bdp_sleep; > - > - struct bdi_writeback wb; /* the root writeback info for this bdi */ > - struct list_head wb_list; /* list of all wbs */ > + struct bdi_writeback_ctx default_ctx; > + bool is_parallel; > + int nr_wb_ctx; > + struct xarray bdi_wb_ctx_arr; I think xarray here is overkill. I'd just make this a plain array: struct bdi_writeback_ctx **bdi_wb_ctx_arr; which will get allocated with nr_wb_ctx entries during bdi_init(). Also I'd make default_ctx just be entry at index 0 in this array. I'm undecided whether it will be clearer to just drop default_ctx altogether or keep it and set: struct bdi_writeback_ctx *default_ctx = bdi_wb_ctx_arr[0]; on init so I'll leave that up to you. > #ifdef CONFIG_CGROUP_WRITEBACK > - struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ > struct mutex cgwb_release_mutex; /* protect shutdown of wb structs */ > - struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ > #endif > - wait_queue_head_t wb_waitq; > - > struct device *dev; > char dev_name[64]; > struct device *owner; ... > +static inline struct bdi_writeback_ctx *fetch_bdi_writeback_ctx(struct inode *inode) > +{ > + struct backing_dev_info *bdi = inode_to_bdi(inode); > + struct super_block *sb = inode->i_sb; > + struct bdi_writeback_ctx *bdi_wb_ctx; > + > + if (sb->s_op->get_inode_wb_ctx) > + bdi_wb_ctx = sb->s_op->get_inode_wb_ctx(inode); > + else > + bdi_wb_ctx = &bdi->default_ctx; > + return bdi_wb_ctx; > +} The indirect call (and the handling in there) might get potentially expensive given all the places where this is called. For now I think we are fine but eventually we might need to find space in struct inode (I know that's a hard sell) for u8/u16 to store writeback context index there. But first let's get this working before we burry ourselves in (potentially premature) performance optimizations. > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 7e29433c5ecc..667f97c68cd1 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -2228,6 +2228,7 @@ struct super_operations { > struct inode *(*alloc_inode)(struct super_block *sb); > void (*destroy_inode)(struct inode *); > void (*free_inode)(struct inode *); > + struct bdi_writeback_ctx* (*get_inode_wb_ctx)(struct inode *); ^^ wrongly placed space here. > > void (*dirty_inode) (struct inode *, int flags); > int (*write_inode) (struct inode *, struct writeback_control *wbc); Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR