On Thu 20-02-25 19:49:22, Kundan Kumar wrote:
> > Well, that's currently selected by __inode_attach_wb() based on
> > whether there is a memcg attached to the folio/task being dirtied or
> > not. If there isn't a cgroup based writeback task, then it uses the
> > bdi->wb as the wb context.
>
> We have created a proof of concept for per-AG context-based writeback, as
> described in [1]. The AG is mapped to a writeback context (wb_ctx). Using
> the filesystem handler, __mark_inode_dirty() selects the writeback context
> corresponding to the inode.
>
> We attempted to handle memcg and bdi based writeback in a similar manner.
> This approach aims to maintain the original writeback semantics while
> providing parallelism. This helps in pushing more data early to the
> device, trying to ease the write pressure faster.
>
> [1] https://lore.kernel.org/all/20250212103634.448437-1-kundan.kumar@xxxxxxxxxxx/

Yeah, I've seen the patches. Sorry for not getting to you earlier.

> > Then selecting inodes for writeback becomes a list_lru_walk() variant
> > depending on what needs to be written back (e.g. physical node, memcg,
> > both, everything that is dirty everywhere, etc).
>
> We considered using list_lru to track inodes within a writeback context.
> This can be implemented as:
>
> struct bdi_writeback {
>	struct list_lru b_dirty_inodes_lru; // instead of a single b_dirty list
>	struct list_lru b_io_dirty_inodes_lru;
>	...
>	...
> };
>
> By doing this, we would obtain a sharded list of inodes per NUMA node.

I think you've misunderstood Dave's suggestion here. list_lru was given as
an example of a structure for inspiration. We cannot take it directly as is
for writeback purposes because we don't want to be sharding based on NUMA
nodes but rather based on some other (likely FS-driven) criteria.

> However, we would also need per-NUMA writeback contexts. Otherwise, even
> if inodes are NUMA-sharded, a single writeback context would still
> process them sequentially, limiting parallelism. But there's a concern:
> NUMA-based writeback contexts are not aligned with filesystem geometry,
> which could negatively impact delayed allocation and writeback
> efficiency, as you pointed out in your previous reply [2].
>
> Would it be better to let the filesystem dictate the number of writeback
> threads, rather than enforcing a per-NUMA model?

I was thinking about how to best parallelize the writeback and I think
there are two quite different demands for which we probably want two
different levels of parallelism.

One case is the situation when the filesystem for example has multiple
underlying devices (like btrfs or bcachefs) or for other reasons writeback
to different parts is fairly independent (like for different XFS AGs).
Here we want parallelism at a rather high level I think, including
separate dirty throttling, tracking of writeback bandwidth, etc. It is
*almost* like separate bdis (struct backing_dev_info) but I think it would
be technically and also conceptually somewhat easier to do the
multiplexing by factoring out:

	struct bdi_writeback wb;  /* the root writeback info for this bdi */
	struct list_head wb_list; /* list of all wbs */
#ifdef CONFIG_CGROUP_WRITEBACK
	struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
	struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */
#endif
	wait_queue_head_t wb_waitq;

into a new structure (looking for a good name - bdi_writeback_context???)
that can get multiplied (the filesystem can create its own bdi on mount
and configure the number of bdi_writeback_contexts it wants). We also need
to add a hook sb->s_op->get_inode_wb_context() called from
__inode_attach_wb() which will return the appropriate bdi_writeback_context
(or perhaps just its index?) for an inode. This will be used by the
filesystem to direct the writeback code where the inode should go. This is
kind of what Kundan did in the last revision of his patches but I hope
this approach should somewhat limit the changes necessary to the writeback
infrastructure - patch 2 in his series is really unreviewably large...
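To make this a bit more concrete, roughly something like the following
(just a sketch to illustrate the idea - the structure name, the exact set
of members, how the contexts hang off struct backing_dev_info, and the
hook prototype are all placeholders, nothing worked out yet):

/* Sketch only: the fields factored out of struct backing_dev_info */
struct bdi_writeback_context {
	struct bdi_writeback wb;	/* root writeback info for this context */
	struct list_head wb_list;	/* list of all wbs in this context */
#ifdef CONFIG_CGROUP_WRITEBACK
	struct radix_tree_root cgwb_tree;    /* radix tree of active cgroup wbs */
	struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */
#endif
	wait_queue_head_t wb_waitq;
};

struct backing_dev_info {
	...
	int nr_wb_ctx;				/* set by the fs when creating the bdi */
	struct bdi_writeback_context *wb_ctx;	/* array of nr_wb_ctx contexts */
	...
};

/* possible shape of the hook - the prototype is just a guess */
struct super_operations {
	...
	struct bdi_writeback_context *(*get_inode_wb_context)(struct inode *inode);
	...
};

and __inode_attach_wb() would then do something like:

	struct bdi_writeback_context *wb_ctx = &bdi->wb_ctx[0];

	if (inode->i_sb->s_op->get_inode_wb_context)
		wb_ctx = inode->i_sb->s_op->get_inode_wb_context(inode);

Filesystems that don't provide the hook would presumably just keep a
single context, i.e. the current behavior.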
Then another case is a situation where either the amount of CPU work
needed for IO submission is rather high (cases like Christoph mentioned
where the filesystem needs to do checksumming on submission or similar) or
simply the device is rather fast for a single submission thread and the FS
doesn't have a sensible way to partition inodes (e.g. for ext4 there's no
meaningful way of partitioning inodes into independent groups - ext4
allocation groups are small, inodes often span multiple groups, and the
sets of groups used by different inodes randomly overlap). In this case I
think we want a single dirty throttling instance, a single writeback
throughput estimate, a single set of dirty inode lists, etc. The level
where the parallelism needs to happen is fairly low - I'd say duplicate:

	struct delayed_work dwork;	/* work item used for writeback */

in struct bdi_writeback. Again, the number of dworks should be
configurable when creating the bdi for the filesystem.

								Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR