On Thu, Mar 13, 2025 at 09:22:00PM +0100, Jan Kara wrote:
> On Thu 20-02-25 19:49:22, Kundan Kumar wrote:
> > > Well, that's currently selected by __inode_attach_wb() based on
> > > whether there is a memcg attached to the folio/task being dirtied or
> > > not. If there isn't a cgroup based writeback task, then it uses the
> > > bdi->wb as the wb context.
> >
> > We have created a proof of concept for per-AG context-based writeback, as
> > described in [1]. The AG is mapped to a writeback context (wb_ctx). Using
> > the filesystem handler, __mark_inode_dirty() selects the writeback context
> > corresponding to the inode.
> >
> > We attempted to handle memcg and bdi based writeback in a similar manner.
> > This approach aims to maintain the original writeback semantics while
> > providing parallelism. This helps in pushing more data early to the
> > device, trying to ease the write pressure faster.
> >
> > [1] https://lore.kernel.org/all/20250212103634.448437-1-kundan.kumar@xxxxxxxxxxx/
>
> Yeah, I've seen the patches. Sorry for not getting to you earlier.
>
> > > Then selecting inodes for writeback becomes a list_lru_walk()
> > > variant depending on what needs to be written back (e.g. physical
> > > node, memcg, both, everything that is dirty everywhere, etc).
> >
> > We considered using list_lru to track inodes within a writeback context.
> > This can be implemented as:
> >
> > struct bdi_writeback {
> > 	struct list_lru b_dirty_inodes_lru;	// instead of a single b_dirty list
> > 	struct list_lru b_io_dirty_inodes_lru;
> > 	...
> > 	...
> > };
> >
> > By doing this, we would obtain a sharded list of inodes per NUMA node.
>
> I think you've misunderstood Dave's suggestion here. list_lru was given as
> an example of a structure for inspiration. We cannot take it directly as is
> for writeback purposes because we don't want to be sharding based on NUMA
> nodes but rather based on some other (likely FS driven) criteria.

Well, you might say that, but.....

... I was actually thinking of taking the list_lru and abstracting its
N-way parallelism away from the NUMA infrastructure.

The NUMA awareness of the list_lru is largely in external APIs. The only
implicit NUMA awareness is in the list_lru_add() function, where it
converts the object being added to the list to a node ID based on where
it is physically located in memory.

The only other thing that is NUMA specific is that the list is set up
with N-way concurrency where N = the number of NUMA nodes in the machine.
So, really, it is just a thin veneer of NUMA wrapped around the inherent
concurrency built into the structure.

IOWs, when we create a list_lru for a NUMA aware shrinker, we simply use
the number of nodes as the N-way parallelism for the list, and the
existing per-node infrastructure simply feeds the right NUMA node ID as
the "list index" for it to function as is.

In the case of writeback parallelism, we could create a list_lru with the
number of AGs as the N-way parallelism for the list, and then have the
concurrent BDI writeback contexts (1 per AG) simply provide the AG number
as the "list index" for writeback....
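To make that a bit more concrete, this is roughly the shape of API I have
in mind. To be clear, this is only a sketch: list_lru_init_sharded(),
list_lru_add_index(), list_lru_walk_index() and writeback_one_inode() do
not exist today, they are made-up names for the indexed variants we would
need, and wb_ctx_lists/wbl/agno/wbc/nr_to_write are placeholders:

	/*
	 * Sketch only: the *_sharded/*_index helpers are hypothetical
	 * indexed variants of the existing list_lru API.
	 */
	struct wb_ctx_lists {
		struct list_lru	b_dirty_inodes;	/* N-way sharded dirty inode list */
	};

	/* N-way concurrency comes from FS geometry, not nr_node_ids. */
	error = list_lru_init_sharded(&wbl->b_dirty_inodes,
				      mp->m_sb.sb_agcount);

	/* Dirtying an inode: the FS supplies the shard index explicitly. */
	list_lru_add_index(&wbl->b_dirty_inodes, &inode->i_io_list, agno);

	/* Writeback: each per-AG wb context walks only its own shard. */
	nr_written = list_lru_walk_index(&wbl->b_dirty_inodes, agno,
					 writeback_one_inode, wbc,
					 nr_to_write);

i.e. the only conceptual change from how list_lru is used today is that
the shard index comes from the filesystem rather than being derived from
where the object sits in memory.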
> > However, we would also need per-NUMA writeback contexts. Otherwise,
> > even if inodes are NUMA-sharded, a single writeback context would still
> > process them sequentially, limiting parallelism. But there's a concern:
> > NUMA-based writeback contexts are not aligned with filesystem geometry,
> > which could negatively impact delayed allocation and writeback efficiency,
> > as you pointed out in your previous reply [2].
> >
> > Would it be better to let the filesystem dictate the number of writeback
> > threads, rather than enforcing a per-NUMA model?
>
> I was thinking about how to best parallelize the writeback and I think
> there are two quite different demands for which we probably want two
> different levels of parallelism.
>
> One case is the situation when the filesystem for example has multiple
> underlying devices (like btrfs or bcachefs) or for other reasons writeback
> to different parts is fairly independent (like for different XFS AGs). Here
> we want parallelism at a rather high level, I think, including separate
> dirty throttling, tracking of writeback bandwidth etc. It is *almost* like
> separate bdis (struct backing_dev_info) but I think it would be technically
> and also conceptually somewhat easier to do the multiplexing by factoring
> out:
>
> 	struct bdi_writeback wb;  /* the root writeback info for this bdi */
> 	struct list_head wb_list; /* list of all wbs */
> #ifdef CONFIG_CGROUP_WRITEBACK
> 	struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
> 	struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */
> #endif
> 	wait_queue_head_t wb_waitq;
>
> into a new structure (looking for a good name - bdi_writeback_context???)
> that can get multiplied (the filesystem can create its own bdi on mount and
> configure the number of bdi_writeback_contexts it wants). We also need to
> add a hook sb->s_ops->get_inode_wb_context() called from __inode_attach_wb()
> which will return the appropriate bdi_writeback_context (or perhaps just its
> index?) for an inode. This will be used by the filesystem to direct the
> writeback code where the inode should go. This is kind of what Kundan did
> in the last revision of his patches but I hope this approach should
> somewhat limit the changes necessary to the writeback infrastructure - the
> patch 2 in his series is really unreviewably large...

Yes, this is the equivalent of the SHRINKER_NUMA_AWARE flag that triggers
the shrinker infrastructure to track work in a NUMA aware way when the
subsystem shrinker callbacks are using the NUMA aware list_lru to back
it....
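FWIW, as a sketch of the shape of that factoring (none of this is existing
code - the structure layout, the nr_wb_ctx field and the
xfs_get_inode_wb_context() example are made up here for illustration, and
returning an index is just one of the two options you mentioned):

	/* Sketch only: the fields Jan listed, pulled out of backing_dev_info. */
	struct bdi_writeback_context {
		struct bdi_writeback wb;	/* root wb for this context */
		struct list_head wb_list;	/* list of all wbs */
	#ifdef CONFIG_CGROUP_WRITEBACK
		struct radix_tree_root cgwb_tree;
		struct rw_semaphore wb_switch_rwsem;
	#endif
		wait_queue_head_t wb_waitq;
	};

	struct backing_dev_info {
		...
		int nr_wb_ctx;				/* set by the FS at bdi setup */
		struct bdi_writeback_context *wb_ctx;	/* array of contexts */
		...
	};

	/*
	 * Hypothetical XFS implementation of the proposed
	 * sb->s_ops->get_inode_wb_context() hook, returning a context
	 * index (the AG number) rather than a pointer.
	 */
	static int xfs_get_inode_wb_context(struct inode *inode)
	{
		struct xfs_inode *ip = XFS_I(inode);

		return XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino);
	}

__inode_attach_wb() would then use that index to pick which
bdi_writeback_context the inode's dirty tracking lives under, and
everything below that point stays per-context.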
> Then another case is a situation where either the amount of CPU work is
> rather high for IO submission (cases like Christoph mentioned where the
> filesystem needs to do checksumming on submission or similar) or simply the
> device is rather fast for a single submission thread and the FS doesn't
> have a sensible way to partition inodes (e.g. for ext4 there's no
> meaningful way of partitioning inodes into independent groups - ext4
> allocation groups are small, inodes often span multiple groups, and the
> sets of groups used by different inodes randomly overlap). In this case I
> think we want a single dirty throttling instance, single writeback
> throughput estimation, single set of dirty inode lists etc. The level where
> the parallelism needs to happen is fairly low - I'd say duplicate:
>
> 	struct delayed_work dwork;	/* work item used for writeback */
>
> in struct bdi_writeback. Again, the number of dworks should be configurable
> when creating the bdi for the filesystem.

Right - this is what I described early on in the discussion as a need for
both scaling out across bdi contexts as well as scaling up within a single
bdi context.

That is, even within the context of a BDI writeback context per AG in XFS,
we can have need for multiple IO submissions to be in flight at once. e.g.
one IO submission gets blocked doing allocation, but there is a heap of
pure overwrites queued up behind it. We can submit the pure overwrites
without getting blocked on the allocation, but if we don't have IO
submission concurrency within a BDI, we cannot do that at all.

IOWs, these aren't exclusive optimisations to writeback; there are
situations where only one of the two mechanisms can solve the concurrency
problem. i.e. we need both high level and low level concurrency mechanisms
in filesystems like XFS....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx