On Tue, Mar 18, 2025 at 02:42:18PM +1100, Dave Chinner wrote:
> On Thu, Mar 13, 2025 at 09:22:00PM +0100, Jan Kara wrote:
> > On Thu 20-02-25 19:49:22, Kundan Kumar wrote:
> > > > Well, that's currently selected by __inode_attach_wb() based on
> > > > whether there is a memcg attached to the folio/task being dirtied or
> > > > not. If there isn't a cgroup based writeback task, then it uses the
> > > > bdi->wb as the wb context.
> > >
> > > We have created a proof of concept for per-AG context-based writeback, as
> > > described in [1]. Each AG is mapped to a writeback context (wb_ctx). Using
> > > the filesystem handler, __mark_inode_dirty() selects the writeback context
> > > corresponding to the inode.
> > >
> > > We attempted to handle memcg- and bdi-based writeback in a similar manner.
> > > This approach aims to maintain the original writeback semantics while
> > > providing parallelism. It helps push more data to the device early,
> > > easing write pressure faster.
> > > [1] https://lore.kernel.org/all/20250212103634.448437-1-kundan.kumar@xxxxxxxxxxx/
> >
> > Yeah, I've seen the patches. Sorry for not getting to you earlier.
> >
> > > > Then selecting inodes for writeback becomes a list_lru_walk()
> > > > variant depending on what needs to be written back (e.g. physical
> > > > node, memcg, both, everything that is dirty everywhere, etc).
> > >
> > > We considered using list_lru to track inodes within a writeback context.
> > > This could be implemented as:
> > >
> > > struct bdi_writeback {
> > > 	struct list_lru b_dirty_inodes_lru; /* instead of a single b_dirty list */
> > > 	struct list_lru b_io_dirty_inodes_lru;
> > > 	...
> > > };
> > >
> > > By doing this, we would obtain a sharded list of inodes per NUMA node.
> >
> > I think you've misunderstood Dave's suggestion here. list_lru was given as
> > an example of a structure for inspiration.
> > We cannot take it directly as is
> > for writeback purposes because we don't want to be sharding based on NUMA
> > nodes but rather based on some other (likely FS-driven) criteria.
>
> Well, you might say that, but.....
>
> ... I was actually thinking of taking the list_lru and abstracting
> its N-way parallelism away from the NUMA infrastructure.
>
> The NUMA awareness of the list_lru is largely in external APIs. The
> only implicit NUMA awareness is in the list_lru_add() function,
> where it converts the object being added to the list to a node ID
> based on where it is physically located in memory.
>
> The only other thing that is NUMA specific is that the list is set
> up with N-way concurrency, where N = the number of NUMA nodes in the
> machine.
>
> So, really, it is just a thin veneer of NUMA wrapped around the
> inherent concurrency built into the structure.
>
> IOWs, when we create a list_lru for a NUMA-aware shrinker, we simply
> use the number of nodes as the N-way parallelism for the list, and
> the existing per-node infrastructure simply feeds the right NUMA
> node ID as the "list index" for it to function as is.
>
> In the case of writeback parallelism, we could create a list_lru
> with the number of AGs as the N-way parallelism for the list, and
> then have the concurrent BDI writeback contexts (one per AG) simply
> provide the AG number as the "list index" for writeback....

If we go ahead with Jan's suggestion, each AG will have a separate
bdi_writeback_context, and each bdi_writeback_context has its own
b_dirty inode list. This naturally partitions inodes per AG. Given
that, do we still need an AG-based sharded list_lru? Am I missing
something here?