Well, that's currently selected by __inode_attach_wb() based on whether there is a memcg attached to the folio/task being dirtied or not. If there isn't a cgroup-based writeback context to attach, then it uses bdi->wb as the wb context.
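For reference, the selection logic in fs/fs-writeback.c looks roughly like this (simplified here; exact signatures and helpers vary across kernel versions):

void __inode_attach_wb(struct inode *inode, struct folio *folio)
{
        struct backing_dev_info *bdi = inode_to_bdi(inode);
        struct bdi_writeback *wb = NULL;

        if (inode_cgwb_enabled(inode)) {
                struct cgroup_subsys_state *memcg_css;

                /* memcg of the folio being dirtied, or of the current task */
                if (folio) {
                        memcg_css = mem_cgroup_css_from_folio(folio);
                        wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
                } else {
                        memcg_css = task_get_css(current, memory_cgrp_id);
                        wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
                        css_put(memcg_css);
                }
        }

        /* no cgroup writeback -> fall back to the bdi-embedded context */
        if (!wb)
                wb = &bdi->wb;

        /* racing dirtiers: only one wins the inode->i_wb assignment */
        if (unlikely(cmpxchg(&inode->i_wb, NULL, wb)))
                wb_put(wb);
}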
We have created a proof of concept for per-AG context-based writeback, as described in [1]. Each AG is mapped to a writeback context (wb_ctx), and __mark_inode_dirty() uses a filesystem handler to select the writeback context corresponding to the inode. We attempted to handle memcg- and bdi-based writeback in a similar manner.

This approach aims to preserve the original writeback semantics while adding parallelism, so more data can be pushed to the device early and write pressure eased sooner.

[1] https://lore.kernel.org/all/20250212103634.448437-1-kundan.kumar@xxxxxxxxxxx/
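As a rough illustration of the selection step only (the s_op->get_inode_wb_ctx hook and the helper name below are made up for this mail, not the interface actually used in [1]):

/*
 * Illustrative sketch: let the filesystem map a dirty inode to one of
 * several writeback contexts (e.g. XFS returning the context for the
 * inode's AG), falling back to the single bdi-embedded context.
 */
static struct bdi_writeback *inode_select_wb_ctx(struct inode *inode)
{
        struct super_block *sb = inode->i_sb;

        if (sb->s_op->get_inode_wb_ctx)         /* hypothetical hook */
                return sb->s_op->get_inode_wb_ctx(inode);

        return &inode_to_bdi(inode)->wb;        /* current behaviour */
}

__mark_inode_dirty() would then queue the inode on the dirty list of the context returned here, rather than always on the single per-bdi context.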
Then selecting inodes for writeback becomes a list_lru_walk() variant depending on what needs to be written back (e.g. physical node, memcg, both, everything that is dirty everywhere, etc).
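To make that concrete, a per-node walk over a list_lru of dirty inodes (the b_dirty_inodes_lru field sketched below) could look roughly like the following; the list_lru_walk_cb signature has changed across kernel versions (older ones also pass the list spinlock), so treat this as pseudocode:

static enum lru_status dirty_inode_isolate(struct list_head *item,
                                           struct list_lru_one *list,
                                           void *cb_arg)
{
        struct inode *inode = container_of(item, struct inode, i_io_list);
        struct list_head *dispatch = cb_arg;

        /* real code would take inode->i_lock and re-check state here */
        if (inode->i_state & (I_FREEING | I_WILL_FREE))
                return LRU_SKIP;

        /* pull the inode off the shared lru and queue it for this context */
        list_lru_isolate_move(list, item, dispatch);
        return LRU_REMOVED;
}

static void queue_dirty_inodes_for_node(struct bdi_writeback *wb, int nid)
{
        LIST_HEAD(dispatch);
        unsigned long nr_to_walk = 1024;        /* arbitrary batch size */

        list_lru_walk_node(&wb->b_dirty_inodes_lru, nid,
                           dirty_inode_isolate, &dispatch, &nr_to_walk);
        /* ... hand 'dispatch' to whatever flusher serves this node ... */
}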
We considered using list_lru to track dirty inodes within a writeback context, e.g.:

struct bdi_writeback {
        struct list_lru b_dirty_inodes_lru;     /* instead of a single b_dirty list */
        struct list_lru b_io_dirty_inodes_lru;
        ...
};

By doing this, we would obtain a sharded list of inodes per NUMA node. However, we would also need per-NUMA writeback contexts; otherwise, even if the inodes are NUMA-sharded, a single writeback context would still process them sequentially, limiting parallelism.

But there is a concern: NUMA-based writeback contexts are not aligned with filesystem geometry, which could negatively impact delayed allocation and writeback efficiency, as you pointed out in your previous reply [2]. Would it be better to let the filesystem dictate the number of writeback threads, rather than enforcing a per-NUMA model? Do you see it differently?

[2] https://lore.kernel.org/all/Z5qw_1BOqiFum5Dn@xxxxxxxxxxxxxxxxxxx/
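For the sake of discussion, per-NUMA writeback contexts could be hung off the bdi along these lines (field name and sizing are invented for this mail; a real implementation would likely allocate by nr_node_ids):

struct backing_dev_info {
        ...
        struct bdi_writeback    wb;             /* existing default context */
        /* hypothetical: one flusher context per NUMA node */
        struct bdi_writeback    *node_wb[MAX_NUMNODES];
        ...
};

The alternative raised above would instead size the set of contexts by filesystem geometry (e.g. one per AG), with the filesystem picking the context at dirty time as in [1].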