On Mon, Feb 10, 2025 at 06:28:28PM +0100, Jan Kara wrote:
> On Tue 04-02-25 06:06:42, Christoph Hellwig wrote:
> > On Tue, Feb 04, 2025 at 01:50:08PM +1100, Dave Chinner wrote:
> > > I doubt that will create enough concurrency for a typical small
> > > server or desktop machine that only has a single NUMA node but has a
> > > couple of fast nvme SSDs in it.
> > >
> > > > 2) Fixed number of writeback contexts, say min(10, numcpu).
> > > > 3) NUMCPU/N number of writeback contexts.
> > >
> > > These don't take into account the concurrency available from
> > > the underlying filesystem or storage.
> > >
> > > That's the point I was making - CPU count has -zero- relationship to
> > > the concurrency the filesystem and/or storage provide the system. It
> > > is fundamentally incorrect to base decisions about IO concurrency on
> > > the number of CPU cores in the system.
> >
> > Yes. But as mentioned in my initial reply there is a use case for more
> > WB threads than fs writeback contexts, which is when the writeback
> > threads do CPU intensive work like compression. Being able to do that
> > from normal writeback threads vs forking out to fs level threads
> > would really simplify the btrfs code a lot. Not really interesting for
> > XFS right now of course.
> >
> > Or in other words: fs / device geometry really should be the main
> > driver, but if a filesystem supports compression (or really expensive
> > data checksums) being able to scale up the number of threads per
> > context might still make sense. But that's really the advanced part,
> > we'll need to get the fs geometry aligned to work first.
>
> As I'm reading the thread it sounds to me the writeback subsystem should
> provide an API for the filesystem to configure the number of writeback
> contexts, which would be kind of similar to what we currently do for
> cgroup aware writeback?

Yes, that's pretty much what I've been trying to say.
> Currently we create writeback context per cgroup so now additionally
> we'll have some property like "inode writeback locality" that will
> also influence what inode->i_wb gets set to and hence where
> mark_inode_dirty() files inodes etc.

Well, that's currently selected by __inode_attach_wb() based on whether
there is a memcg attached to the folio/task being dirtied or not. If
there isn't a cgroup based writeback task, then it uses the bdi->wb as
the wb context.

In my mind, what you are describing above sounds like we would be
heading down the same road list_lru started down back in 2012 to
support NUMA scalability for LRU based memory reclaim.

i.e. we originally had a single global LRU list for important caches.
This didn't scale up, so I introduced the list_lru construct to
abstract the physical layout of the LRU from the objects being stored
on it and the reclaim infrastructure walking it. That gave us
per-NUMA-node LRUs and NUMA-aware shrinkers for memory reclaim. The
fundamental concept was that we abstract away the sharding of the
object tracking into per-physical-node structures via generic
infrastructure (i.e. list_lru).

Then memcgs needed memory reclaim, and so they were added as extra
lists with a different indexing mechanism to the list_lru contexts.
These weren't per-node lists because there could be thousands of them.
Hence it was just a single "global" list per memcg, and so it didn't
scale on large machines. This wasn't seen as a problem initially, but
a few years later applications using memcgs wanted to scale properly
on large NUMA systems. So now we have each memcg tracking the physical
per-node memory usage for reclaim purposes (i.e. a combinatorial
explosion of memcg vs per-node lists).

Hence suggesting "physically sharded lists for global objects, single
per-cgroup lists for cgroup-owned objects" sounds like exactly the
same problem space progression is about to play out with writeback
contexts. i.e.
we shard the global writeback context into a set of physically sharded
lists for scalability and performance reasons, but leave cgroups with
the old single threaded list constructs. Then someone says "my cgroup
based workload doesn't perform the same as a global workload" and
we're off to solve the problem list_lru solves again.

So.... Should we be looking towards using a subset of the existing
list_lru functionality for writeback contexts here? i.e. create a
list_lru object with N-way scalability, allow the fs to provide an
inode-number-to-list mapping function, and use the list_lru interfaces
to abstract away everything physical and cgroup related for tracking
dirty inodes?

Then selecting inodes for writeback becomes a list_lru_walk() variant
depending on what needs to be written back (e.g. physical node, memcg,
both, everything that is dirty everywhere, etc).

> Then additionally you'd like to have the possibility to have more
> processes operate on one struct wb_writeback (again configurable by
> the filesystem?) to handle cases of cpu intensive writeback
> submission more efficiently.

Yeah, that is a separate problem...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx