On Tue 11-02-25 12:13:18, Dave Chinner wrote:
> On Mon, Feb 10, 2025 at 06:28:28PM +0100, Jan Kara wrote:
> > On Tue 04-02-25 06:06:42, Christoph Hellwig wrote:
> > > On Tue, Feb 04, 2025 at 01:50:08PM +1100, Dave Chinner wrote:
> > > > I doubt that will create enough concurrency for a typical small
> > > > server or desktop machine that only has a single NUMA node but has
> > > > a couple of fast nvme SSDs in it.
> > > >
> > > > > 2) Fixed number of writeback contexts, say min(10, numcpu).
> > > > > 3) NUMCPU/N number of writeback contexts.
> > > >
> > > > These don't take into account the concurrency available from
> > > > the underlying filesystem or storage.
> > > >
> > > > That's the point I was making - CPU count has -zero- relationship to
> > > > the concurrency the filesystem and/or storage provide the system. It
> > > > is fundamentally incorrect to base decisions about IO concurrency on
> > > > the number of CPU cores in the system.
> > >
> > > Yes. But as mentioned in my initial reply, there is a use case for
> > > more WB threads than fs writeback contexts, which is when the
> > > writeback threads do CPU intensive work like compression. Being able
> > > to do that from normal writeback threads vs forking out to fs level
> > > threads would really simplify the btrfs code a lot. Not really
> > > interesting for XFS right now of course.
> > >
> > > Or in other words: fs / device geometry really should be the main
> > > driver, but if a file system supports compression (or really
> > > expensive data checksums), being able to scale up the number of
> > > threads per context might still make sense. But that's really the
> > > advanced part; we'll need to get the fs geometry alignment to work
> > > first.
> >
> > As I'm reading the thread, it sounds to me like the writeback
> > subsystem should provide an API for the filesystem to configure the
> > number of writeback contexts, which would be kind of similar to what
> > we currently do for cgroup aware writeback?
>
> Yes, that's pretty much what I've been trying to say.
>
> > Currently we create a writeback context per cgroup, so now
> > additionally we'll have some property like "inode writeback locality"
> > that will also influence what inode->i_wb gets set to and hence where
> > mark_inode_dirty() files inodes etc.
>
> Well, that's currently selected by __inode_attach_wb() based on
> whether there is a memcg attached to the folio/task being dirtied or
> not. If there isn't a cgroup based writeback task, then it uses the
> bdi->wb as the wb context.
>
> In my mind, what you are describing above sounds like we would be
> heading down the same road list_lru started down back in 2012 to
> support NUMA scalability for LRU based memory reclaim.
>
> i.e. we originally had a single global LRU list for important
> caches. This didn't scale up, so I introduced the list_lru construct
> to abstract the physical layout of the LRU from the objects being
> stored on it and the reclaim infrastructure walking it. That gave us
> per-NUMA-node LRUs and NUMA-aware shrinkers for memory reclaim. The
> fundamental concept was that we abstract away the sharding of the
> object tracking into per-physical-node structures via generic
> infrastructure (i.e. list_lru).
>
> Then memcgs needed memory reclaim, and so they were added as extra
> lists with a different indexing mechanism to the list_lru contexts.
> These weren't per-node lists because there could be thousands of
> them. Hence it was just a single "global" list per memcg, and so it
> didn't scale on large machines.
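To make the construct being referred to concrete, a simplified sketch of
that sharding -- per-NUMA-node lists for the global case, a single list
per memcg for the original cgroup case -- might look roughly like the
following. The type and field names are illustrative only, not the exact
definitions from include/linux/list_lru.h, which differ between kernel
versions.

#include <linux/cache.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct sketch_lru_one {
	struct list_head	list;		/* objects on this shard */
	long			nr_items;
};

struct sketch_lru_node {			/* one instance per NUMA node */
	spinlock_t		lock;		/* protects this node's lists */
	struct sketch_lru_one	lru;		/* "global" objects on this node */
} ____cacheline_aligned_in_smp;

struct sketch_lru {
	struct sketch_lru_node	*node;		/* array sized by nr_node_ids */
	/*
	 * The memcg-aware case adds per-memcg sketch_lru_one instances,
	 * which started out as one list per memcg (not per node) -- the
	 * scalability gap described in the next paragraph.
	 */
};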
> This wasn't seen as a problem initially, but a few years later
> applications using memcgs wanted to scale properly on large NUMA
> systems. So now we have each memcg tracking the physical per-node
> memory usage for reclaim purposes (i.e. a combinatorial explosion of
> memcg vs per-node lists).
>
> Hence suggesting "physically sharded lists for global objects,
> single per-cgroup lists for cgroup-owned objects" sounds like
> exactly the same problem space progression is about to play out with
> writeback contexts.
>
> i.e. we shard the global writeback context into a set of physically
> sharded lists for scalability and performance reasons, but leave
> cgroups with the old single threaded list constructs. Then someone
> says "my cgroup based workload doesn't perform the same as a global
> workload" and we're off to solve the problem list_lru solves again.
>
> So....
>
> Should we be looking towards using a subset of the existing list_lru
> functionality for writeback contexts here? i.e. create a list_lru
> object with N-way scalability, allow the fs to provide an
> inode-number-to-list mapping function, and use the list_lru
> interfaces to abstract away everything physical and cgroup related
> for tracking dirty inodes?

Interesting idea. Indeed, the similarity with the problems list_lru is
solving is significant. I like the idea.

								Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
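As a purely illustrative sketch of the interface floated above -- every
name below is hypothetical rather than an existing kernel symbol -- the
filesystem-supplied mapping might look something like this: the fs
reports how many writeback shards its geometry supports and maps each
inode onto one of them, while the generic writeback code sizes an N-way,
list_lru-style structure from that and consults the mapping when
mark_inode_dirty() files a dirty inode.

#include <linux/fs.h>

/* Hypothetical ops vector a filesystem could register with the
 * writeback code.  Nothing like this exists today; it only sketches
 * the shape of the proposal. */
struct wb_shard_ops {
	unsigned int (*nr_shards)(struct super_block *sb);
	unsigned int (*inode_to_shard)(struct super_block *sb,
				       struct inode *inode);
};

/*
 * Example mapping for an XFS-like filesystem that wants one writeback
 * context per allocation group.  example_nr_ags() stands in for
 * whatever geometry query the fs actually has; the modulo mapping is
 * just a placeholder.
 */
static unsigned int example_nr_ags(struct super_block *sb)
{
	return 8;
}

static unsigned int example_inode_to_shard(struct super_block *sb,
					   struct inode *inode)
{
	return inode->i_ino % example_nr_ags(sb);
}

static const struct wb_shard_ops example_wb_shard_ops = {
	.nr_shards	= example_nr_ags,
	.inode_to_shard	= example_inode_to_shard,
};

Whether the mapping keys off inode number, allocation group, or device
would then be entirely the filesystem's choice, which keeps CPU count
out of the generic code's concurrency decisions, as argued earlier in
the thread.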