On Tue, Sep 24, 2024 at 09:57:13AM -0700, Linus Torvalds wrote:
> On Mon, 23 Sept 2024 at 20:55, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > That's effectively what the patch did - it added a spinlock per hash
> > list.
>
> Yeah, no, I like your patches, they all seem to be doing sane things
> and improve the code.
>
> The part I don't like is how the basic model that your patches improve
> seems quite a bit broken.
>
> For example, that whole superblock list lock contention just makes me
> go "Christ, that's stupid". Not because your patches to fix it with
> Waiman's fancy list would be wrong or bad, but because the whole pain
> is self-inflicted garbage.

I'm not about to disagree with that assessment.

> And it's all historically very very understandable. It wasn't broken
> when it was written.
>
> That singly linked list is 100% sane in the historical context of "we
> really don't want anything fancy for this". The whole list of inodes
> for a superblock is basically unimportant, and it's main use (only
> use?) is for the final "check that we don't have busy inodes at umount
> time, remove any pending stuff".
>
> So using a stupid linked list was absolutely the right thing to do,
> but now that just the *locking* for that list is a pain, it turns out
> that we probably shouldn't use a list at all. Not because the list was
> wrong, but because a flat list is such a pain for locking, since
> there's no hierarchy to spread the locking out over.

Right, that's effectively what the dlist infrastructure has taught us
- we need some kind of hierarchy to spread the locking over. But what
that hierarchy looks like for an "iterate every object" list isn't
really clear cut...

Another option I'd considered was to offload the iteration to
filesystems that have internal tracking mechanisms, e.g. via a
superblock method (say ->for_all_cached_inodes()) that gets passed a
callback function for the operation to perform on a stabilised inode.

This would work for XFS - we already have such an internal
callback-based cache-walking infrastructure - xfs_iwalk() - that is
used extensively for internal admin, gc and scrub/repair functions.
If we could use this for VFS icache traversals, then we wouldn't need
to maintain the sb->s_inodes list at all in XFS.

But I didn't go down that route because I didn't think we wanted to
encourage each major filesystem to have its own unique internal inode
caching implementation with its own special subtle differences. The
XFS icache code is pretty complex and really requires a lot of XFS
expertise to understand - that makes changing global inode caching
behaviour or life cycle semantics much more difficult than it already
is.

That said, if we do decide that a model where filesystems provide
their own inode caches is acceptable, then as a first step we could
convert the generic s_inodes list iteration to the callback model
fairly easily....

> (We used to have that kind of "flat lock" for the dcache too, but
> "dcache_lock" went away many moons ago, and good riddance - but the
> only reason it could go away is that the dcache has a hierarchy, so
> that you can always lock just the local dentry (or the parent dentry)
> if you are just careful).
>
> > [ filesystems doing their own optimized thing ]
> >
> > IOWs, it's not clear to me that this is a caching model we really
> > want to persue in general because of a) the code duplication and b)
> > the issues such an inode life-cycle model has interacting with the
> > VFS life-cycle expectations...
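
To make the callback model above a bit more concrete, this is roughly
the shape I have in mind - a sketch only, since ->for_all_cached_inodes()
doesn't exist and the name is made up. The generic fallback is just the
existing sb->s_inodes walk (the same pattern as drop_pagecache_sb())
wrapped up as a default implementation that a filesystem like XFS could
then replace with something built on xfs_iwalk():

/*
 * Hypothetical super_operations method:
 *
 *	int (*for_all_cached_inodes)(struct super_block *sb,
 *			int (*fn)(struct inode *inode, void *data),
 *			void *data);
 *
 * Default implementation: walk sb->s_inodes the way we do today.
 */
static int generic_for_all_cached_inodes(struct super_block *sb,
			int (*fn)(struct inode *inode, void *data),
			void *data)
{
	struct inode *inode, *toput_inode = NULL;
	int ret = 0;

	spin_lock(&sb->s_inode_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		/* Stabilise the inode before dropping the list lock. */
		__iget(inode);
		spin_unlock(&inode->i_lock);
		spin_unlock(&sb->s_inode_list_lock);

		ret = fn(inode, data);

		/*
		 * Defer the final iput() until we hold the list lock
		 * again so the list cursor can't be freed from under us.
		 */
		iput(toput_inode);
		toput_inode = inode;

		cond_resched();
		spin_lock(&sb->s_inode_list_lock);
		if (ret)
			break;
	}
	spin_unlock(&sb->s_inode_list_lock);
	iput(toput_inode);
	return ret;
}
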
>
> No, I'm sure you are right, and I'm just frustrated with this code
> that probably _should_ look a lot more like the dcache code, but
> doesn't.
>
> I get the very strong feeling that we should have a per-superblock
> hash table that we could also traverse the entries of. That way the
> superblock inode list would get some structure (the hash buckets) that
> would allow the locking to be distributed (and we'd only need one lock
> and just share it between the "hash inode" and "add it to the
> superblock list").

The only problem with this is the size of the per-sb hash tables
needed for scalability - we can't allocate system-sized hash tables
for every superblock just in case a superblock might be asked to
cache 100 million inodes.

That's why Kent used rhashtables for the bcachefs implementation -
they resize according to how many objects are being indexed, and
hence scale both up and down.

That is also why XFS uses multiple radix trees per-sb in its icache
implementation - they scale up efficiently, yet have a small memory
footprint when only a few inodes are in cache in a little used
filesystem.

> But that would require something much more involved than "improve the
> current code".

Yup.

FWIW, I think all this "how do we cache inodes better" discussion is
somewhat glossing over a more important question we need to think
about first: do we even need a fully fledged inode cache anymore?

Every inode brought into cache is pinned by a dentry. The dentry
cache has an LRU and cache aging, and so by the time a dentry is aged
out of the dentry cache and the inode is finally dropped, it has not
been in use for some time. It has aged out of the current working
set.

Given that we've already aged the inode out of the working set, why
do we then need to dump the inode into another LRU and age that out
again before we reclaim the inode? What is this double caching
actually gaining us?

I've done experiments on XFS marking all inodes with I_DONTCACHE,
which means the inode gets removed from the inode cache the moment
its reference count goes to zero (i.e. when the dentry is dropped
from cache). This leaves the inode life cycle mirroring the dentry
life cycle, i.e. the inode is evicted when the dentry is aged out as
per normal.

On SSD based systems, I really don't see any performance degradation
for my typical workstation workloads like git tree operations and
kernel compiles. I don't see any noticeable impact on streaming
inode/data workloads, either. IOWs, the dentry cache appears to be
handling the working set maintenance duties pretty well for most
common workloads. Hence I question whether LRU-based inode cache
aging is needed at all.

So: should we be looking towards gutting the inode cache so that the
in-memory VFS inode life cycle only tracks actively referenced
inodes? If so, then the management of the VFS inodes becomes a lot
simpler as the LRU lock, maintenance and shrinker-based reclaim go
away entirely. Lots of code gets simpler if we trim down the VFS
inode life cycle to remove the caching of unreferenced inodes...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
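
P.S. For anyone who wants to repeat the I_DONTCACHE experiment, it
amounts to something like the sketch below. This is not the actual
patch I ran: hypothetical_fs_setup_inode() is just a stand-in for
wherever a filesystem instantiates its VFS inodes, and the only real
ingredient is the existing d_mark_dontcache() helper.

/*
 * Sketch only. Mark every inode dontcache as it is instantiated so
 * that the VFS evicts it the moment its reference count hits zero
 * instead of parking it on the inode LRU.
 */
static void hypothetical_fs_setup_inode(struct inode *inode)
{
	/* ... normal inode initialisation ... */

	/*
	 * d_mark_dontcache() sets I_DONTCACHE on the inode (and
	 * DCACHE_DONTCACHE on any aliased dentries), so iput_final()
	 * drops the inode instead of adding it to the LRU when the
	 * last reference goes away.
	 */
	d_mark_dontcache(inode);
}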