On Mon, Sep 23, 2024 at 07:48:03PM -0700, Linus Torvalds wrote:
> On Mon, 23 Sept 2024 at 19:26, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > And I had missed the issue with PREEMPT_RT and the fact that right now
> > the inode hash lock is outside the inode lock, which is problematic.
>
> .. although I guess that could be solved by just making the spinlock
> be external to the hash list, rather than part of the list like
> list_bh.

That's effectively what the patch did - it added a spinlock per hash
list.

> You wouldn't need to do one lock per list head, you could just do some
> grouping of the hash.

Yes, that's a more complex thing to do, though - it is effectively
using hashed locks for the hash table.

> That said, the vfs inode hash code really is pretty disgusting and has
> been accumulating these warts for a long time. So maybe a "filesystems
> do their own thing" together with some helper infrastructure (like
> using rhashtable) is actually the right thing to do.

Perhaps.

XFS has done its own thing since 2007, but that was because the XFS
inode lifecycle has always existed outside the VFS inode lifecycle.
i.e. the inode has to outlive evict() and ->destroy_inode because it
can still be dirty and tracked by the journal when the VFS evicts it
from the cache. We also have internal inode lookup stuff that we don't
want to go anywhere near the VFS (e.g. inode writeback clustering from
journal item cleaning callbacks).

But this has come at a significant complexity cost, and some of the
biggest problems we've had to solve have been a result of this
architecture. e.g. memory reclaim getting blocked on dirty XFS inodes
that the VFS is unaware of via the fs-specific callout in the
superblock shrinker caused all sorts of long tail memory allocation
latency issues for years. We've fixed that, but it took a long time to
work out how to avoid blocking in memory reclaim without driving the
system immediately into OOM kill frenzies....

There are also issues around RCU freeing of inodes and how the
filesystem level cache handles that. e.g. what happens when the
filesystem doesn't immediately free the inode after memory reclaim
evicts it, but then the fs re-instantiates it moments later when a new
lookup for that inode comes in? The rest of the VFS assumes that the
inode won't reappear until an RCU grace period expires, but having the
re-instantiation code call synchronize_rcu() (or something similar) is
not an option because this sort of situation can occur thousands of
times a second under memory pressure....

We know this because there is an outstanding issue in XFS around this
situation - none of the general solutions to the problem we've tried
have been viable. I suspect that any other filesystem that implements
its own inode cache and does VFS inode re-instantiation on an inode
cache hit will have similar issues, and so there will be wider VFS
architecture impacts as a result.

IOWs, it's not clear to me that this is a caching model we really want
to pursue in general because of a) the code duplication and b) the
issues such an inode life-cycle model has interacting with the VFS
life-cycle expectations...

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
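
[For reference, a minimal sketch of the "grouping of the hash" /
hashed-locks idea discussed above: a small array of spinlocks shared
across a larger hash table, with each bucket index mapped onto a lock.
All names here (i_hash_table, i_hash_locks, I_HASH_LOCK_BITS, ...) are
hypothetical and only illustrate the locking shape, not any patch in
this thread.]

#include <linux/init.h>
#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/hash.h>

#define I_HASH_BITS		10	/* 1024 hash buckets */
#define I_HASH_LOCK_BITS	6	/* 64 locks shared across them */

static struct hlist_head i_hash_table[1 << I_HASH_BITS];
static spinlock_t i_hash_locks[1 << I_HASH_LOCK_BITS];

/* Map a bucket to one of the shared locks; many buckets per lock. */
static inline spinlock_t *i_hash_lock(unsigned int bucket)
{
	return &i_hash_locks[hash_32(bucket, I_HASH_LOCK_BITS)];
}

static void i_hash_insert(struct hlist_node *node, unsigned int bucket)
{
	spinlock_t *lock = i_hash_lock(bucket);

	spin_lock(lock);
	hlist_add_head(node, &i_hash_table[bucket]);
	spin_unlock(lock);
}

static void __init i_hash_init(void)
{
	int i;

	for (i = 0; i < (1 << I_HASH_BITS); i++)
		INIT_HLIST_HEAD(&i_hash_table[i]);
	for (i = 0; i < (1 << I_HASH_LOCK_BITS); i++)
		spin_lock_init(&i_hash_locks[i]);
}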