On Wed, Sep 25, 2024 at 11:00:10AM GMT, Dave Chinner wrote:
> > Eh? Of course it'd have to be coherent, but just checking if an inode
> > is present in the VFS cache is what, 1-2 cache misses? Depending on
> > hash table fill factor...
>
> Sure, when there is no contention and you have CPU to spare. But the
> moment the lookup hits contention problems (i.e. we are exceeding the
> cache lookup scalability capability), we are straight back to running
> at VFS cache speed instead of uncached speed.

The cache lookups are just reads; they don't introduce scalability
issues unless they're contending with other cores writing to those
cachelines - checking whether an item is present in a hash table is
trivial to do locklessly.

But pulling an inode into the inode cache and then evicting it again
entails a lot more work: just initializing a struct inode is
nontrivial, and then there are the (multiple) shared data structures
you have to manipulate.

> Keep in mind that not having a referenced inode opens up the code to
> things like pre-emption races. i.e. a cache miss doesn't prevent the
> current task from being preempted before it reads the inode
> information into the user buffer. The VFS inode could be instantiated
> and modified before the uncached access runs again and pulls stale
> information from the underlying buffer and returns that to userspace.

Yeah, if you're reading from a buffer cache that doesn't have a lock,
that does get dicey - but in bcachefs, where we're reading from a btree
node that does have a lock, it's quite manageable.

And incidentally, this sort of "we have a cache on top of the btree,
but sometimes we have to do direct access" already comes up a lot in
bcachefs, primarily for the alloc btree. _Tons_ of fun, but it doesn't
actually come up here for us, since we don't use the VFS inode cache as
a writeback cache.
Now, for some completely different silliness: there are actually
_three_ levels of caching for inodes in bcachefs - the btree node
cache, the btree key cache, and then the VFS inode cache. In the first
two, inodes are packed down to ~100 bytes, so it's not that bad, but it
does make you go "...what?".

It would be nice in theory to collapse some of that - but the upside is
that we don't have the interactions between the VFS inode cache and
journalling that XFS has. Still, if VFS inodes no longer have their own
lifetime, like you've been talking about, that might open up
interesting possibilities.