On Wed, Sep 25, 2024 at 11:00:10AM GMT, Dave Chinner wrote:
> > Eh? Of course it'd have to be coherent, but just checking if an inode
> > is present in the VFS cache is what, 1-2 cache misses? Depending on
> > hash table fill factor...
>
> Sure, when there is no contention and you have CPU to spare. But the
> moment the lookup hits contention problems (i.e. we are exceeding the
> cache lookup scalability capability), we are straight back to running
> at VFS cache speed instead of uncached speed.

The cache lookups are just reads; they don't introduce scalability
issues unless they're contending with other cores writing to those
cachelines - checking whether an item is present in a hash table is
trivial to do locklessly.

But pulling an inode into the inode cache and then evicting it again
entails a lot more work: just initializing a struct inode is
nontrivial, and then there are the (multiple) shared data structures
you have to manipulate.

> Keep in mind that not having a referenced inode opens up the code to
> things like pre-emption races. i.e. a cache miss doesn't prevent the
> current task from being preempted before it reads the inode
> information into the user buffer. The VFS inode could be instantiated
> and modified before the uncached access runs again and pulls stale
> information from the underlying buffer and returns that to userspace.

Yeah, if you're reading from a buffer cache that doesn't have a lock,
that does get dicey - but in bcachefs, where we're reading from a btree
node that does have a lock, it's quite manageable.

And incidentally, this sort of "we have a cache on top of the btree,
but sometimes we have to do direct access" already comes up a lot in
bcachefs, primarily for the alloc btree. _Tons_ of fun, but it doesn't
actually come up here for us, since we don't use the VFS inode cache as
a writeback cache.
Now, for some completely different silliness: there are actually
_three_ levels of caching for inodes in bcachefs - the btree node
cache, the btree key cache, and then the VFS inode cache. In the first
two, inodes are packed down to ~100 bytes, so it's not that bad, but it
does make you go "...what?".

It would be nice in theory to collapse some of that - but the upside is
that we don't have the interactions between the VFS inode cache and
journalling that XFS has. Still, if VFS inodes no longer have their own
lifetime, like you've been talking about, that might open up
interesting possibilities.