On Mon, Sep 23, 2024 at 07:48:03PM -0700, Linus Torvalds wrote:
> On Mon, 23 Sept 2024 at 19:26, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > And I had missed the issue with PREEMPT_RT and the fact that right now
> > the inode hash lock is outside the inode lock, which is problematic.
>
> .. although I guess that could be solved by just making the spinlock
> be external to the hash list, rather than part of the list like
> list_bh.

That's effectively what the patch did - it added a spinlock per hash
list.

> You wouldn't need to do one lock per list head, you could just do some
> grouping of the hash.

Yes, that's a more complex thing to do, though - it is effectively
using hashed locks for the hash table.

> That said, the vfs inode hash code really is pretty disgusting and has
> been accumulating these warts for a long time. So maybe a "filesystems
> do their own thing" together with some helper infrastructure (like
> using rhashtable) is actually the right thing to do.

Perhaps.

XFS has done its own thing since 2007, but that was because the XFS
inode lifecycle has always existed outside the VFS inode lifecycle.
i.e. the inode has to outlive evict() and ->destroy_inode because it
can still be dirty and tracked by the journal when the VFS evicts it
from the cache. We also have internal inode lookup stuff that we don't
want to go anywhere near the VFS (e.g. inode writeback clustering from
journal item cleaning callbacks).

But this has come at a significant complexity cost, and some of the
biggest problems we've had to solve have been a result of this
architecture. e.g. memory reclaim getting blocked on dirty XFS inodes
that the VFS is unaware of via the fs-specific callout in the
superblock shrinker caused all sorts of long tail memory allocation
latency issues for years. We've fixed that, but it took a long time to
work out how to avoid blocking in memory reclaim without driving the
system immediately into OOM kill frenzies....

There are also issues around RCU freeing of inodes and how the
filesystem level cache handles that. e.g. what happens when the
filesystem doesn't immediately free the inode after memory reclaim
evicts it, but then the fs re-instantiates it moments later when a new
lookup for that inode comes in? The rest of the VFS assumes that the
inode won't reappear until an RCU grace period expires, but having the
re-instantiation code call synchronize_rcu() (or something similar) is
not an option because this sort of situation can occur thousands of
times a second under memory pressure....

We know this because there is an outstanding issue in XFS around this
situation - none of the general solutions to the problem we've tried
have been viable. I suspect that any other filesystem that implements
its own inode cache and does VFS inode re-instantiation on an inode
cache hit will have similar issues, and so there will be wider VFS
architecture impacts as a result.

IOWs, it's not clear to me that this is a caching model we really want
to pursue in general because of a) the code duplication and b) the
issues such an inode life-cycle model has interacting with the VFS
life-cycle expectations...

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
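
[For reference, a minimal sketch of the "grouping of the hash" /
hashed-locks idea discussed above: a small array of spinlocks shared
across a larger hash table, with each bucket index mapped onto a lock.
All names here (i_hash_table, i_hash_locks, I_HASH_LOCK_BITS, ...) are
hypothetical and only illustrate the locking shape, not any patch in
this thread.]

#include <linux/init.h>
#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/hash.h>

#define I_HASH_BITS		10	/* 1024 hash buckets */
#define I_HASH_LOCK_BITS	6	/* 64 locks shared across them */

static struct hlist_head i_hash_table[1 << I_HASH_BITS];
static spinlock_t i_hash_locks[1 << I_HASH_LOCK_BITS];

/* Map a bucket to one of the shared locks; many buckets per lock. */
static inline spinlock_t *i_hash_lock(unsigned int bucket)
{
	return &i_hash_locks[hash_32(bucket, I_HASH_LOCK_BITS)];
}

static void i_hash_insert(struct hlist_node *node, unsigned int bucket)
{
	spinlock_t *lock = i_hash_lock(bucket);

	spin_lock(lock);
	hlist_add_head(node, &i_hash_table[bucket]);
	spin_unlock(lock);
}

static void __init i_hash_init(void)
{
	int i;

	for (i = 0; i < (1 << I_HASH_BITS); i++)
		INIT_HLIST_HEAD(&i_hash_table[i]);
	for (i = 0; i < (1 << I_HASH_LOCK_BITS); i++)
		spin_lock_init(&i_hash_locks[i]);
}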