On Tue, Sep 24, 2024 at 09:57:13AM GMT, Linus Torvalds wrote:
> On Mon, 23 Sept 2024 at 20:55, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > That's effectively what the patch did - it added a spinlock per hash
> > list.
>
> Yeah, no, I like your patches, they all seem to be doing sane things
> and improve the code.
>
> The part I don't like is how the basic model that your patches improve
> seems quite a bit broken.
>
> For example, that whole superblock list lock contention just makes me
> go "Christ, that's stupid". Not because your patches to fix it with
> Waiman's fancy list would be wrong or bad, but because the whole pain
> is self-inflicted garbage.
>
> And it's all historically very very understandable. It wasn't broken
> when it was written.
>
> That singly linked list is 100% sane in the historical context of "we
> really don't want anything fancy for this". The whole list of inodes
> for a superblock is basically unimportant, and it's main use (only
> use?) is for the final "check that we don't have busy inodes at umount
> time, remove any pending stuff".
>
> So using a stupid linked list was absolutely the right thing to do,
> but now that just the *locking* for that list is a pain, it turns out
> that we probably shouldn't use a list at all. Not because the list was
> wrong, but because a flat list is such a pain for locking, since
> there's no hierarchy to spread the locking out over.

Well, lists are just a pain. They should really be radix trees, we're
not doing actual LRU anyways.

> > [ filesystems doing their own optimized thing ]
> >
> > IOWs, it's not clear to me that this is a caching model we really
> > want to persue in general because of a) the code duplication and b)
> > the issues such an inode life-cycle model has interacting with the
> > VFS life-cycle expectations...
>
> No, I'm sure you are right, and I'm just frustrated with this code
> that probably _should_ look a lot more like the dcache code, but
> doesn't.
>
> I get the very strong feeling that we should have a per-superblock
> hash table that we could also traverse the entries of. That way the
> superblock inode list would get some structure (the hash buckets) that
> would allow the locking to be distributed (and we'd only need one lock
> and just share it between the "hash inode" and "add it to the
> superblock list").

I was considering that - that's what the bcachefs btree key cache does,
the shrinker just iterates over the rhashtable.

But referenced inodes aren't kept on the shrinker lists; I hate the
overhead of adding/removing, but Dave was emphatic on there being
workloads where that was really necessary and I'm somewhat convinced.

The other big issue is memcg; I think for that to work we'd need
reclaim to coalesce shrinker calls and shrink from multiple memcgs at
the same time.
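
To put something concrete behind the "should really be radix trees"
comment above - purely a sketch, made-up names, and it assumes i_ino is
a usable index, which not every filesystem can guarantee - the per-sb
inode set as an xarray would look roughly like:

#include <linux/fs.h>
#include <linux/xarray.h>

static int sb_inodes_add(struct xarray *xa, struct inode *inode)
{
        /* xa_store() takes the xarray's internal spinlock itself */
        return xa_err(xa_store(xa, inode->i_ino, inode, GFP_KERNEL));
}

static void sb_inodes_del(struct xarray *xa, struct inode *inode)
{
        xa_erase(xa, inode->i_ino);
}

/* umount-time walk: no flat-list lock, xa_for_each() locks internally */
static void sb_inodes_walk(struct xarray *xa, void (*fn)(struct inode *))
{
        struct inode *inode;
        unsigned long ino;

        xa_for_each(xa, ino, inode)
                fn(inode);
}

(Worth noting an xarray still has a single internal lock for stores, so
by itself this doesn't distribute the write side the way per-bucket
locks would.)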
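
And the per-superblock hash table with the bucket lock doing double
duty for "hash the inode" and "add it to the superblock list" would be
something like this - again just a sketch, none of it is real VFS code,
and I'm borrowing i_hash purely for illustration:

#include <linux/fs.h>
#include <linux/hash.h>
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/spinlock.h>

#define SB_INODE_HASH_BITS      10

struct sb_inode_bucket {
        spinlock_t              lock;
        struct hlist_head       head;
};

struct sb_inode_hash {
        struct sb_inode_bucket  buckets[1 << SB_INODE_HASH_BITS];
};

static struct sb_inode_bucket *sb_inode_bucket(struct sb_inode_hash *h,
                                               unsigned long ino)
{
        return &h->buckets[hash_long(ino, SB_INODE_HASH_BITS)];
}

/* one per-bucket lock covers both "hash the inode" and "add to sb list" */
static void sb_inode_hash_add(struct sb_inode_hash *h, struct inode *inode)
{
        struct sb_inode_bucket *b = sb_inode_bucket(h, inode->i_ino);

        spin_lock(&b->lock);
        hlist_add_head(&inode->i_hash, &b->head);
        spin_unlock(&b->lock);
}

/* umount-time walk gets the same distribution: one bucket at a time */
static void sb_inode_hash_for_each(struct sb_inode_hash *h,
                                   void (*fn)(struct inode *))
{
        unsigned int i;

        for (i = 0; i < ARRAY_SIZE(h->buckets); i++) {
                struct sb_inode_bucket *b = &h->buckets[i];
                struct inode *inode;

                spin_lock(&b->lock);
                hlist_for_each_entry(inode, &b->head, i_hash)
                        fn(inode);
                spin_unlock(&b->lock);
        }
}

The umount-time walk then takes one bucket lock at a time instead of a
single lock over the whole flat list.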
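
For reference, the shape of the key cache approach - the shrinker
walking the rhashtable directly instead of maintaining LRU lists - is
roughly the following. The cache type and the refcount check are made
up, and it ignores the races real code has to handle; the rhashtable
walk and shrinker callbacks are the actual kernel APIs (assuming the
shrinker_alloc()-style private_data):

#include <linux/rcupdate.h>
#include <linux/rhashtable.h>
#include <linux/shrinker.h>
#include <linux/slab.h>

struct cached_obj {                     /* hypothetical cached object */
        struct rhash_head       hash;
        struct rcu_head         rcu;
        unsigned long           key;
        atomic_t                ref;
};

struct obj_cache {                      /* hypothetical cache */
        struct rhashtable       table;
        atomic_long_t           nr;
};

static const struct rhashtable_params obj_cache_params = {
        .head_offset    = offsetof(struct cached_obj, hash),
        .key_offset     = offsetof(struct cached_obj, key),
        .key_len        = sizeof(unsigned long),
};

static unsigned long obj_cache_count(struct shrinker *shrink,
                                     struct shrink_control *sc)
{
        struct obj_cache *c = shrink->private_data;

        return atomic_long_read(&c->nr);
}

static unsigned long obj_cache_scan(struct shrinker *shrink,
                                    struct shrink_control *sc)
{
        struct obj_cache *c = shrink->private_data;
        struct rhashtable_iter iter;
        struct cached_obj *obj;
        unsigned long freed = 0;

        rhashtable_walk_enter(&c->table, &iter);
        rhashtable_walk_start(&iter);   /* under rcu_read_lock() from here */

        while (freed < sc->nr_to_scan &&
               (obj = rhashtable_walk_next(&iter))) {
                if (IS_ERR(obj)) {
                        if (PTR_ERR(obj) == -EAGAIN)
                                continue;       /* table resized, keep going */
                        break;
                }

                /*
                 * No LRU list, so the walk sees referenced objects too
                 * and has to skip them here.
                 */
                if (atomic_read(&obj->ref))
                        continue;

                if (!rhashtable_remove_fast(&c->table, &obj->hash,
                                            obj_cache_params)) {
                        kfree_rcu(obj, rcu);    /* no synchronous free under RCU */
                        atomic_long_dec(&c->nr);
                        freed++;
                }
        }

        rhashtable_walk_stop(&iter);
        rhashtable_walk_exit(&iter);

        return freed ? freed : SHRINK_STOP;
}

The "skip it if it's referenced" check is exactly the tradeoff above:
no add/remove overhead on the hot path, but the scan has to wade
through referenced objects.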