On Tue, Sep 24, 2024 at 09:57:13AM GMT, Linus Torvalds wrote:
> On Mon, 23 Sept 2024 at 20:55, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > That's effectively what the patch did - it added a spinlock per hash
> > list.
>
> Yeah, no, I like your patches, they all seem to be doing sane things
> and improve the code.
>
> The part I don't like is how the basic model that your patches improve
> seems quite a bit broken.
>
> For example, that whole superblock list lock contention just makes me
> go "Christ, that's stupid". Not because your patches to fix it with
> Waiman's fancy list would be wrong or bad, but because the whole pain
> is self-inflicted garbage.
>
> And it's all historically very very understandable. It wasn't broken
> when it was written.
>
> That singly linked list is 100% sane in the historical context of "we
> really don't want anything fancy for this". The whole list of inodes
> for a superblock is basically unimportant, and it's main use (only
> use?) is for the final "check that we don't have busy inodes at umount
> time, remove any pending stuff".
>
> So using a stupid linked list was absolutely the right thing to do,
> but now that just the *locking* for that list is a pain, it turns out
> that we probably shouldn't use a list at all. Not because the list was
> wrong, but because a flat list is such a pain for locking, since
> there's no hierarchy to spread the locking out over.

Well, lists are just a pain. They should really be radix trees, we're
not doing actual LRU anyways.

> > [ filesystems doing their own optimized thing ]
> >
> > IOWs, it's not clear to me that this is a caching model we really
> > want to persue in general because of a) the code duplication and b)
> > the issues such an inode life-cycle model has interacting with the
> > VFS life-cycle expectations...
>
> No, I'm sure you are right, and I'm just frustrated with this code
> that probably _should_ look a lot more like the dcache code, but
> doesn't.
>
> I get the very strong feeling that we should have a per-superblock
> hash table that we could also traverse the entries of. That way the
> superblock inode list would get some structure (the hash buckets) that
> would allow the locking to be distributed (and we'd only need one lock
> and just share it between the "hash inode" and "add it to the
> superblock list").

I was considering that - that's what the bcachefs btree key cache does,
the shrinker just iterates over the rhashtable.

But referenced inodes aren't kept on the shrinker lists; I hate the
overhead of adding/removing, but Dave was emphatic on there being
workloads where that was really necessary and I'm somewhat convinced.

The other big issue is memcg; I think for that to work we'd need
reclaim to coalesce shrinker calls and shrink from multiple memcgs at
the same time.
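
To put something concrete behind the "should really be radix trees"
comment above - purely a sketch, made-up names, and it assumes i_ino is
a usable index, which not every filesystem can guarantee - the per-sb
inode set as an xarray would look roughly like:

#include <linux/fs.h>
#include <linux/xarray.h>

static int sb_inodes_add(struct xarray *xa, struct inode *inode)
{
        /* xa_store() takes the xarray's internal spinlock itself */
        return xa_err(xa_store(xa, inode->i_ino, inode, GFP_KERNEL));
}

static void sb_inodes_del(struct xarray *xa, struct inode *inode)
{
        xa_erase(xa, inode->i_ino);
}

/* umount-time walk: no flat-list lock, xa_for_each() locks internally */
static void sb_inodes_walk(struct xarray *xa, void (*fn)(struct inode *))
{
        struct inode *inode;
        unsigned long ino;

        xa_for_each(xa, ino, inode)
                fn(inode);
}

(Worth noting an xarray still has a single internal lock for stores, so
by itself this doesn't distribute the write side the way per-bucket
locks would.)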
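
And the per-superblock hash table with the bucket lock doing double
duty for "hash the inode" and "add it to the superblock list" would be
something like this - again just a sketch, none of it is real VFS code,
and I'm borrowing i_hash purely for illustration:

#include <linux/fs.h>
#include <linux/hash.h>
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/spinlock.h>

#define SB_INODE_HASH_BITS      10

struct sb_inode_bucket {
        spinlock_t              lock;
        struct hlist_head       head;
};

struct sb_inode_hash {
        struct sb_inode_bucket  buckets[1 << SB_INODE_HASH_BITS];
};

static struct sb_inode_bucket *sb_inode_bucket(struct sb_inode_hash *h,
                                               unsigned long ino)
{
        return &h->buckets[hash_long(ino, SB_INODE_HASH_BITS)];
}

/* one per-bucket lock covers both "hash the inode" and "add to sb list" */
static void sb_inode_hash_add(struct sb_inode_hash *h, struct inode *inode)
{
        struct sb_inode_bucket *b = sb_inode_bucket(h, inode->i_ino);

        spin_lock(&b->lock);
        hlist_add_head(&inode->i_hash, &b->head);
        spin_unlock(&b->lock);
}

/* umount-time walk gets the same distribution: one bucket at a time */
static void sb_inode_hash_for_each(struct sb_inode_hash *h,
                                   void (*fn)(struct inode *))
{
        unsigned int i;

        for (i = 0; i < ARRAY_SIZE(h->buckets); i++) {
                struct sb_inode_bucket *b = &h->buckets[i];
                struct inode *inode;

                spin_lock(&b->lock);
                hlist_for_each_entry(inode, &b->head, i_hash)
                        fn(inode);
                spin_unlock(&b->lock);
        }
}

The umount-time walk then takes one bucket lock at a time instead of a
single lock over the whole flat list.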
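
For reference, the shape of the key cache approach - the shrinker
walking the rhashtable directly instead of maintaining LRU lists - is
roughly the following. The cache type and the refcount check are made
up, and it ignores the races real code has to handle; the rhashtable
walk and shrinker callbacks are the actual kernel APIs (assuming the
shrinker_alloc()-style private_data):

#include <linux/rcupdate.h>
#include <linux/rhashtable.h>
#include <linux/shrinker.h>
#include <linux/slab.h>

struct cached_obj {                     /* hypothetical cached object */
        struct rhash_head       hash;
        struct rcu_head         rcu;
        unsigned long           key;
        atomic_t                ref;
};

struct obj_cache {                      /* hypothetical cache */
        struct rhashtable       table;
        atomic_long_t           nr;
};

static const struct rhashtable_params obj_cache_params = {
        .head_offset    = offsetof(struct cached_obj, hash),
        .key_offset     = offsetof(struct cached_obj, key),
        .key_len        = sizeof(unsigned long),
};

static unsigned long obj_cache_count(struct shrinker *shrink,
                                     struct shrink_control *sc)
{
        struct obj_cache *c = shrink->private_data;

        return atomic_long_read(&c->nr);
}

static unsigned long obj_cache_scan(struct shrinker *shrink,
                                    struct shrink_control *sc)
{
        struct obj_cache *c = shrink->private_data;
        struct rhashtable_iter iter;
        struct cached_obj *obj;
        unsigned long freed = 0;

        rhashtable_walk_enter(&c->table, &iter);
        rhashtable_walk_start(&iter);   /* under rcu_read_lock() from here */

        while (freed < sc->nr_to_scan &&
               (obj = rhashtable_walk_next(&iter))) {
                if (IS_ERR(obj)) {
                        if (PTR_ERR(obj) == -EAGAIN)
                                continue;       /* table resized, keep going */
                        break;
                }

                /*
                 * No LRU list, so the walk sees referenced objects too
                 * and has to skip them here.
                 */
                if (atomic_read(&obj->ref))
                        continue;

                if (!rhashtable_remove_fast(&c->table, &obj->hash,
                                            obj_cache_params)) {
                        kfree_rcu(obj, rcu);    /* no synchronous free under RCU */
                        atomic_long_dec(&c->nr);
                        freed++;
                }
        }

        rhashtable_walk_stop(&iter);
        rhashtable_walk_exit(&iter);

        return freed ? freed : SHRINK_STOP;
}

The "skip it if it's referenced" check is exactly the tradeoff above:
no add/remove overhead on the hot path, but the scan has to wade
through referenced objects.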