On Sun, Nov 05, 2023 at 07:54:16PM +0000, Al Viro wrote:
> On Tue, Oct 31, 2023 at 12:18:48AM +0000, Al Viro wrote:
> > On Mon, Oct 30, 2023 at 12:18:28PM -1000, Linus Torvalds wrote:
> > > On Mon, 30 Oct 2023 at 11:53, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > After fixing a couple of brainos, it seems to work.
> > >
> > > This all makes me unnaturally nervous, probably because it's overly
> > > subtle, and I have lost the context for some of the rules.
> >
> > A bit of context: I started to look at the possibility of refcount
> > overflows.  Writing the current rules for dentry refcounting and
> > lifetime down was the obvious first step, and that immediately turned
> > into an awful mess.
> >
> > It is overly subtle.
>
> Another piece of too subtle shite: the ordering of ->d_iput() of a child
> vs. __dentry_kill() of its parent.  As it is, in some cases it is
> possible for the latter to happen before the former.  It is *not*
> possible in the cases when in-tree ->d_iput() instances actually look at
> the parent (all of those are due to sillyrename stuff), but the proof is
> convoluted and very brittle.
>
> The origin of that mess is in the interaction of
> shrink_dcache_for_umount() with shrink_dentry_list().  What we want to
> avoid is a parent directory looking busy because
> shrink_dcache_for_umount() doesn't see any children that would account
> for its positive refcount.  The kinda-sorta solution we use is to
> decrement the parent's refcount *before* __dentry_kill() of the child
> and put said parent into a shrink list.  That makes
> shrink_dcache_for_umount() do the right thing, but it's possible to end
> up with the parent freed before we are done killing the child; the
> scenario is non-obvious and rather hard to hit, but it's not impossible.
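For reference, the ordering in question is in the tail of the current
shrink_dentry_list() loop - the same lines the diff at the end of this
mail removes; the parent's refcount is dropped, and the parent possibly
queued on the shrink list, before __dentry_kill() of the child gets
around to ->d_iput():

		parent = dentry->d_parent;
		if (parent != dentry && !--parent->d_lockref.count)
			to_shrink_list(parent, list);
		__dentry_kill(dentry);	/* child's ->d_iput() runs in here */

so whoever picks the parent off that shrink list can proceed to kill it
without waiting for the child's ->d_iput() to finish.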
D'oh...  We actually don't need to worry about eviction on memory
pressure at that point; unregister_shrinker() is done early enough to
prevent that.  So shrink_dcache_for_umount() does not need to worry
about shrink list use in prune_dcache_sb().

The use in namespace_unlock() is guaranteed that all dentries involved
either have a matching mount in the list of mounts to be dropped (and
are thus protected from simultaneous fs shutdown) or have a matching
mount pinned by the caller.  The use in mntput_no_expire() has the same
guarantee - all dentries involved are on the superblock of the mount we
are going to drop after the call of shrink_dentry_list().  All other
users either hold an active reference to the superblock, or are done
synchronously by ->kill_sb() (and thus can't race with
shrink_dcache_for_umount()), or are done async but flushed and/or waited
for by foofs_kill_sb() before it gets to shrink_dcache_for_umount().

IOW, I'd been too paranoid in "Teach shrink_dcache_parent() to cope with
mixed-filesystem shrink lists" - the real requirements are milder;
in-tree users didn't need these games with the parent.  The dcache side
of Tobin's Slab Movable Objects patches did need them, though...

AFAICS, there are 3 options:

1) Leave the current weirdness with ->d_iput() on the child vs.
__dentry_kill() on the parent.  Document the requirement for ->d_iput()
(and ->d_release()) instances to cope with that, promise that in the
sillyrename case the ordering will be there, and write down the proof of
that.  No code changes, rather revolting docs to write, and trouble
waiting to happen in ->d_iput().

2) Require that shrink_dentry_list() never overlaps with
shrink_dcache_for_umount() on any of the filesystems represented in the
shrink list, guarantee that the parent won't get to __dentry_kill()
before the child gets through __dentry_kill() completely, and accept
that resurrecting the SMO stuff will require more work.  Smallish patch,
tolerable docs; probably the best option at the moment.

3) Bite the bullet and get shrink_dentry_list() to coexist with
shrink_dcache_for_umount(), with sane ordering of ->d_iput() vs. the
parent's __dentry_kill().  Doable, but AFAICS it will take a counter of
children currently being killed in the parent dentry.
shrink_dentry_list() would bump that on the parent, __dentry_kill() the
victim, then relock the parent and decrement that counter along with the
main refcount (a rough sketch of that follows after the diff below).
That would allow shrink_dcache_for_umount() to cope with that crap.  No
requirements for shrink_dentry_list() callers that way, sane environment
for ->d_iput(), no obstacles for the SMO stuff.  OTOH, we need to find
space for an additional counter in struct dentry; again, doable
(->d_subdirs/->d_child can be converted to hlist, saving us a pointer in
each dentry), but...  I'd leave that option alone until something that
needs it shows up (e.g. if/when Tobin resurrects his patchset).

My preference would be (2) for the coming cycle + a prototype of a patch
doing (3) on top of that for the future.

Completely untested diff for (2) (on top of #work.dcache, sans the
documentation update) below:

diff --git a/fs/dcache.c b/fs/dcache.c
index ccf41c5ee804..c978207f3fc4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1086,10 +1086,27 @@ void d_prune_aliases(struct inode *inode)
 }
 EXPORT_SYMBOL(d_prune_aliases);
 
+static inline void shrink_kill(struct dentry *victim, struct list_head *list)
+{
+	struct dentry *parent = victim->d_parent;
+
+	__dentry_kill(victim);
+
+	if (parent == victim || lockref_put_or_lock(&parent->d_lockref))
+		return;
+
+	if (!WARN_ON_ONCE(parent->d_lockref.count != 1)) {
+		parent->d_lockref.count = 0;
+		to_shrink_list(parent, list);
+	}
+	spin_unlock(&parent->d_lock);
+}
+
+
 void shrink_dentry_list(struct list_head *list)
 {
 	while (!list_empty(list)) {
-		struct dentry *dentry, *parent;
+		struct dentry *dentry;
 
 		dentry = list_entry(list->prev, struct dentry, d_lru);
 		spin_lock(&dentry->d_lock);
@@ -1106,10 +1123,7 @@ void shrink_dentry_list(struct list_head *list)
 		}
 		rcu_read_unlock();
 		d_shrink_del(dentry);
-		parent = dentry->d_parent;
-		if (parent != dentry && !--parent->d_lockref.count)
-			to_shrink_list(parent, list);
-		__dentry_kill(dentry);
+		shrink_kill(dentry, list);
 	}
 }
 
@@ -1537,19 +1551,14 @@ void shrink_dcache_parent(struct dentry *parent)
 			break;
 		data.victim = NULL;
 		d_walk(parent, &data, select_collect2);
-		if (data.victim) {
-			struct dentry *parent;
+		if (data.victim) {	// rcu_read_lock held - see select_collect2()
 			spin_lock(&data.victim->d_lock);
 			if (!lock_for_kill(data.victim)) {
 				spin_unlock(&data.victim->d_lock);
 				rcu_read_unlock();
 			} else {
 				rcu_read_unlock();
-				parent = data.victim->d_parent;
-				if (parent != data.victim &&
-				    !--parent->d_lockref.count)
-					to_shrink_list(parent, &data.dispose);
-				__dentry_kill(data.victim);
+				shrink_kill(data.victim, &data.dispose);
 			}
 		}
 		if (!list_empty(&data.dispose))
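FWIW, purely as an illustration of (3) - not even compile-tested, the
counter's name (->d_kills below) is made up, and it assumes
__dentry_kill() keeps dropping the parent's ->d_lock the way it does
now - the shrink_kill() counterpart could look something like this:

static inline void shrink_kill(struct dentry *victim, struct list_head *list)
{
	struct dentry *parent = victim->d_parent;

	/* hypothetical counter; parent->d_lock is held by lock_for_kill() */
	if (parent != victim)
		parent->d_kills++;

	__dentry_kill(victim);		/* child's ->d_iput() runs in here */

	if (parent == victim)
		return;

	/* relock the parent; drop the kill counter along with the refcount */
	spin_lock(&parent->d_lock);
	parent->d_kills--;
	if (!--parent->d_lockref.count)
		to_shrink_list(parent, list);
	spin_unlock(&parent->d_lock);
}

shrink_dcache_for_umount() would then be able to use that counter to
account for the part of a parent's refcount that comes from children
currently in __dentry_kill(), instead of reporting such a parent as
busy.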