On Tue, Oct 31, 2023 at 12:18:48AM +0000, Al Viro wrote:
> On Mon, Oct 30, 2023 at 12:18:28PM -1000, Linus Torvalds wrote:
> > On Mon, 30 Oct 2023 at 11:53, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > After fixing a couple of brainos, it seems to work.
> >
> > This all makes me unnaturally nervous, probably because it's overly
> > subtle, and I have lost the context for some of the rules.
>
> A bit of context: I started to look at the possibility of refcount overflows.
> Writing the current rules for dentry refcounting and lifetime down was the
> obvious first step, and that immediately turned into an awful mess.
>
> It is overly subtle.

Another piece of too subtle shite: ordering of ->d_iput() of child and
__dentry_kill() of parent.  As it is, in some cases it is possible for
the latter to happen before the former.  It is *not* possible in the cases
when in-tree ->d_iput() instances actually look at the parent (all of
those are due to sillyrename stuff), but the proof is convoluted and very
brittle.

The origin of that mess is in the interaction of shrink_dcache_for_umount()
with shrink_dentry_list().  What we want to avoid is a directory looking
like it's busy since shrink_dcache_for_umount() doesn't see any children
to account for positive refcount of parent.  The kinda-sorta solution we
use is to decrement the parent's refcount *before* __dentry_kill() of
child and put said parent into a shrink list.  That makes
shrink_dcache_for_umount() do the right thing, but it's possible to end
up with parent freed before the child is done with; the scenario is
non-obvious and rather hard to hit, but it's not impossible.

dput() does no such thing - it does not decrement the parent's refcount
until the child has been taken care of.  That's fine, as far as
shrink_dcache_for_umount() is concerned - this is not a false positive;
with slightly different timing shrink_dcache_for_umount() would've
reported the child as being busy.
IOW, there should be no overlap between dput() in one thread and
shrink_dcache_for_umount() in another.  Unfortunately, memory eviction
*can* come in the middle of shrink_dcache_for_umount().

Life would be much simpler if shrink_dentry_list() did not have to pull
that kind of trick and used the same ordering as dput() does.  IMO there's
a reasonably cheap way to achieve that:

	* have shrink_dcache_for_umount() mark the superblock (either in
->s_flags or inside the ->s_dentry_lru itself) and have the logic in
retain_dentry() that does insertion into LRU list check ->d_sb for that
mark, treating its presence as "do not retain".
	* after marking the superblock shrink_dcache_for_umount() is
guaranteed that nothing new will be added to the shrink list in question.
Have it call shrink_dcache_sb() to drain LRU.
	* Now shrink_dentry_list() in one thread hitting a dentry on
a superblock going through shrink_dcache_for_umount() in another thread is
always a bug and reporting busy dentries is the right thing to do.
So we can switch shrink_dentry_list() to the same "drop reference to
parent only after the child had been killed" ordering as we have in
dput().

IMO that removes a fairly nasty trap for ->d_iput() and ->d_delete()
instances.  As for the overhead, the relevant fragment of retain_dentry() is

	if (unlikely(!(dentry->d_flags & DCACHE_LRU_LIST)))
		d_lru_add(dentry);
	else if (unlikely(!(dentry->d_flags & DCACHE_REFERENCED)))
		dentry->d_flags |= DCACHE_REFERENCED;
	return true;

That would become

	if (unlikely(!(dentry->d_flags & DCACHE_LRU_LIST))) {
		if (unlikely(dentry->d_sb is marked))
			return false;
		d_lru_add(dentry);
	} else if (unlikely(!(dentry->d_flags & DCACHE_REFERENCED)))
		dentry->d_flags |= DCACHE_REFERENCED;
	return true;

Note that d_lru_add() will hit ->d_sb->s_dentry_lru, so we are not adding
memory traffic here; the else if part doesn't need to be touched - we only
need to prevent insertions into LRU.

Comments?
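
PS: to make the "dentry->d_sb is marked" pseudocode above concrete, here
is a minimal userspace sketch of the modified check.  It is an
illustration only, not a patch: the structures are cut down to the fields
the fragment touches, the flag values are made up, SB_UMOUNT_SHRINK is a
hypothetical name for the mark (the mail leaves open whether it would
live in ->s_flags or in ->s_dentry_lru), retain_dentry_tail() is a
hypothetical name for the tail of retain_dentry(), and the s_nr_lru
counter merely stands in for the real LRU list.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values; the kernel's real flag values differ. */
#define DCACHE_REFERENCED	0x1u
#define DCACHE_LRU_LIST		0x2u
/* Hypothetical mark set by shrink_dcache_for_umount(). */
#define SB_UMOUNT_SHRINK	0x1u

struct super_block {
	unsigned int s_flags;
	long s_nr_lru;		/* stand-in for ->s_dentry_lru */
};

struct dentry {
	unsigned int d_flags;
	struct super_block *d_sb;
};

/* Stand-in for d_lru_add(): touches the superblock's LRU state. */
static void d_lru_add(struct dentry *dentry)
{
	assert(!(dentry->d_flags & DCACHE_LRU_LIST));
	dentry->d_flags |= DCACHE_LRU_LIST;
	dentry->d_sb->s_nr_lru++;
}

/*
 * Model of the tail of retain_dentry() with the proposed check:
 * a dentry not yet on the LRU is refused (return false, i.e. "do not
 * retain") once its superblock is marked; a dentry already on the LRU
 * takes the unchanged "else if" path.
 */
static bool retain_dentry_tail(struct dentry *dentry)
{
	if (!(dentry->d_flags & DCACHE_LRU_LIST)) {
		if (dentry->d_sb->s_flags & SB_UMOUNT_SHRINK)
			return false;
		d_lru_add(dentry);
	} else if (!(dentry->d_flags & DCACHE_REFERENCED)) {
		dentry->d_flags |= DCACHE_REFERENCED;
	}
	return true;
}
```

In this model the mark is only consulted on the path that is about to
touch the superblock's LRU state anyway, which is the reason given above
for expecting no extra memory traffic.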