On Mon, Feb 24, 2025 at 01:06:24AM +0000, Al Viro wrote: > Recently I went looking through the ->d_flags locking (long > story, that was a side branch in digging through the issues with > Neil's locking scheme); results are interesting. > > All stores to dentry->d_flags are done by somebody who > * holds a counting reference to dentry or > * has found a counting reference in shared data structures, in > conditions that guarantee that this reference won't go away or > * has found dentry in alias list of some inode (under ->i_lock) > * [fs/dcache.c only] is an LRU walker callback running into that dentry > in LRU list or > * [fs/dcache.c only] is owner of a shrink list running into dentry in that > list or > * [fs/dcache.c only] is a d_walk() callback running into that dentry with zero > refcount (and moving it to shrink list) or > * [fs/dcache.c only] is a d_walk() called by kill_litter_super() running into > that dentry when unpinning theretofore persistent dentries; that can happen > only during filesystem shutdown, when nobody else could be accessing it. > > The above guarantees that dentry won't disappear under us; > another interesting thing is exclusion, and that's where the things > get nasty. Most of the stores to dentry->d_flags are under > dentry->d_lock. There are obvious exceptions on the allocation side > (stores done to dentry that is not visible to anybody else), but aside > of those there are two exceptions. > > One is d_set_d_op(), another - setting DCACHE_PAR_LOOKUP in > d_alloc_parallel(). > > The former sets ->d_op and marks the presense of several methods > (->d_hash(), ->d_compare(), ->d_revalidate(), ->d_weak_revalidate(), > ->d_delete(), ->d_prune() and ->d_real()) in ->d_flags. > > It can't be done more than once and, if the filesystem > has ->s_d_op set in its superblock, it is done by the constructor > (__d_alloc()). That, obviously, falls under the "on allocation side"; > so does another common case - d_alloc_pseudo() setting ->d_op to &anon_ops > if __d_alloc() hadn't set it to ->s_d_op. > > Note that there's no barriers between the stores to ->d_op and > ->d_flags in d_set_d_op(); for allocation time uses that's not a problem - > fetches on another CPU would have to be preceded by finding the dentry > in the first place, and barriers on the insertion into wherever it > had been found would suffice. > > There are other callers of d_set_d_op(), though - one in > simple_lookup() (again, if ->s_d_op is NULL) and the rest are all in > procfs. Those are done with no locking whatsoever and dentry is *not* > entirely invisible. > > It's mostly invisible, though - in all those cases dentry is > * negative, and thus unreachable via the alias list of any inode > * unhashed, and thus can't be found via dcache lookup > * has the only direct reference held by the caller who has been holding > it since it got allocated (i.e. it couldn't have been put into LRU or > shrink lists either). > > That is enough to guarantee that nobody else will be doing stores > to ->d_flags, so our store is safe even without ->d_lock. It is also > fucking ugly and brittle... > > Moreover, the question about ->d_op vs ->d_flags ordering also > needs to be dealt with - unlike the calls at allocation time, insertion > into wherever it had been found does *not* order fetches past both > stores - not if it had been inserted into that wherever before the > call of d_set_d_op(). > > Thankfully, the set of methods present in dentry_operations > ever fed to those late calls of d_set_d_op() is limited: ->d_revalidate, > ->d_delete and, in one case, ->d_compare. > > ->d_delete() is easy - it's called only from one place (retain_dentry()) > and there we have just dropped the last reference to dentry in question. > Since the caller of d_set_d_op() is holding a reference, the usual barriers > on ->d_reflock use are enough. > > ->d_compare() is a bit confusing - dentry it's getting as argument > is not the one whose method it is. We have the parent (already observed > to be positive) and we are checking if this child (with ->d_parent pointing > to parent) matches the name we want to look up. We take parent's > ->d_compare() and give it child dentry, snapshot of child's name and the > name we want to match it against. > We *CAN* get a dentry in the middle of d_set_d_op() passed to > someone's ->d_compare() - d_alloc_parent() does that to check if there's > an in-lookup dentry matching the name we want. But ->d_compare() comes > from parent, and that had been already observed to be positive. Which means > that barriers in __d_set_inode_and_type() (from d_splice_alias()) suffice. > > ->d_revalidate() is only called for dentries that had been found > in dcache hash chains at some point, so there the barrier on insertion into > hash (in __d_add() from d_splice_alias()) is enough. > > That covers d_set_d_op() callers; another exception is > d_alloc_parallel() when it decides to insert a new dentry into in-lookup > hash and marks it with DCACHE_PAR_LOOKUP. Also safe, since dentry is > only visible in the parent's list of children, has positive refcount and > had it all along, so nobody else would try to do a store to ->d_flags > at the same time. > > In case it's not obvious from the above, I'm less than happy with > the entire thing - it may be provably correct, but it's much too brittle. > > If nothing else, d_set_d_op() should be unexported. Do it to Agreed. > a hashed or, worse, a positive dentry and you are asking for serious > trouble. Leaving it as a public API is a really bad idea. > > Something along the lines of d_splice_alias_ops(inode, dentry, ops) > (not exported, until we get a convincing modular user) is worth doing; > all procfs callers of d_set_d_op() follow it with d_splice_alias() pretty > much immediately. And yes, that could be done under ->d_lock, eliminating > that special case from the proof. That sounds great. > > As for the allocation-time uses... We could bloody well calculate > the ->d_flags bits to go along with ->s_d_op and just use that; it's not > just about getting rid of recalculating them for each dentry ever allocated > on the filesystem in question, we could get rid of quite a few always_delete_dentry > users while we are at it. See my reply to your other mail: I'll kill it for pidfs and nsfs. > > Look: ->d_delete == always_delete_dentry (and DCACHE_OP_DELETE to > go with it) is equivalent to DCACHE_DONTCACHE; the only place where we Also mentioned in my other reply: Can you please make the unhashed case really explicit ideally at dentry allocation time. IOW, that there's a flag or some other way of simply identifying a dentry as belonging to an fs that will never hash them? > look at either is retain_dentry(), where we have this: > // ->d_delete() might tell us not to bother, but that requires > // ->d_lock; can't decide without it > if (unlikely(d_flags & DCACHE_OP_DELETE)) { > if (!locked || dentry->d_op->d_delete(dentry)) > return false; > } > > // Explicitly told not to bother > if (unlikely(d_flags & DCACHE_DONTCACHE)) > return false; > The inner if turns into > if (!locked || 1) > return false; > so for those DCACHE_DONTCACHE would be equivalent. And it could be > put into the same "set those bits in all ->d_flags on that fs"; > what's more, simple_lookup() doesn't need to set ->d_op at all - > it can just set DCACHE_DONTCACHE in the unlikely case when it's > not been already set. > > How about something along the lines of > 0) add d_splice_alias_ops(inode, dentry, dops), have procfs switch > to that. > 1) provide set_default_d_op(superblock, dops), use it in place of > assignments to ->s_d_op. Rename ->s_d_op to catch unconverted > filesystems. Tree-wide, entirely mechanical. > 2) split the calculation of d_flags bits into a separate helper, > add ->s_d_flags, have set_default_d_op() calculate and set > that, have __d_alloc() pick ->s_d_op and ->s_d_flags directly. > 3) convert those who wish to move from use of always_delete_dentry > to adding DCACHE_DONTCACHE into ->s_d_flags. For devpts, for > example, that avoids the need of non-NULL ->d_op. > 4) replace d_set_d_op() in simple_lookup() with > if (unlikely(!(dentry->d_flags & DCACHE_DONTCACHE))) { > spin_lock(&dentry->d_lock); > dentry->d_flags |= DCACHE_DONTCACHE; > spin_unlock(&dentry->d_lock); > } > 5) Kill simple_dentry_operations - no users would be left > 6) make d_set_d_op() static in fs/dcache.c Sounds good.