On Tue, Sep 17, 2024 at 08:44:29AM +0100, Al Viro wrote: > Anyway, I'm half-asleep at the moment and I'd rather leave writing these > rules up until tomorrow. Sorry... [Below are the bits of my notes related to d_name and d_parent, with most of the unprintable parts thrown out and minimal markup added. Probably not all relevant notes are here - this has been culled from a bunch of files sitting around and I might've missed some] NOTE: ->d_parent and ->d_name are by far the worst parts of dentry wrt stability rules. Code audits are not pleasant, to put it mildly. This covers the bits outside of VFS proper - verifying the locking rules is a separate story. A Really Blunt Tool You Should Not Use. ====================================== All changes of ->d_parent are serialized on rename_lock. It's *NOT* something you want the stuff outside of core VFS to touch, though. It's a seqlock, and write_seqlock() on it is limited to fs/dcache.c alone. Reader side is allowed, but it's still not something you want to use lightly - outside of fs/dcache.c, fs/d_path.c and fs/namei.c there are only 3 users (ceph_mdsc_build_path(), nfs_path() and auditsc handle_path()). Don't add more without a discussion on fsdevel and detailed ACKs; it's quite likely that a better solution will be found. With one exception (see the discussion of d_mark_tmpfile() in the end), ->d_name is also stabilized by that. Slightly Less Blunt Tool You Still Should Not Use. ================================================== ->s_vfs_rename_mutex will stabilize ->d_parent. The trouble is, while it's not system-wide like rename_lock, it's fs-wide, so there's a plenty of contention to run into *AND* if you try that while ->i_rwsem is held on some directory in that filesystem, you are fucked - lock_rename() (and rename(2), and...) will deadlock on you. Nothing outside of fs/{namei,dcache}.c touches it directly; there is an indirect use (lock_rename()), but that should be done only around the call of cross-directory rename on _another_ filesystem - overlayfs moving stuff within the writable layer, ecryptfs doing rename on underlying filesystem, nfsd handling rename request from client, etc. Anyone trying to use that hammer without a good reason will be very sorry - that's a promise. The pain will begin with the request to adjust the proof of deadlock avoidance in directory locking and it will only go downwards from there... Parent's ->i_rwsem Held Exclusive. ================================== That stabilizes ->d_parent and ->d_name. To be more precise, holding parent->d_inode->i_rwsem exclusive stabilizes the result of (dentry->d_parent == parent). That is to say, dentry that is a child of parent will remain such and dentry that isn't a child won't become a child. holding parent->d_inode->i_rwsem exclusive stabilizes ->d_name of all children. In other words, if you've locked the parent exclusive and found something to be its child while keeping the parent locked, child will have ->d_parent and ->d_name stable until you unlock the parent. That covers most of the directory-modifying methods - stuff like ->mkdir(), ->unlink(), ->rename(), etc. can access ->d_parent and ->d_name of the dentry argument(s) without any extra locks; the caller is already holding enough. Well, unless you are special (*cough* apparmor *cough*) and feel like dropping and regaining the lock on parent inside your ->mkdir()... Don't do that, please - you might have no renames, but there's a plenty of other headache you can get into that way. Negatives. ========== Negative dentry doesn't change ->d_parent or ->d_name. Of course, that is only worth something if you are guaranteed that it won't become positive under you - if that happens, all bets are off. Holding the inode of parent locked (at least) shared is enough to guarantee that. That takes care of ->lookup() instances - their dentry argument has ->d_parent and ->d_name stable until you make it positive (normally - by d_splice_alias()). Once you've done d_splice_alias(), you'd better be careful with access to those; you won't get hit by concurrent rename() (it locks parent(s) exclusive), but if your inode is a directory and d_splice_alias() *elsewhere* picks the same inode (fs image corruption, network filesystem with rename done from another client behind your back, etc.), you'll see the sucker moved. In practice, d_splice_alias() is the last thing done by most of ->lookup() instances - whatever it has returned gets returned by ->lookup() itself, possibly after freeing some temporary allocations, etc. The rest needs to watch out for accesses to ->d_name and ->d_parent downstream of d_splice_alias() return. Another case where we are guaranteed that dentry is negative and will stay so is ->d_release() - it's called for a dentry that is well on the way to becoming an ex-parrot; it's already marked dead, unhashed and negative. So a ->d_release() instance doesn't have to worry about ->d_name and ->d_parent - both are valid and stable. sprintf(). ========== %pd prints dentry name, safely. %p2d - parent_name/dentry_name, etc. up to %p4d. %pD .. %p4D do the same by file reference. Any time you see pr_warn("Some weird bollocks with %s (%d)\n", dentry->d_name, err); it should've been pr_warn("Some weird bollocks with %pd (%d)\n", dentry, err)... d_path() and friends are there for purpose - don't open-code those without a damn good reason. Checking if one dentry is an ancestor of another. ================================================= Use the damn is_subdir(), don't open-code it. Spinlocks. ========== dentry->d_lock stabilizes ->d_parent and ->d_name (as well as almost everything else about dentry); downside is that it's a spinlock *and* nesting it is not to be attempted easily; you are allowed to lock child while holding lock on parent, but very few places have any business doing that (only 2 outside of VFS - tree-walking in autofs, which might eventually get redone avoiding that and fsnotify_set_children_dentry_flags(), which just might get moved to fs/dcache.c itself; we'll see) Note that "almost everything" includes refcount; that is to say, dget() and dput() will spin if you are holding ->d_lock, so you can't dget() the parent under ->d_lock on child - that's a locking order violation that can easily deadlock on you. dget_parent() does that kind of thing safely, and a look at it might be instructive. Try to open-code something of that sort, and you'll be hurt. Said that, dget_parent() is overused - it has legitimate uses, but more often than not it's the wrong tool. In particular, while you grab a reference to something that was the parent at some point during dget_parent(), it might not be the parent anymore by the time it is returned to caller. Most of the dget_parent() uses are due to bad calling conventions of ->d_revalidate(). When that gets sanitized, those will be gone. The methods that are called with ->d_lock get the protection - that would be ->d_delete() and ->d_prune(). ->d_lock on parent is also sufficient; similar to exclusive ->i_rwsem, parent->d_lock stabilizes (dentry->d_parent == parent) and if child has been observed to have child->d_parent equal to parent after you've locked parent->d_lock, you know that child->d_{parent,name} will remain stable until you unlock the parent. Name Snapshots. =============== There's take_dentry_name_snapshot()/release_dentry_name_snapshot(). That stuff eats about 64 bytes on stack (longer names _are_ handled correctly; no allocations are needed, we can simply grab an extra reference to external name and hold it). Can't be done under ->d_lock, won't do anything about ->d_parent *and* there's nothing to prevent dentry being renamed while you are looking at the name snapshot. Sometimes it's useful... RCU Headaches. ============== See e.g. https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git/commit/?h=fixes.pathwalk-rcu-2&id=8d0a75eba81813cbb00beb73a67783e1cde9982f (NB: ought to repost that) [*] d_mark_tmpfile() is pretty much a delayed bit of constructor. There is a possible intermediate state of dentry ("will be tmpfile one"); dentries in that state are all created in vfs_tmpfile(), get passed to ->tmpfile() where they transition to normal unhashed postive dentries. The reason why name is not set from the very beginning is that at that point we do not know the inumber of inode they are going to get (that becomes known inside ->tmpfile() instance) and we want that inumber seen in their names (for /proc/*/fd/*, basically). Name change is done while holding ->d_lock both on dentry itself and on its parent.