[RFC] d_name/d_parent stability rules

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Fri, 13 Dec 2024 08:00:23 +0000

FWIW, I went through the tree again, trying to prove the locking
rules for dentry->d_name and dentry->d_parent.  It took about 5 hours
and left me seriously wondering about the toolchain assistance
possible in that area - repeating that kind of audit every once
in a while is _not_ pleasant.

Anyway, results:

* only positive dentries can be moved.

* all changes are under write_seqlock(&rename_lock)

* they are also under write_seqcount_begin(&dentry->d_seq)

* ->d_lock is held on all dentries involved (dentry itself, its old
parent, its new parent)

* if dentry used to be attached, ->s_vfs_rename_mutex is held over
any ->d_parent change

* all changes are under ->i_rwsem of parent directory's inode (held at
least shared)

* if ->d_parent is changed, ->i_rwsem is held both for old and for new
parent directories (again, at least shared)

* the difference between weak and strong locking environments (parent(s)
held shared and exclusive resp.) is interesting - name/parent changes under
weak environment _mostly_ happen to directories.  That happens when lookup
finds a subdirectory that has a preexisting alias elsewhere - either due
to rename done on server in case of remote filesystem or due to corrupted
fs image.  There are filesystems where it can't happen at all (e.g. anything
with tree entirely in dcache), but filesystem being read-only is *not*
enough to guarantee it won't happen - for corrupted isofs image it is possible.

* there's a rare case when weak-environment renames can happen to non-directories:
vfat and exfat lookups finding a differently capitalized alias with the same
parent will rename it.  Parent is held only shared in that case and the object
is a non-directory.

For an example of the way that plays out, consider e.g. d_exact_alias().
We rely upon the aliases not being able to move while we are looking
through them.  The (sole) caller is holding the parent of our dentry
locked (possibly shared), the inode is a not a directory *and* it's
not on vfat or exfat (nfs, actually).  It means that all moves of its
aliases will be in strong environment.  We are holding the parent of our
dentry shared, so while ->d_parent of aliases is not stable, result of its
comparison with our ->d_parent is, so the logics that skips aliases with
the wrong parent is safe.  Aliases with the right parent are guaranteed
to have ->d_name and ->d_parent stable, so d_exact_alias() doesn't need
to bother with ->d_lock on aliases for name comparisons.

In case it's not obvious, I'm not fond of that one, especially since the
proof that parent is held at least shared is not pleasant: the call tree
is
_nfs4_open_and_get_state()
        <- _nfs4_do_open()
                <- nfs4_do_open()
                        <- nfs4_atomic_open()
                                == nfs_rpc_ops:open_context
                                        <- nfs_atomic_open()
                                                == ->atomic_open
                                        <- nfs4_file_open()
                                                == ->open
                        <- nfs4_proc_create()
                                == nfs_rpc_ops:create
                                        <- nfs_do_create()
                                                <- nfs_create()
                                                        == ->create
                                                <- nfs_atomic_open_v23(), with O_CREAT
                                                        == ->atomic_open
and while both ->create() and ->atomic_open() do have the parent held
at least shared, ->open() does *not*.  But in case of ->open() we
are guaranteed that dentry had been positive to start with, and the
call of d_exact_alias() in _nfs4_open_get_state() can happen only
if dentry is negative.  Since positive dentry can be made negative
only by the holder of the sole reference to it and since we are
holding a reference and do *not* hit it with d_delete(), it can't
become negative on that codepath and thus d_exact_alias() call in
there is guaranteed to have the parent held at least shared.

IMO that's awful.  Granted, this is one of the... highlights of that
code audit (the other one is in ceph, with its call graph from hell),
but...

I don't know what can be done to make those audits automated.
A part of that can be helped by making d_name const, aliasing it
with non-const struct qstr member - that, at least, helps with
proving that ->d_name is not modified outside of __d_move();
doing that manually is seriously time-consuming (grep and you'll
see).  But that's obviously a nasal daemon country.  Alternatively,
we could define d_name(dentry) returning const struct qstr *, and
have that used instead of direct accesses to ->d_name.  Conversion
will be a lot of noise, but it doesn't have to be done as a single
commit...

What we really need is some way to express invariants like
"->mkdir(_, dir, dentry, _) has dir held exclusive and
dentry->d_parent->d_inode == dir".  That's one area where things
like Rust could be useful - having the locking warranties expressed
via type system would be nice, but then the things like "call of
d_exact_alias() in _nfs4_open_and_get_state(opendata, _) happens only
with opendata->dentry->d_parent->d_inode held at least shared" are
probably beyond any help - having type system do an equivalent of
proof above is not realistic ;-/