[PATCHES][RFC][CFT] ->d_revalidate() calling conventions changes (->d_parent/->d_name stability problems)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[this had been more than a year in making; my apologies for getting sidetracked
several times]

	Locking rules for dentry->d_name and dentry->d_parent are seriously
unpleasant and historically we had quite a few races in that area.  Filesystem
methods mostly don't have to care about that since the locking in VFS callers
is enough to guarantee that dentries passed to the methods won't be renamed
or moved behind the method's back.

	Directory-modifying methods (->mknod(), ->mkdir(), ->unlink(), ->rename(),
->link(), etc.) are guaranteed that their dentry arguments will have names and
parents unchanging through the method exectuction.  For ->lookup() the warranties
are slightly weaker (they disappear once an inode has been attached to dentry),
but they still cover most of the execution.  As the result, all these methods
can safely access ->d_parent and ->d_name.

	->d_revalidate() is an exception - the caller might be holding no locks
whatsoever and both the name and parent may be changing right under us.
Locally you can hold dentry->d_lock - that'll stabilize both ->d_parent and
->d_name, but you obviously can't hold that over any IO and as soon as you drop
->d_lock, you are on your own.

	There is a rather convoluted dance needed to get a safe reference
to parent -
	if not in RCU mode
		parent = dget_parent(dentry)
		dir = d_inode(parent)
	else
		parent = READ_ONCE(dentry->d_parent)
		dir = d_inode_rcu(parent)
		if (!dir)
			return -ECHILD
	<do actual work>
	if not in RCU mode
		dput(parent)
and it's duplicated in a bunch of instances (not all of them - quite a few
->d_revalidate() instances do not care about the parent *or* name in the
first place).  For names... you can safely access that under ->d_lock
(including copying it someplace safe) or you can use take_dentry_name_snapshot(),
but blind dereferencing of ->d_name.name is really asking for trouble -
for a long name you might end up accessing freed memory.

	An obvious improvement would be to pass safe references to parent
and name as explicit arguments of ->d_revalidate().  Examining the in-tree
instances shows that we have 4 groups:
	1) really don't care about parent at all: hfs, jfs, all procfs ones,
tracefs.
	2) want only the inode of parent directory: afs, ceph, exfat and
vfat ones, fscrypt, fuse, gfs2, nfs, ocfs2, orangefs
	3) don't use the parent directly, no help from that calling conventions
change: smb, 9p, vboxsf, coda, kernfs.
	4) really special: ecryptfs, overlayfs.
In other words, passing the parent's inode is more useful than passing its
dentry, ending up with
	int (*d_revalidate)(struct inode *dir, const struct qstr *name,
                            struct dentry *dentry, unsigned int flags);

	That, however, presumes that we can get these stable references in
the callers without a serious overhead.  Thankfully, there are only 3 callers
of ->d_revalidate() in the entire tree.  The regular one is in
fs/namei.c:d_revalidate() and that's what the pathname resolution is using.
Additionally, ecryptfs and overlayfs instances of ->d_revalidate() may
want to call that method for dentries in underlying filesystems.

	fs/namei.c:d_revalidate() callers already have stable references
to parent and name - we are calling that right after we'd found our dentry
in dcache and we bloody well know which parent/name combination we'd been
looking for.  So in this case no convolutions are needed - we already have
the values of extra arguments for ->d_revalidate().

	In case of ecryptfs and overlayfs deciding to call ->d_revalidate()
for underlying dentries we can just use take_dentry_name_snapshot() on that
underlying dentry to get the stable name and either do the aforementioned
convoluted dance to get a stable reference to parent (in case of overlayfs)
or use the directory underlying the parent of our dentry (in case of ecryptfs).

	That allows to get rid of boilerplate in the instances and allows
to close some actual races wrt ->d_name uses.  The series below attempts to
do just that.  It lives in
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.d_revalidate
itself on top of #work.dcache.

	Individual patches in followups; please, review.

Shortlog (including #work.dcache):
Al Viro (20):
      make sure that DNAME_INLINE_LEN is a multiple of word size
      dcache: back inline names with a struct-wrapped array of unsigned long
      make take_dentry_name_snapshot() lockless
      dissolve external_name.u into separate members
      ext4 fast_commit: make use of name_snapshot primitives
      generic_ci_d_compare(): use shortname_storage
      Pass parent directory inode and expected name to ->d_revalidate()
      afs_d_revalidate(): use stable name and parent inode passed by caller
      ceph_d_revalidate(): use stable parent inode passed by caller
      ceph_d_revalidate(): propagate stable name down into request enconding
      fscrypt_d_revalidate(): use stable parent inode passed by caller
      exfat_d_revalidate(): use stable parent inode passed by caller
      vfat_revalidate{,_ci}(): use stable parent inode passed by caller
      fuse_dentry_revalidate(): use stable parent inode and name passed by caller
      gfs2_drevalidate(): use stable parent inode and name passed by caller
      nfs{,4}_lookup_validate(): use stable parent inode passed by caller
      nfs: fix ->d_revalidate() UAF on ->d_name accesses
      ocfs2_dentry_revalidate(): use stable parent inode and name passed by caller
      orangefs_d_revalidate(): use stable parent inode and name passed by caller
      9p: fix ->rename_sem exclusion

Diffstat (again, including #work.dcache):
 Documentation/filesystems/locking.rst        |  5 +-
 Documentation/filesystems/porting.rst        | 13 ++++
 Documentation/filesystems/vfs.rst            | 22 ++++++-
 fs/9p/v9fs.h                                 |  2 +-
 fs/9p/vfs_dentry.c                           | 23 ++++++-
 fs/afs/dir.c                                 | 40 ++++--------
 fs/ceph/dir.c                                | 25 ++------
 fs/ceph/mds_client.c                         |  9 ++-
 fs/ceph/mds_client.h                         |  2 +
 fs/coda/dir.c                                |  3 +-
 fs/crypto/fname.c                            | 22 ++-----
 fs/dcache.c                                  | 96 ++++++++++++++++------------
 fs/ecryptfs/dentry.c                         | 18 ++++--
 fs/exfat/namei.c                             | 11 +---
 fs/ext4/fast_commit.c                        | 29 ++-------
 fs/ext4/fast_commit.h                        |  3 +-
 fs/fat/namei_vfat.c                          | 19 +++---
 fs/fuse/dir.c                                | 14 ++--
 fs/gfs2/dentry.c                             | 31 ++++-----
 fs/hfs/sysdep.c                              |  3 +-
 fs/jfs/namei.c                               |  3 +-
 fs/kernfs/dir.c                              |  3 +-
 fs/libfs.c                                   | 15 +++--
 fs/namei.c                                   | 18 +++---
 fs/nfs/dir.c                                 | 62 ++++++++----------
 fs/nfs/namespace.c                           |  2 +-
 fs/nfs/nfs3proc.c                            |  5 +-
 fs/nfs/nfs4proc.c                            | 20 +++---
 fs/nfs/proc.c                                |  6 +-
 fs/ocfs2/dcache.c                            | 14 ++--
 fs/orangefs/dcache.c                         | 20 +++---
 fs/overlayfs/super.c                         | 22 ++++++-
 fs/proc/base.c                               |  6 +-
 fs/proc/fd.c                                 |  3 +-
 fs/proc/generic.c                            |  6 +-
 fs/proc/proc_sysctl.c                        |  3 +-
 fs/smb/client/dir.c                          |  3 +-
 fs/tracefs/inode.c                           |  3 +-
 fs/vboxsf/dir.c                              |  3 +-
 include/linux/dcache.h                       | 22 +++++--
 include/linux/fscrypt.h                      |  7 +-
 include/linux/nfs_xdr.h                      |  2 +-
 tools/testing/selftests/bpf/progs/find_vma.c |  2 +-
 43 files changed, 336 insertions(+), 304 deletions(-)

	Overview (#work.dcache is the first 6 commits in there):

Part 1: hopefully cheaper take_dentry_name_snapshot() and handling of inline
(short) names.  One surprising thing was that gcc __builtin_memcpy() does
*not* make use of the alignment information; turns out that it's better to
wrap the entire short name into an object that can be copied by assignment.

01/20)   make sure that DNAME_INLINE_LEN is a multiple of word size
	Linus' suggestion to define the size of shortname in terms of
unsigned long words and derive its size in bytes from that.  Cleaner
that way.
02/20)   dcache: back inline names with a struct-wrapped array of unsigned long
	... so that they can be copied with struct assignment (which
generates better code) and accessed word-by-word.
	The type is union shortname_storage; it's a union of arrays of
unsigned char and unsigned long.
	struct name_snapshot.inline_name turned into union
shortname_storage; users (all in fs/dcache.c) adjusted.
	struct dentry.d_iname has some users outside of fs/dcache.c;
to reduce the amount of noise in commit, it is replaced with union
shortname_storage d_shortname and d_iname is turned into a macro that
expands to d_shortname.string (similar to d_lock handling)
03/20)   make take_dentry_name_snapshot() lockless
	Use ->d_seq instead of grabbing ->d_lock; in case of shortname
dentries that avoids any stores to shared data objects and in case of
long names we are down to (unavoidable) atomic_inc on the external_name
refcount.  Makes the thing safer as well - the areas where ->d_seq is held
odd are all nested inside the areas where ->d_lock is held, and the latter
are much more numerous.  NOTE: now that there is a lockless path where
we might try to grab a reference to an already doomed external_name
instance, it is no longer possible for external_name.u.count and
external_name.u.head to share space (kudos to Linus for spotting that).
To reduce the noice this commit just make external_name.u a struct
(instead of union); the next commit will dissolve it.
04/20)   dissolve external_name.u into separate members
	kept separate from the previous commit to keep the noise separate
from actual changes...
05/20)   ext4 fast_commit: make use of name_snapshot primitives
	... rather than open-coding them.  As a bonus, that avoids the
pointless work with extra allocations, etc. for long names.
06/20)   generic_ci_d_compare(): use shortname_storage
	... and check the "name might be unstable" predicate the right way.

Part 2: ->d_revalidate() calling conventions change.  The first commit
adds the method prototype and has the extra arguments supplied by the callers;
making use of those extra arguments is done in followup patches, so that
they can be reviewed separately.

07/20)   Pass parent directory inode and expected name to ->d_revalidate()
	->d_revalidate() often needs to access dentry parent and name;
that has to be done carefully, since the locking environment varies from
caller to caller.  We are not guaranteed that dentry in question will
not be moved right under us - not unless the filesystem is such that
nothing on it ever gets renamed.
	It can be dealt with, but that results in boilerplate code that
isn't even needed - the callers normally have just found the dentry
via dcache lookup and want to verify that it's in the right place; they
already have the values of ->d_parent and ->d_name stable.  There is a
couple of exceptions (overlayfs and, to less extent, ecryptfs), but for
the majority of calls that song and dance is not needed at all.
	It's easier to make ecryptfs and overlayfs find and pass those
values if there's a ->d_revalidate() instance to be called, rather than
doing that in the instances.
	This commit only changes the calling conventions; making use of
supplied values is left to followups.
	NOTE: some instances need more than just the parent - things like
CIFS may need to build an entire path from filesystem root, so they need
more precautions than the usual boilerplate.  This series doesn't
do anything to that need - these filesystems have to keep their locking
mechanisms (rename_lock loops, use of dentry_path_raw(), private rwsem
a-la v9fs).


Part 3: making use of the new arguments - getting rid of parent-obtaining
boilerplate in the instances and getting rid of racy uses of ->d_name in
some of those.  Split on per-filesystem basis.

08/20)   afs_d_revalidate(): use stable name and parent inode passed by caller
09/20)   ceph_d_revalidate(): use stable parent inode passed by caller
10/20)   ceph_d_revalidate(): propagate stable name down into request enconding
11/20)   fscrypt_d_revalidate(): use stable parent inode passed by caller
12/20)   exfat_d_revalidate(): use stable parent inode passed by caller
13/20)   vfat_revalidate{,_ci}(): use stable parent inode passed by caller
14/20)   fuse_dentry_revalidate(): use stable parent inode and name passed by caller
15/20)   gfs2_drevalidate(): use stable parent inode and name passed by caller
16/20)   nfs{,4}_lookup_validate(): use stable parent inode passed by caller
17/20)   nfs: fix ->d_revalidate() UAF on ->d_name accesses
18/20)   ocfs2_dentry_revalidate(): use stable parent inode and name passed by caller
19/20)   orangefs_d_revalidate(): use stable parent inode and name passed by caller


Part 4: dealing with races in access to ancestors' names/parents.  Changing the
calling conventions above helps with the name and parent of dentry being revalidated;
it doesn't do anything for its ancestors and some instances want to deal with the
entire path to filesystem root.  The way it's done varies from filesystem to
filesystem and often isn't limited to ->d_revalidate().  There is a safe helper
(dentry_path()) and e.g. smb avoids the races by using it.  9p and ceph do not,
for various reasons, and both have problems.  I've done a 9p fix (see below);
ceph one is trickier and I'd prefer to discuss it with ceph folks first.

20/20)   9p: fix ->rename_sem exclusion
	9p wants to be able to build a path from given dentry to fs root
and keep it valid over a blocking operation.
	->s_vfs_rename_mutex would be a natural candidate, but there
are places where we want that and where we have no way to tell if
->s_vfs_rename_mutex has already been taken deeper in callchain.
Moreover, it's only held for cross-directory renames; name changes within
the same directory happen without touching that lock.
Current mainline solution:
	* have d_move() done in ->rename() rather than in its caller
	* maintain a 9p-private rwsem (->rename_sem, per-filesystem)
	* hold it exclusive over the relevant part of ->rename()
	* hold it shared over the places where we want the path.
That almost works.  FS_RENAME_DOES_D_MOVE is enough to put all d_move()
and d_exchange() calls under filesystem's control.  However, there's
also __d_unalias(), which isn't covered by any of that.
	If ->lookup() hits a directory inode with preexisting dentry
elsewhere (due to e.g. rename done on server behind our back),
d_splice_alias() called by ->lookup() will move/rename that alias.
	One approach to fixing that would be to add a couple of optional
methods, so that __d_unalias() would do
	if alias->d_op->d_unalias_trylock != NULL
		if (!alias->d_op->d_unalias_trylock(alias))
			fail (resulting in -ESTALE from lookup)
	__d_move(...)
	if alias->d_op->d_unalias_unlock != NULL
		alias->d_unalias_unlock(alias)
where it currently does __d_move().  9p instances would be down_write_trylock()
and up_write() of ->rename_sem.
	However, to reduce dentry_operations bloat, let's add one method
instead - ->d_want_unalias(alias, true) instead of ->d_unalias_trylock(alias)
and ->d_want_unalias(alias, false) instead of ->d_unalias_unlock(alias).
	Another possible variant would be to hold ->rename_sem exclusive
around d_splice_alias() calls in 9p ->lookup(), but that would cause a lot
of contention on that rwsem (every lookup rather than only ones that end
up with __d_unalias()) and rwsem is filesystem-wide.  Let's not go there.




[Index of Archives]     [CEPH Users]     [Ceph Large]     [Ceph Dev]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux