Re: races in ll_splice_alias() and elsewhere (ext4, ocfs2)

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Thu, 10 Mar 2016 23:23:14 +0000

On Thu, Mar 10, 2016 at 09:22:21PM +0000, Drokin, Oleg wrote:
> In fact I was surprised this was not the case already and so people
> needed to call it by hand in every possible case where a new dentry is possibly
> originated leading to a bit of a mess (and some bugs).

Most of the filesystems do not need that at all - just look at the
d_splice_alias() callers.  In mainline we have
	* 22 callers of form return d_splice_alias(inode, dentry); in
some ->lookup() instance.
	* 1 caller in d_add_ci(), all callers of which are in ->lookup()
instances, with return value directly returned to caller of ->lookup().
	* 1 caller in kernfs ->lookup(), with return value directly
returned after releasing a global mutex.
	* ceph - returned in a very convoluted way, but return values is
not dereferenced until it gets passed to caller of ->lookup()
	* gfs2 - passed to caller (or fed to finish_open() when it's in
atomic open) with some unlocks along the way.
	* ocfs2 - broken, and in a way that wouldn't be fixable at dentry
allocation time.
	* cifs - seeding dcache on readdir.  No postprocessing of dentry.
fs/9p/vfs_inode.c:834:  res = d_splice_alias(inode, dentry);
	* fuse - essentially touching a timestamp of dentry, be it new or
old one.  Both on ->lookup() and when seeding dcache on readdir.  Same
as done on successful revalidate, so the worst we are risking is extra work
in ->d_revalidate() for somebody finding it in dcache before we'd touched
the timestamp.
	* nfs - ditto.
	* 9p - the fid we'd obtained is hung on whatever dentry we get.
If revalidate doesn't find a fid, it'll try to get one from server and
hang it on the dentry.

d_obtain_alias() callers make it even more clear - 27 call sites where it's
directly returned, 5 more where it's returned without being dereferenced
and 5 callers in 3 filesystems (ceph, fuse, lustre) where something is
(or ought to be) done.  ceph might benefit from that thing as well - looks
like its users of d_obtain_alias() might be racy wrt d_splice_alias() from
->lookup() picking the alias before ceph_init_dentry() finishes.  Or not -
the call chain in there is convoluted as hell and I might miss the call
of ceph_init_denty() along the way.  ocfs2 might or might not be correct -
it doesn't do any postprocessing after d_obtain_alias(); no idea if it
really needs one there.

So we have two filesystems that could benefit from having ->d_init() -
ceph and lustre.  Note that e.g. ocfs2 wouldn't benefit from that - it needs
a lot more than just allocation *and* it uses ->d_fsdata differently for
positives and negatives.

FWIW, there's a troubling thing in ceph_init_dentry() -
        if (ceph_snap(d_inode(dentry->d_parent)) == CEPH_NOSNAP)
                d_set_d_op(dentry, &ceph_dentry_ops);
        else if (ceph_snap(d_inode(dentry->d_parent)) == CEPH_SNAPDIR)
                d_set_d_op(dentry, &ceph_snapdir_dentry_ops);
        else   
                d_set_d_op(dentry, &ceph_snap_dentry_ops);

_That_ can't be done in __d_alloc() and, what's worse, it doesn't look good
in their __fh_to_dentry() call of ceph_init_dentry().  ->d_parent has
a good chance to be pointing to the dentry itself.  Not sure how much trouble
does it lead to - I'm not familiar with ceph snapshots or ceph reexports over
NFS, let alone the combination...

> The downside is that we do the allocation at all times even if the dentry is
> going to be promptly destroyed because we found a better alias.

The cost of kmalloc() (or kmem_cache_alloc() - it might make sense
to introduce a specialized slab for those beasts) is trivial compared to the
cost of talking to server to find which inode you are going to deal with.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html