On Jul 3, 2016, at 11:08 PM, Al Viro wrote:

> On Sun, Jul 03, 2016 at 08:37:22PM -0400, Oleg Drokin wrote:
>
>> Hm… This dates to sometime in 2006 and my memory is a bit hazy here.
>>
>> I think when we called into the open, it went into the fifo open and got stuck there
>> waiting for the other opener. Something like that. And we cannot really afford to be
>> stuck there, because we are holding some locks that need to be released in predictable time.
>>
>> This code is actually unreachable now, because the server never returns an open handle
>> for special device nodes anymore (there's a comment about it in the current staging tree,
>> but I guess you are looking at some prior version).
>>
>> I imagine device nodes might have represented a similar risk too, but it did not
>> occur to me to test them separately, and the testsuite does not do it either.
>>
>> Directories do not get stuck when you open them, so they are fine and we can
>> atomically open them too, I guess.
>> Symlinks are handled specially on the server and the open never returns
>> an actual open handle for those, so this path is also unreachable with them.
>
> Hmm... How much does the safety of the client depend upon the correctness of
> the server?

Quite a bit, actually. If you connect to a rogue Lustre server, there are currently
many ways it can crash the client. I suspect this is not unique to Lustre: if, say, an
NFS server started sending directory inodes with duplicated inode numbers or some such,
VFS would not be very happy about such "hardlinked" directories either.
And this is before we even consider that a rogue server can feed you garbage data to
crash your apps (or substitute binaries that do something else entirely).

> BTW, there's a fun issue in ll_revalidate_dentry(): there's nothing to
> promise stability of ->d_parent in there, so uses of d_inode(dentry->d_parent)

Yes, we actually had a discussion about that back in March; we were not the only ones
affected, and I think the conclusion was that dget_parent() is the better way to get at
the parent (I see ext4 has already converted).
I believe you cannot hit it in Lustre right now thanks to Lustre locking magic, but I'll
create a patch to cover this anyway (rough sketch below, before the quoted patch).
Thanks for reminding me about this.

> are not safe.  That's independent from parallel lookups, and it's hard
> to hit, but AFAICS it's not impossible to oops there.
>
> Anyway, for Lustre the analogue of that NFS problem is here:
>         } else if (!it_disposition(it, DISP_LOOKUP_NEG) &&
>                    !it_disposition(it, DISP_OPEN_CREATE)) {
>                 /* With DISP_OPEN_CREATE dentry will be
>                  * instantiated in ll_create_it.
>                  */
>                 LASSERT(!d_inode(*de));
>                 d_instantiate(*de, inode);
>         }

Hm… Do you mean the case where we come in here with a hashed negative dentry and a
positive disposition, and hit the assertion about the inode not being NULL (the dentry
still being negative, basically)?
This one we cannot hit, because negative dentries are protected by a Lustre DLM lock
held on the parent directory. Any create in that parent directory would invalidate the
lock, and once that happens, all the negative dentries under it are killed.
Hmm… does that mean this is dead code? Ah, I guess it is not: if we do a lookup and find
this negative dentry (from two or more threads), and THEN it gets invalidated, our two
threads could both race to instantiate it...
It does look quite hard to hit, but it still looks like a race that could happen.

> AFAICS, this (on top of mainline) ought to work:

Thanks, I'll give this a try.
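On the dget_parent() point above, this is roughly the pattern I have in mind for that
follow-up. It is only a sketch to illustrate the idea -- the function name and body are
made up, not actual Lustre code:

	/* Sketch only: take a counted reference to the parent with
	 * dget_parent() instead of dereferencing dentry->d_parent directly.
	 * The parent dentry we get back is pinned, so it cannot be freed
	 * under us even if a concurrent rename moves the child elsewhere.
	 */
	static int example_revalidate(struct dentry *dentry, unsigned int flags)
	{
		struct dentry *parent;
		struct inode *dir;
		int rc = 1;

		if (flags & LOOKUP_RCU)
			return -ECHILD;		/* punt to ref-walk mode */

		parent = dget_parent(dentry);	/* stable parent reference */
		dir = d_inode(parent);

		/* ... validate "dentry" against the parent directory "dir" ... */
		if (!dir)
			rc = 0;			/* should not happen; be safe */

		dput(parent);			/* drop the reference we took */
		return rc;
	}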
> 
> diff --git a/drivers/staging/lustre/lustre/llite/namei.c b/drivers/staging/lustre/lustre/llite/namei.c
> index 5eba0eb..b8da5b4 100644
> --- a/drivers/staging/lustre/lustre/llite/namei.c
> +++ b/drivers/staging/lustre/lustre/llite/namei.c
> @@ -581,9 +581,11 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
>  			  struct file *file, unsigned open_flags,
>  			  umode_t mode, int *opened)
>  {
> +	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>  	struct lookup_intent *it;
>  	struct dentry *de;
>  	long long lookup_flags = LOOKUP_OPEN;
> +	bool switched = false;
>  	int rc = 0;
>  
>  	CDEBUG(D_VFSTRACE, "VFS Op:name=%pd, dir="DFID"(%p),file %p,open_flags %x,mode %x opened %d\n",
> @@ -603,11 +605,28 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
>  	it->it_flags = (open_flags & ~O_ACCMODE) | OPEN_FMODE(open_flags);
>  
>  	/* Dentry added to dcache tree in ll_lookup_it */
> +	if (!(open_flags & O_CREAT) && !d_unhashed(dentry)) {
> +		d_drop(dentry);
> +		switched = true;
> +		dentry = d_alloc_parallel(dentry->d_parent,
> +					  &dentry->d_name, &wq);
> +		if (IS_ERR(dentry)) {
> +			rc = PTR_ERR(dentry);
> +			goto out_release;
> +		}
> +		if (unlikely(!d_in_lookup(dentry))) {
> +			rc = finish_no_open(file, dentry);
> +			goto out_release;
> +		}
> +	}
> +
>  	de = ll_lookup_it(dir, dentry, it, lookup_flags);
>  	if (IS_ERR(de))
>  		rc = PTR_ERR(de);
>  	else if (de)
>  		dentry = de;
> +	else if (switched)
> +		de = dget(dentry);
>  
>  	if (!rc) {
>  		if (it_disposition(it, DISP_OPEN_CREATE)) {
> @@ -648,6 +667,10 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
>  	}
>  
>  out_release:
> +	if (unlikely(switched)) {
> +		d_lookup_done(dentry);
> +		dput(dentry);
> +	}
>  	ll_intent_release(it);
>  	kfree(it);
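
Just so I am sure I read the new machinery right (this is my understanding of
d_alloc_parallel(), not something the patch states explicitly): for a plain open without
O_CREAT we drop the dentry we were handed and switch to an in-lookup dentry; if somebody
else already completed the lookup for that name (d_in_lookup() returns false), we just
finish_no_open() with their result; otherwise we perform the lookup ourselves, and
d_lookup_done() in out_release ends the in-lookup state and wakes up anybody waiting on
the same name. If that is right, it is exactly the kind of serialization that would
prevent the two-threads-instantiating-the-same-negative-dentry race I described above.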