On Tue, 2024-05-28 at 11:19 +1000, NeilBrown wrote:
> On Tue, 28 May 2024, Trond Myklebust wrote:
> > On Mon, 2024-05-27 at 13:04 +1000, NeilBrown wrote:
> > >
> > > dentry->d_fsdata is set to NFS_FSDATA_BLOCKED while unlinking or
> > > renaming-over a file to ensure that no open succeeds while the NFS
> > > operation progresses on the server.
> > >
> > > Setting dentry->d_fsdata to NFS_FSDATA_BLOCKED is done under ->d_lock
> > > after checking the refcount is not elevated.  Any attempt to open the
> > > file (through that name) will go through lookup_open(), which will
> > > take ->d_lock while incrementing the refcount, so we can be sure that
> > > once the new value is set, __nfs_lookup_revalidate() *will* see the
> > > new value and will block.
> > >
> > > We don't have any locking guarantee that when we set ->d_fsdata to
> > > NULL, the wait_var_event() in __nfs_lookup_revalidate() will notice.
> > > wait/wake primitives do NOT provide barriers to guarantee order.  We
> > > must use smp_load_acquire() in wait_var_event() to ensure we look at
> > > an up-to-date value, and must use smp_store_release() before
> > > wake_up_var().
> > >
> > > This patch adds those barrier functions and factors out
> > > block_revalidate() and unblock_revalidate() for clarity.
> > >
> > > There is also a hypothetical bug in that if memory allocation fails
> > > (which never happens in practice) we might leave ->d_fsdata locked.
> > > This patch adds the missing call to unblock_revalidate().
> > >
> > > Reported-and-tested-by: Richard Kojedzinszky <richard+debian+bugreport@xxxxxxxxx>
> > > Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071501
> > > Fixes: 3c59366c207e ("NFS: don't unhash dentry during unlink/rename")
> > > Signed-off-by: NeilBrown <neilb@xxxxxxx>
> > > ---
> > >  fs/nfs/dir.c | 44 +++++++++++++++++++++++++++++---------------
> > >  1 file changed, 29 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > index ac505671efbd..c91dc36d41cc 100644
> > > --- a/fs/nfs/dir.c
> > > +++ b/fs/nfs/dir.c
> > > @@ -1802,9 +1802,10 @@ __nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags,
> > >  		if (parent != READ_ONCE(dentry->d_parent))
> > >  			return -ECHILD;
> > >  	} else {
> > > -		/* Wait for unlink to complete */
> > > +		/* Wait for unlink to complete - see unblock_revalidate() */
> > >  		wait_var_event(&dentry->d_fsdata,
> > > -			       dentry->d_fsdata != NFS_FSDATA_BLOCKED);
> > > +			       smp_load_acquire(&dentry->d_fsdata)
> > > +			       != NFS_FSDATA_BLOCKED);
> >
> > Doesn't this end up being a reversed ACQUIRE+RELEASE as described in
> > the "LOCK ACQUISITION FUNCTIONS" section of
> > Documentation/memory-barriers.txt?
>
> I don't think so.  That section is talking about STORE operations which
> can move backwards through ACQUIRE and forwards through RELEASE.
>
> Above we have a LOAD operation which mustn't move backwards through
> ACQUIRE.  Below there is a STORE operation which mustn't move forwards
> through a RELEASE.  Both of those are guaranteed.

It isn't necessary for the LOAD to move backwards through the ACQUIRE.
As I understand it, the point is that the ACQUIRE can move through the
RELEASE, as per the following paragraph in that document:

    Similarly, the reverse case of a RELEASE followed by an ACQUIRE does
    not imply a full memory barrier.
Therefore, the CPU's execution of the critical sections corresponding
to the RELEASE and the ACQUIRE can cross, so that:

	*A = a;
	RELEASE M
	ACQUIRE N
	*B = b;

could occur as:

	ACQUIRE N, STORE *B, STORE *A, RELEASE M

This would presumably be why the function clear_and_wake_up_bit() needs
a full memory barrier on most architectures, instead of being just an
smp_wmb().

Is my understanding of this wrong?

> >
> > IOW: Shouldn't the above rather be using READ_ONCE()?
> >
> > >  	parent = dget_parent(dentry);
> > >  	ret = reval(d_inode(parent), dentry, flags);
> > >  	dput(parent);
> > > @@ -1817,6 +1818,26 @@ static int nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags)
> > >  	return __nfs_lookup_revalidate(dentry, flags, nfs_do_lookup_revalidate);
> > >  }
> > >
> > > +static void block_revalidate(struct dentry *dentry)
> > > +{
> > > +	/* old devname - just in case */
> > > +	kfree(dentry->d_fsdata);
> > > +
> > > +	/* Any new reference that could lead to an open
> > > +	 * will take ->d_lock in lookup_open() -> d_lookup().
> > > +	 */
> > > +	lockdep_assert_held(&dentry->d_lock);
> > > +
> > > +	dentry->d_fsdata = NULL;
> >
> > Why are you doing a barrier free change to dentry->d_fsdata here when
> > you have the memory barrier protected change in unblock_revalidate()?
>
> Ouch.  This should be
>
>    dentry->d_fsdata = NFS_FSDATA_BLOCKED;
>
> I messed that up when rearranging the code after testing.
>
> This doesn't need a barrier because a spinlock is held.  We check the
> refcount under the spinlock and only proceed if there are no other
> references.  So if __nfs_lookup_revalidate() gets called concurrently,
> it must have got a new reference, and that requires the same spinlock.
> So if it is called after this assignment, the spinlock will provide
> all needed barriers.
> > >
> > > +}
> > > +
> > > +static void unblock_revalidate(struct dentry *dentry)
> > > +{
> > > +	/* store_release ensures wait_var_event() sees the update */
> > > +	smp_store_release(&dentry->d_fsdata, NULL);
> >
> > Shouldn't this be a WRITE_ONCE(), for the same reason as above?
>
> No, for the same reason as above.  WRITE_ONCE() doesn't provide any
> memory barriers, it only avoids compiler optimisations.  Here we really
> need the barrier on some CPUs.
>
> Thanks for the review.
>
> NeilBrown
>
> > > +	wake_up_var(&dentry->d_fsdata);
> > > +}
> > > +
> > >  /*
> > >   * A weaker form of d_revalidate for revalidating just the d_inode(dentry)
> > >   * when we don't really care about the dentry name.  This is called when a
> > > @@ -2501,15 +2522,12 @@ int nfs_unlink(struct inode *dir, struct dentry *dentry)
> > >  		spin_unlock(&dentry->d_lock);
> > >  		goto out;
> > >  	}
> > > -	/* old devname */
> > > -	kfree(dentry->d_fsdata);
> > > -	dentry->d_fsdata = NFS_FSDATA_BLOCKED;
> > > +	block_revalidate(dentry);
> > >
> > >  	spin_unlock(&dentry->d_lock);
> > >  	error = nfs_safe_remove(dentry);
> > >  	nfs_dentry_remove_handle_error(dir, dentry, error);
> > > -	dentry->d_fsdata = NULL;
> > > -	wake_up_var(&dentry->d_fsdata);
> > > +	unblock_revalidate(dentry);
> > >  out:
> > >  	trace_nfs_unlink_exit(dir, dentry, error);
> > >  	return error;
> > > @@ -2616,8 +2634,7 @@ nfs_unblock_rename(struct rpc_task *task, struct nfs_renamedata *data)
> > >  {
> > >  	struct dentry *new_dentry = data->new_dentry;
> > >
> > > -	new_dentry->d_fsdata = NULL;
> > > -	wake_up_var(&new_dentry->d_fsdata);
> > > +	unblock_revalidate(new_dentry);
> > >  }
> > >
> > >  /*
> > > @@ -2679,11 +2696,6 @@ int nfs_rename(struct mnt_idmap *idmap, struct inode *old_dir,
> > >  	if (WARN_ON(new_dentry->d_flags & DCACHE_NFSFS_RENAMED) ||
> > >  	    WARN_ON(new_dentry->d_fsdata == NFS_FSDATA_BLOCKED))
> > >  		goto out;
> > > -	if (new_dentry->d_fsdata) {
> > > -		/* old devname */
> > > -		kfree(new_dentry->d_fsdata);
> > > -		new_dentry->d_fsdata = NULL;
> > > -	}
> > >
> > >  	spin_lock(&new_dentry->d_lock);
> > >  	if (d_count(new_dentry) > 2) {
> > > @@ -2705,7 +2717,7 @@ int nfs_rename(struct mnt_idmap *idmap, struct inode *old_dir,
> > >  			new_dentry = dentry;
> > >  			new_inode = NULL;
> > >  		} else {
> > > -			new_dentry->d_fsdata = NFS_FSDATA_BLOCKED;
> > > +			block_revalidate(new_dentry);
> > >  			must_unblock = true;
> > >  			spin_unlock(&new_dentry->d_lock);
> > >  		}
> > > @@ -2717,6 +2729,8 @@ int nfs_rename(struct mnt_idmap *idmap, struct inode *old_dir,
> > >  	task = nfs_async_rename(old_dir, new_dir, old_dentry, new_dentry,
> > >  				must_unblock ? nfs_unblock_rename : NULL);
> > >  	if (IS_ERR(task)) {
> > > +		if (must_unblock)
> > > +			unblock_revalidate(new_dentry);
> > >  		error = PTR_ERR(task);
> > >  		goto out;
> > >  	}

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx