On Wed, 2022-07-06 at 15:10 +1000, NeilBrown wrote: > NFS unlink() must determine if the file is open, and must perform a > "silly rename" instead of an unlink if it is. Otherwise the client > might hold a file open which has been removed on the server. > > Consequently if it determines that the file isn't open, it must block > any subsequent opens until the unlink has been completed on the server. > > This is currently achieved by unhashing the dentry. This forces any > open attempt to the slow-path for lookup which will block on i_sem on > the directory until the unlink completes. A proposed patch will change > the VFS to only get a shared lock on i_sem for unlink, so this will no > longer work. > > Instead we introduce an explicit interlock. A flag is set on the dentry > while the unlink is running and ->d_revalidate blocks while that flag is > set. This closes the race without requiring exclusion on i_sem. > unlink will still have exclusion on the dentry being unlinked, so it > will be safe to set and then clear the flag without any risk of another > thread touching the flag. > > There is little room for adding new dentry flags, so instead of adding a > new flag, we overload an existing flag which is not used by NFS. > > DCACHE_DONTCACHE is only set for filesystems which call > d_mark_dontcache() and NFS never calls this, so it is currently unused > in NFS. > DCACHE_DONTCACHE is only tested when the last reference on a dentry has > been dropped, so it is safe for NFS to set and then clear the flag while > holding a reference - the setting of the flag cannot cause a > misunderstanding. > > So we define DCACHE_NFS_PENDING_UNLINK as an alias for DCACHE_DONTCACHE > and add a definition to nfs_fs.h so that if NFS ever does find a need to > call d_mark_dontcache() the build will fail with a suitable error. > > Signed-off-by: NeilBrown <neilb@xxxxxxx> > --- > > Hi Trond/Anna, > this patch is a precursor for my parallel-directory-updates patch set. > I would be particularly helpful if this (and the nfsd patches I > recently sent) could land for the next merge window. Then I could post > a substantially reduced series to implement parallel directory > updates, which would then be easier for other to review. > > Thanks, > NeilBrown > > > fs/nfs/dir.c | 23 ++++++++++++++++------- > include/linux/nfs_fs.h | 14 ++++++++++++++ > 2 files changed, 30 insertions(+), 7 deletions(-) > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c > index 0c4e8dd6aa96..695bb057cbd2 100644 > --- a/fs/nfs/dir.c > +++ b/fs/nfs/dir.c > @@ -1778,6 +1778,8 @@ __nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags, > int ret; > > if (flags & LOOKUP_RCU) { > + if (dentry->d_flags & DCACHE_NFS_PENDING_UNLINK) > + return -ECHILD; > parent = READ_ONCE(dentry->d_parent); > dir = d_inode_rcu(parent); > if (!dir) > @@ -1786,6 +1788,9 @@ __nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags, > if (parent != READ_ONCE(dentry->d_parent)) > return -ECHILD; > } else { > + /* Wait for unlink to complete */ > + wait_var_event(&dentry->d_flags, > + !(dentry->d_flags & DCACHE_NFS_PENDING_UNLINK)); > parent = dget_parent(dentry); > ret = reval(d_inode(parent), dentry, flags); > dput(parent); > @@ -2454,7 +2459,6 @@ static int nfs_safe_remove(struct dentry *dentry) > int nfs_unlink(struct inode *dir, struct dentry *dentry) > { > int error; > - int need_rehash = 0; > > dfprintk(VFS, "NFS: unlink(%s/%lu, %pd)\n", dir->i_sb->s_id, > dir->i_ino, dentry); > @@ -2469,15 +2473,20 @@ int nfs_unlink(struct inode *dir, struct dentry *dentry) > error = nfs_sillyrename(dir, dentry); > goto out; > } > - if (!d_unhashed(dentry)) { > - __d_drop(dentry); > - need_rehash = 1; > - } > + /* We must prevent any concurrent open until the unlink > + * completes. ->d_revalidate will wait for DCACHE_NFS_PENDING_UNLINK > + * to clear. We set it here to ensure no lookup succeeds until > + * the unlink is complete on the server. > + */ > + dentry->d_flags |= DCACHE_NFS_PENDING_UNLINK; > + > spin_unlock(&dentry->d_lock); > error = nfs_safe_remove(dentry); > nfs_dentry_remove_handle_error(dir, dentry, error); > - if (need_rehash) > - d_rehash(dentry); > + spin_lock(&dentry->d_lock); > + dentry->d_flags &= ~DCACHE_NFS_PENDING_UNLINK; > + spin_unlock(&dentry->d_lock); > + wake_up_var(&dentry->d_flags); > out: > trace_nfs_unlink_exit(dir, dentry, error); > return error; > diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h > index a17c337dbdf1..041a6076e045 100644 > --- a/include/linux/nfs_fs.h > +++ b/include/linux/nfs_fs.h > @@ -617,6 +617,20 @@ nfs_fileid_to_ino_t(u64 fileid) > > #define NFS_JUKEBOX_RETRY_TIME (5 * HZ) > > +/* We need to block new opens while a file is being unlinked. > + * If it is opened *before* we decide to unlink, we will silly-rename > + * instead. If it is opened *after*, then we the open to fail unless it creates "then we allow the open to fail" > + * a new file. > + * If we allow the open and unlink to race, we could end up with a file that is > + * open but deleted on the server resulting in ESTALE. > + * So overload DCACHE_DONTCACHE to record when the unlink is happening > + * and block dentry revalidation while it is set. > + * DCACHE_DONTCACHE is only used by filesystems which call d_mark_dontcache() > + * which NFS never calls. It is only tested on a dentry on which all references > + * have been dropped, so it is safe for NFS to set it while holding a reference. > + */ > +#define DCACHE_NFS_PENDING_UNLINK DCACHE_DONTCACHE > +#define d_mark_dontcache(i) BUILD_BUG_ON_MSG(1, "NFS cannot use d_mark_dontcache()") > > # undef ifdebug > # ifdef NFS_DEBUG Wow, we really are out of dentry flags. I wonder if some of them are no longer needed? This overloading is a bit klunky but it's probably OK. AFAICT, 0x80000000 is still available though if this turns out to be too nasty. It looks like 0x08000000 may also be free Reviewed-by: Jeff Layton <jlayton@xxxxxxxxxx>