On Thu, 07 Jul 2022, Jeff Layton wrote: > On Wed, 2022-07-06 at 15:10 +1000, NeilBrown wrote: > > NFS unlink() must determine if the file is open, and must perform a > > "silly rename" instead of an unlink if it is. Otherwise the client > > might hold a file open which has been removed on the server. > > > > Consequently if it determines that the file isn't open, it must block > > any subsequent opens until the unlink has been completed on the server. > > > > This is currently achieved by unhashing the dentry. This forces any > > open attempt to the slow-path for lookup which will block on i_sem on > > the directory until the unlink completes. A proposed patch will change > > the VFS to only get a shared lock on i_sem for unlink, so this will no > > longer work. > > > > Instead we introduce an explicit interlock. A flag is set on the dentry > > while the unlink is running and ->d_revalidate blocks while that flag is > > set. This closes the race without requiring exclusion on i_sem. > > unlink will still have exclusion on the dentry being unlinked, so it > > will be safe to set and then clear the flag without any risk of another > > thread touching the flag. > > > > There is little room for adding new dentry flags, so instead of adding a > > new flag, we overload an existing flag which is not used by NFS. > > > > DCACHE_DONTCACHE is only set for filesystems which call > > d_mark_dontcache() and NFS never calls this, so it is currently unused > > in NFS. > > DCACHE_DONTCACHE is only tested when the last reference on a dentry has > > been dropped, so it is safe for NFS to set and then clear the flag while > > holding a reference - the setting of the flag cannot cause a > > misunderstanding. > > > > So we define DCACHE_NFS_PENDING_UNLINK as an alias for DCACHE_DONTCACHE > > and add a definition to nfs_fs.h so that if NFS ever does find a need to > > call d_mark_dontcache() the build will fail with a suitable error. > > > > Signed-off-by: NeilBrown <neilb@xxxxxxx> > > --- > > > > Hi Trond/Anna, > > this patch is a precursor for my parallel-directory-updates patch set. > > I would be particularly helpful if this (and the nfsd patches I > > recently sent) could land for the next merge window. Then I could post > > a substantially reduced series to implement parallel directory > > updates, which would then be easier for other to review. > > > > Thanks, > > NeilBrown > > > > > > fs/nfs/dir.c | 23 ++++++++++++++++------- > > include/linux/nfs_fs.h | 14 ++++++++++++++ > > 2 files changed, 30 insertions(+), 7 deletions(-) > > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c > > index 0c4e8dd6aa96..695bb057cbd2 100644 > > --- a/fs/nfs/dir.c > > +++ b/fs/nfs/dir.c > > @@ -1778,6 +1778,8 @@ __nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags, > > int ret; > > > > if (flags & LOOKUP_RCU) { > > + if (dentry->d_flags & DCACHE_NFS_PENDING_UNLINK) > > + return -ECHILD; > > parent = READ_ONCE(dentry->d_parent); > > dir = d_inode_rcu(parent); > > if (!dir) > > @@ -1786,6 +1788,9 @@ __nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags, > > if (parent != READ_ONCE(dentry->d_parent)) > > return -ECHILD; > > } else { > > + /* Wait for unlink to complete */ > > + wait_var_event(&dentry->d_flags, > > + !(dentry->d_flags & DCACHE_NFS_PENDING_UNLINK)); > > parent = dget_parent(dentry); > > ret = reval(d_inode(parent), dentry, flags); > > dput(parent); > > @@ -2454,7 +2459,6 @@ static int nfs_safe_remove(struct dentry *dentry) > > int nfs_unlink(struct inode *dir, struct dentry *dentry) > > { > > int error; > > - int need_rehash = 0; > > > > dfprintk(VFS, "NFS: unlink(%s/%lu, %pd)\n", dir->i_sb->s_id, > > dir->i_ino, dentry); > > @@ -2469,15 +2473,20 @@ int nfs_unlink(struct inode *dir, struct dentry *dentry) > > error = nfs_sillyrename(dir, dentry); > > goto out; > > } > > - if (!d_unhashed(dentry)) { > > - __d_drop(dentry); > > - need_rehash = 1; > > - } > > + /* We must prevent any concurrent open until the unlink > > + * completes. ->d_revalidate will wait for DCACHE_NFS_PENDING_UNLINK > > + * to clear. We set it here to ensure no lookup succeeds until > > + * the unlink is complete on the server. > > + */ > > + dentry->d_flags |= DCACHE_NFS_PENDING_UNLINK; > > + > > spin_unlock(&dentry->d_lock); > > error = nfs_safe_remove(dentry); > > nfs_dentry_remove_handle_error(dir, dentry, error); > > - if (need_rehash) > > - d_rehash(dentry); > > + spin_lock(&dentry->d_lock); > > + dentry->d_flags &= ~DCACHE_NFS_PENDING_UNLINK; > > + spin_unlock(&dentry->d_lock); > > + wake_up_var(&dentry->d_flags); > > out: > > trace_nfs_unlink_exit(dir, dentry, error); > > return error; > > diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h > > index a17c337dbdf1..041a6076e045 100644 > > --- a/include/linux/nfs_fs.h > > +++ b/include/linux/nfs_fs.h > > @@ -617,6 +617,20 @@ nfs_fileid_to_ino_t(u64 fileid) > > > > #define NFS_JUKEBOX_RETRY_TIME (5 * HZ) > > > > +/* We need to block new opens while a file is being unlinked. > > + * If it is opened *before* we decide to unlink, we will silly-rename > > + * instead. If it is opened *after*, then we the open to fail unless it creates > > "then we allow the open to fail" Actually it is "then we need the open to fail". I should probably do a complete re-write of that para. > > > + * a new file. > > + * If we allow the open and unlink to race, we could end up with a file that is > > + * open but deleted on the server resulting in ESTALE. > > + * So overload DCACHE_DONTCACHE to record when the unlink is happening > > + * and block dentry revalidation while it is set. > > + * DCACHE_DONTCACHE is only used by filesystems which call d_mark_dontcache() > > + * which NFS never calls. It is only tested on a dentry on which all references > > + * have been dropped, so it is safe for NFS to set it while holding a reference. > > + */ > > +#define DCACHE_NFS_PENDING_UNLINK DCACHE_DONTCACHE > > +#define d_mark_dontcache(i) BUILD_BUG_ON_MSG(1, "NFS cannot use d_mark_dontcache()") > > > > # undef ifdebug > > # ifdef NFS_DEBUG > > Wow, we really are out of dentry flags. I wonder if some of them are no > longer needed? > > This overloading is a bit klunky but it's probably OK. AFAICT, > 0x80000000 is still available though if this turns out to be too nasty. > It looks like 0x08000000 may also be free I need one of those two in a subsequent patch to lock a dentry while the name/link is being created. If I used the other for NFS_PENDING_UNLINK we would be completely out. This flag really should be completely private to nfs. d_fsdata would be the best place to put it. But NFS doesn't have a permanent d_fsdata in which I can store a bit. Nor does it leave d_fsdata untouched, so I cannot store a magic value in there. There are two different uses of d_fsdata. I don't fully understand when they are active, so I don't know if it is safe to add another independent use - I suspect not though. > > > Reviewed-by: Jeff Layton <jlayton@xxxxxxxxxx> > Thanks, NeilBrown