Thanks for taking an initial look. Comments below regarding the issue you mentioned about uninterruptible sleep, NFS server going away, etc. On Tue, 2013-02-19 at 08:59 -0500, Dave Wysochanski wrote: > Commit 73ca100 broke the code that prevents the client from deleting > a silly renamed dentry. This affected "delete on last close" > semantics as after that commit, nothing prevented removal of > silly-renamed files. As a result, a process holding a file open > could easily get an ESTALE on the file in a directory where some > other process issued 'rm -rf some_dir_containing_the_file' twice. > Before the commit, any attempt at unlinking silly renamed files would > fail inside may_delete() with -EBUSY because of the > DCACHE_NFSFS_RENAMED flag. The following testcase demonstrates > the problem: > tail -f /nfsmnt/dir/file & > rm -rf /nfsmnt/dir > rm -rf /nfsmnt/dir > # second removal does not fail, 'tail' process receives ESTALE > > The problem with the above commit is that it unhashes the old and > new dentries from the lookup path, even in the normal case when > a signal is not encountered and it would have been safe to call > d_move. Unfortunately the old dentry has the special > DCACHE_NFSFS_RENAMED flag set on it. Unhashing has the > side-effect that future lookups call d_alloc(), allocating a new > dentry without the special flag for any silly-renamed files. As a > result, subsequent calls to unlink silly renamed files do not fail > but allow the removal to go through. This will result in ESTALE > errors for any other process doing operations on the file. > > To fix this, go back to using d_move. However, to use d_move safely, > we have to isolate the special case where we receive a signal. > To handle this, we add two flags to nfs_renamedata, > NFS_RENAME_CANCELLED and NFS_RENAME_CALLBACK_IN_PROGRESS. > > If a signal occurs, inside nfs_sillyrename() we first set > NFS_RENAME_CANCELLED and then wait for NFS_RENAME_CALLBACK_IN_PROGRESS > to clear. Inside nfs_async_rename_done(), we set > NFS_RENAME_CALLBACK_IN_PROGRESS at the top and clear at the bottom, > waking up anyone sleeping on the bit. Before calling d_move, we > check NFS_RENAME_CANCELLED and if set, we d_drop instead of calling > d_move. If this case is hit, and we have to d_drop, a process will > still receive an ESTALE, but the non-signal case is not affected. > > For the signal case, it's unclear what we may safely do beyond d_drop > as we've dropped the mutexes, and to call d_move we must reacquire > them. I made an attempt at reacquiring mutexes and then checking > for various scenarios, but the testcases and the code quickly became > fairly complex and seemed like the wrong approach. > > Signed-off-by: Dave Wysochanski <dwysocha@xxxxxxxxxx> > Reviewed-by: Jeff Layton <jlayton@xxxxxxxxxx> > --- > fs/nfs/unlink.c | 39 +++++++++++++++++++++++++++++++++++++-- > include/linux/nfs_xdr.h | 3 +++ > 2 files changed, 40 insertions(+), 2 deletions(-) > > diff --git a/fs/nfs/unlink.c b/fs/nfs/unlink.c > index 3f79c77..4e2d81d 100644 > --- a/fs/nfs/unlink.c > +++ b/fs/nfs/unlink.c > @@ -323,6 +323,19 @@ nfs_cancel_async_unlink(struct dentry *dentry) > spin_unlock(&dentry->d_lock); > } > > +static int nfs_async_rename_wait_bit(void *flags) > +{ > + schedule(); > + return 0; > +} > +static inline void nfs_async_rename_wake_up(struct nfs_renamedata *data) > +{ > + smp_mb__before_clear_bit(); > + clear_bit(NFS_RENAME_CALLBACK_IN_PROGRESS, &data->flags); > + smp_mb__after_clear_bit(); > + wake_up_bit(&data->flags, NFS_RENAME_CALLBACK_IN_PROGRESS); > +} > + > /** > * nfs_async_rename_done - Sillyrename post-processing > * @task: rpc_task of the sillyrename > @@ -338,18 +351,32 @@ static void nfs_async_rename_done(struct rpc_task *task, void *calldata) > struct dentry *old_dentry = data->old_dentry; > struct dentry *new_dentry = data->new_dentry; > > + set_bit(NFS_RENAME_CALLBACK_IN_PROGRESS, &data->flags); > + /* ensure the read of NFS_RENAME_CANCELLED is not reordered */ > + smp_mb(); > + > if (!NFS_PROTO(old_dir)->rename_done(task, old_dir, new_dir)) { > rpc_restart_call_prepare(task); > + nfs_async_rename_wake_up(data); > return; > } > > if (task->tk_status != 0) { > nfs_cancel_async_unlink(old_dentry); > + nfs_async_rename_wake_up(data); > + return; > + } > + > + if (test_bit(NFS_RENAME_CANCELLED, &data->flags)) { > + d_drop(old_dentry); > + d_drop(new_dentry); > + nfs_async_rename_wake_up(data); > return; > } > > - d_drop(old_dentry); > - d_drop(new_dentry); > + nfs_set_verifier(data->old_dentry, nfs_save_change_attribute(old_dir)); > + d_move(data->old_dentry, data->new_dentry); > + nfs_async_rename_wake_up(data); > } > > /** > @@ -483,6 +510,7 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry) > struct dentry *sdentry; > struct rpc_task *task; > int error = -EIO; > + struct nfs_renamedata *data; > > dfprintk(VFS, "NFS: silly-rename(%s/%s, ct=%d)\n", > dentry->d_parent->d_name.name, dentry->d_name.name, > @@ -550,6 +578,13 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry) > error = rpc_wait_for_completion_task(task); > if (error == 0) > error = task->tk_status; > + else if (error == -ERESTARTSYS) /* We got a signal */ { > + data = (struct nfs_renamedata *)task->tk_calldata; > + set_bit(NFS_RENAME_CANCELLED, &data->flags); > + /* barrier inside wait_on_bit before we read the bit */ > + wait_on_bit(&data->flags, NFS_RENAME_CALLBACK_IN_PROGRESS, > + nfs_async_rename_wait_bit, TASK_UNINTERRUPTIBLE); > + } I think you were mentioning concern about the above area, and the fact we may end up in an uninterruptible sleep if the NFS server goes away at just the wrong time. If I've misunderstood your concern, please clarify. First, consider the fact that NFS_RENAME_CALLBACK_IN_PROGRESS will only be set if we have received the reply from the server, and have started processing nfs_async_rename_done. So we could get stuck, but it would be somewhere inside that function. Second, note that the possibility of the process inside nfs_sillyrename() dropping the i_mutex as a result of a signal was the reason Jeff did the original commit, which avoided doing d_move in favor of d_drop. I can think of a couple things we could do: 1. Move the setting of NFS_RENAME_CALLBACK_IN_PROGRESS just before the test for NFS_RENAME_CANCELLED and before the d_move, i.e. here: + if (test_bit(NFS_RENAME_CANCELLED, &data->flags)) { This is really all we are concerned about anyway, and why we want to stop the process inside nfs_sillyrename() if we've received a signal - we don't want the possibility that the caller could drop the i_mutex. We would still be vulnerable to sleeping UNINTERRUPTIBLY inside d_move but I think this is required - bad things can happen if we don't. 2. Change the UNINTERRUPTIBLE sleep to an INTERRUPTIBLE. However, if we do that, and we break out of that wait_on_bit, we'll need to make sure we're not stuck inside d_move(), which requires the i_mutex be held. Since we had received the reply from the NFS server by the time we go to UNINTERRUPTIBLE sleep, I did not think an UNINTERRUPTIBLE sleep there was unreasonable. If it is, we can try one of the above modifications to the patch. > rpc_put_task(task); > out_dput: > dput(sdentry); > diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h > index 29adb12..136e3d4 100644 > --- a/include/linux/nfs_xdr.h > +++ b/include/linux/nfs_xdr.h > @@ -1339,6 +1339,8 @@ struct nfs_unlinkdata { > struct nfs_fattr dir_attr; > }; > > +#define NFS_RENAME_CANCELLED (1U << 0) > +#define NFS_RENAME_CALLBACK_IN_PROGRESS (1U << 1) > struct nfs_renamedata { > struct nfs_renameargs args; > struct nfs_renameres res; > @@ -1349,6 +1351,7 @@ struct nfs_renamedata { > struct inode *new_dir; > struct dentry *new_dentry; > struct nfs_fattr new_fattr; > + unsigned long flags; > }; > > struct nfs_access_entry; -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html