Linux-NFS responds to NFS4ERR_OLD_STATEID by simply retrying the request, hoping to make use of an updated stateid that might have arrived from the server. This is usually successful. However if the client and server get out-of-sync for some reason and if there is no new stateid to try, this can result in an indefinite loop which looks a bit like a DoS attack. This can particularly happen when a server replies with success to an OPEN request, but fails a subsequent GETATTR. This has been observed with Netapp and Hitachi servers when a concurrent unlink from a different client removes the file between the OPEN and the GETATTR. The GETATTR returns NFS4ERR_STALE. In this situation Linux-NFS ignores the successful open and in particular does not record the new stateid. If the same file was already open, those active opens will now have an OLD_STATEID. When the Netapp/Hitachi servers receive subseqent WRITE requests with the STALE file handle and OLD stateid, they choose to report the OLD_STATEID error rather than the STALE error. This causes Linux-NFS to loop indefinitely. To protect against this or other scenarios that might result in old stateids being used, this patch limits the number of retry attempts when OLD_STATEID is received. The first 4 retries are sent immediately after an error. After that the standard timed back-off retry scheme is used until a 2-second delay is reached. After than the OLD_STATEID error is mapped to ESTALE so the write fails rather than hanging indefinitely. This pattern of server behaviour can be demonstrated - and so the patch can be tested - by modifying the Linux NFS server as follows: 1/ The second time a file is opened, unlink it. This simulates a race with another client, without needing to have a race: > --- a/fs/nfsd/nfs4proc.c > +++ b/fs/nfsd/nfs4proc.c > @@ -594,6 +594,12 @@ nfsd4_open(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate, > if (reclaim && !status) > nn->somebody_reclaimed = true; > out: > + if (!status && open->op_stateid.si_generation > 1) { > + printk("Opening gen %d\n", (int)open->op_stateid.si_generation); > + vfs_unlink(mnt_user_ns(resfh->fh_export->ex_path.mnt), > + resfh->fh_dentry->d_parent->d_inode, > + resfh->fh_dentry, NULL); > + } > if (open->op_filp) { > fput(open->op_filp); > open->op_filp = NULL; 2/ When a GETATTR op is attempted on an unlinked file, return ESTALE > @@ -852,6 +858,11 @@ nfsd4_getattr(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate, > if (status) > return status; > > + if (cstate->current_fh.fh_dentry->d_inode->i_nlink == 0) { > + printk("Return Estale for unlinked file\n"); > + return nfserr_stale; > + } > + > if (getattr->ga_bmval[1] & NFSD_WRITEONLY_ATTRS_WORD1) > return nfserr_inval; Then mount the filesystem and Thread 1 Thread 2 open a file open the same file (will fail) write to that file I use this shell fragment, using 'sleep' for synchronisation. The use of /bin/echo ensures the write is flushed when /bin/echo closes the fd on exit. ( /bin/echo hello sleep 3 /bin/echo there ) > /import/A/afile & sleep 3 cat /import/A/afile Signed-off-by: NeilBrown <neilb@xxxxxxx> --- fs/nfs/nfs4proc.c | 21 +++++++++++++++++++-- include/linux/nfs_xdr.h | 2 ++ 2 files changed, 21 insertions(+), 2 deletions(-) diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c index 40d749f29ed3..ee77bf42ec38 100644 --- a/fs/nfs/nfs4proc.c +++ b/fs/nfs/nfs4proc.c @@ -408,7 +408,7 @@ static long nfs4_update_delay(long *timeout) long ret; if (!timeout) return NFS4_POLL_RETRY_MAX; - if (*timeout <= 0) + if (*timeout <= NFS4_POLL_RETRY_MIN) *timeout = NFS4_POLL_RETRY_MIN; if (*timeout > NFS4_POLL_RETRY_MAX) *timeout = NFS4_POLL_RETRY_MAX; @@ -562,9 +562,24 @@ static int nfs4_do_handle_exception(struct nfs_server *server, return 0; case -NFS4ERR_RETRY_UNCACHED_REP: - case -NFS4ERR_OLD_STATEID: exception->retry = 1; break; + case -NFS4ERR_OLD_STATEID: + /* Simple retry is usually safe, but buggy + * servers can cause DoS - so be careful. + */ + if (exception->timeout > HZ*2) { + ret = -ESTALE; + } else if (exception->timeout < 5) { + /* 5 tries with no delay */ + exception->retry = 1; + exception->timeout += 1; + ret = 0; + } else { + exception->delay = 1; + ret = 0; + } + break; case -NFS4ERR_BADOWNER: /* The following works around a Linux server bug! */ case -NFS4ERR_BADNAME: @@ -5506,10 +5521,12 @@ static int nfs4_write_done_cb(struct rpc_task *task, .inode = hdr->inode, .state = hdr->args.context->state, .stateid = &hdr->args.stateid, + .timeout = hdr->timeout, }; task->tk_status = nfs4_async_handle_exception(task, NFS_SERVER(inode), task->tk_status, &exception); + hdr->timeout = exception.timeout; if (exception.retry) { rpc_restart_call_prepare(task); return -EAGAIN; diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h index e86cf6642d21..0c791e5b213c 100644 --- a/include/linux/nfs_xdr.h +++ b/include/linux/nfs_xdr.h @@ -1625,6 +1625,8 @@ struct nfs_pgio_header { unsigned int good_bytes; /* boundary of good data */ unsigned long flags; + long timeout; + /* * rpc data */ -- 2.39.1