On Wed, 9 Sep 2015 17:39:07 -0400
J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:

> On Wed, Sep 09, 2015 at 05:00:37PM -0400, Jeff Layton wrote:
> > On Wed, 9 Sep 2015 16:40:36 -0400
> > J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
> > 
> > > On Wed, Sep 09, 2015 at 03:18:01PM -0400, Jeff Layton wrote:
> > > > On Wed, 9 Sep 2015 15:01:54 -0400
> > > > Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> wrote:
> > > > 
> > > > > On Wed, Sep 9, 2015 at 2:49 PM, Jeff Layton <jeff.layton@xxxxxxxxxxxxxxx> wrote:
> > > > > > On Wed, 9 Sep 2015 13:49:44 -0400
> > > > > > Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> wrote:
> > > > > > 
> > > > > >> +Bruce, +Jeff...
> > > > > >>
> > > > > >> On Wed, Sep 9, 2015 at 1:12 PM, Trond Myklebust
> > > > > >> <trond.myklebust@xxxxxxxxxxxxxxx> wrote:
> > > > > >> > On Wed, Sep 9, 2015 at 9:37 AM, Andrew W Elble <aweits@xxxxxxx> wrote:
> > > > > >> >>
> > > > > >> >> In attempting to troubleshoot other issues, we've run into this race
> > > > > >> >> with 4.1.4 (both client and server) with a few cherry-picked patches
> > > > > >> >> from upstream. This is my attempt at a redacted packet capture.
> > > > > >> >>
> > > > > >> >> These all affect the same fh/stateid:
> > > > > >> >>
> > > > > >> >> 116 -> OPEN (will be an upgrade / for write)
> > > > > >> >> 117 -> OPEN_DOWNGRADE (to read for the existing stateid / seqid = 0x6)
> > > > > >> >>
> > > > > >> >> 121 -> OPEN_DOWNGRADE (completed last / seqid = 0x8)
> > > > > >> >> 122 -> OPEN (completed first / seqid = 0x7)
> > > > > >> >>
> > > > > >> >> Attempts to write using that stateid fail because the stateid doesn't
> > > > > >> >> have write access.
> > > > > >> >>
> > > > > >> >> Any thoughts? I can share more data from the capture if needed.
> > > > > >> >
> > > > > >> > Bruce & Jeff,
> > > > > >> >
> > > > > >> > Given that the client sent a non-zero seqid, why is the OPEN_DOWNGRADE
> > > > > >> > being executed after the OPEN here? Surely, if that is the case, the
> > > > > >> > server should be returning NFS4ERR_OLD_STATEID and failing the
> > > > > >> > OPEN_DOWNGRADE operation?
> > > > > 
> > > > > > The problem there is that we do the seqid checks at the beginning of
> > > > > > the operation. In this case it's likely that it was 0x6 when the
> > > > > > OPEN_DOWNGRADE started. The OPEN completed first though and bumped the
> > > > > > seqid, and then the downgrade finished and bumped it again. When we
> > > > > > bump the seqid, we don't verify it against what came in originally.
> > > > > > 
> > > > > > The question is whether that's wrong from the POV of the spec. RFC5661
> > > > > > doesn't seem to explicitly require that we serialize such operations on
> > > > > > the server. The closest thing I can find is this in 3.3.12:
> > > > > 
> > > > > RFC5661, section 8.2.2:
> > > > > 
> > > > >    Except for layout stateids (Section 12.5.3), when a client sends a
> > > > >    stateid to the server, it has two choices with regard to the seqid
> > > > >    sent. It may set the seqid to zero to indicate to the server that it
> > > > >    wishes the most up-to-date seqid for that stateid's "other" field to
> > > > >    be used. This would be the common choice in the case of a stateid
> > > > >    sent with a READ or WRITE operation. It also may set a non-zero
> > > > >    value, in which case the server checks if that seqid is the correct
> > > > >    one. In that case, the server is required to return
> > > > >    NFS4ERR_OLD_STATEID if the seqid is lower than the most current value
> > > > >    and NFS4ERR_BAD_STATEID if the seqid is greater than the most current
> > > > >    value. This would be the common choice in the case of stateids sent
> > > > >    with a CLOSE or OPEN_DOWNGRADE. Because OPENs may be sent in
> > > > >    parallel for the same owner, a client might close a file without
> > > > >    knowing that an OPEN upgrade had been done by the server, changing
> > > > >    the lock in question. If CLOSE were sent with a zero seqid, the OPEN
> > > > >    upgrade would be cancelled before the client even received an
> > > > >    indication that an upgrade had happened.
> > > > > 
> > > > > The suggestion there is clearly that the client can rely on the server
> > > > > not reordering those CLOSE/OPEN_DOWNGRADE operations w.r.t. a parallel
> > > > > OPEN. Otherwise, what is the difference between sending a non-zero
> > > > > seqid and zero?
> > > > > 
> > > > > > "The server is required to increment the "seqid" field by
> > > > > > one at each transition of the stateid. This is important since the
> > > > > > client will inspect the seqid in OPEN stateids to determine the order
> > > > > > of OPEN processing done by the server."
> > > > > > 
> > > > > > If we do need to fix this on the server, it's likely to be pretty ugly:
> > > > > > 
> > > > > > We'd either need to serialize seqid morphing operations (ugh), or make
> > > > > > update_stateid do a cmpxchg to swap it into place (or add some extra
> > > > > > locking around it), and then have some way to unwind all of the changes
> > > > > > if that fails. That may be impossible however -- we're likely closing
> > > > > > struct files after all.
> > > > > 
> > > > > Updates to the state are already required to be atomic. You can't have
> > > > > a stateid where an OPEN_DOWNGRADE or CLOSE only partially succeeded.
> > > > > 
> > > > > > 
> > > > > > Now, all of that said, I think the client has some bugs in its seqid
> > > > > > handling as well. It should have realized that the stateid was a r/o
> > > > > > one after the OPEN_DOWNGRADE came back with the higher seqid, but it
> > > > > > still issued a WRITE just afterward. That seems wrong.
> > > > > 
> > > > > No. The client is relying on the server not reordering the
> > > > > OPEN_DOWNGRADE. It expects either for the OPEN to happen first, and
> > > > > the OPEN_DOWNGRADE to fail, or for the OPEN_DOWNGRADE to happen first,
> > > > > and for both operations to succeed.
> > > > > 
> > > > > Trond
> > > > 
> > > > In that case, the "simple" fix would be to add a mutex to
> > > > nfs4_ol_stateid. Lock that in nfs4_preprocess_seqid_op, and ensure that
> > > > we unlock it after bumping the seqid (or on error).
> > > > 
> > > > Bruce, any thoughts?
> > > 
> > > Why isn't nfsd4_cstate_assign_replay()/nfsd4_cstate_clear_replay()
> > > already doing this with the so_replay.rp_mutex lock?
> > > 
> > > Looking at it.... OK, sorry, that's 4.0 only. I don't know if that
> > > should be shared in the session case.
> > 
> > Yeah, that's probably a bit heavyweight for v4.1. That mutex is in the
> > stateowner struct. The same stateowner could be opening different
> > files, and we wouldn't want to serialize those. I think we'd need
> > something in the stateid struct itself.
> > 
> > Trond also pointed out that we don't really need to serialize OPEN
> > calls, so we might be best off with something like a rw semaphore.
> > Take the read lock in OPEN, and the write lock for
> > OPEN_DOWNGRADE/CLOSE. LOCK/LOCKU will also need similar treatment, of
> > course.
> 
> OK, I think I agree. LOCK and LOCKU both need exclusive locks, right?
> 
> > I'm not sure about LAYOUTGET/LAYOUTRETURN/CLOSE though.
> 
> Me neither.
> 
> --b.

Andrew, could you test this patch out? This just covers open and lock
stateids. If it works, I'll clean up the comments and resend it to the
list as a PATCH email.

Assuming that it does, we'll need to consider what (if anything) to do
about layout stateids...

-- 
Jeff Layton <jeff.layton@xxxxxxxxxxxxxxx>
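For reference, the RFC5661, section 8.2.2 seqid rule the thread keeps
coming back to boils down to the comparison below. This is only an
illustrative sketch -- the function name and signature are made up, not
nfsd's actual helper -- but it shows why the result is only meaningful
if nothing can bump the seqid between this check and the eventual
update:

/*
 * Hypothetical sketch of the RFC5661, section 8.2.2 seqid rules;
 * not nfsd's actual helper.
 */
static __be32 sketch_check_seqid(u32 in, u32 cur, bool has_session)
{
        if (in == cur)
                return nfs_ok;
        /* NFSv4.1 only: a zero seqid means "use the current seqid" */
        if (has_session && in == 0)
                return nfs_ok;
        if (in < cur)
                return nfserr_old_stateid;      /* client is behind */
        return nfserr_bad_stateid;              /* client is ahead */
}

The patch below makes the comparison and the later seqid bump happen
under one lock, so the answer can't go stale in between.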
>From 55907bdf2270dd024e7079a0b2b4581e2f9edae4 Mon Sep 17 00:00:00 2001
From: Jeff Layton <jeff.layton@xxxxxxxxxxxxxxx>
Date: Fri, 11 Sep 2015 19:44:58 -0400
Subject: [PATCH] nfsd: serialize state seqid morphing operations

Andrew was seeing a race occur when an OPEN and OPEN_DOWNGRADE were
running in parallel. The server would receive the OPEN_DOWNGRADE first
and check its seqid, but then an OPEN would race in and bump it. The
OPEN_DOWNGRADE would then complete and bump the seqid again. The result
was that the OPEN_DOWNGRADE would be applied after the OPEN, even though
it should have been rejected since the seqid changed.

The only recourse we have here I think is to serialize operations that
bump the seqid in a stateid, particularly when we're given a seqid in
the call.

To address this, we add a new rw_semaphore to the nfs4_ol_stateid
struct. We do a down_write prior to checking the seqid after looking up
the stateid to ensure that nothing else is going to bump it while we're
operating on it. In the case of OPEN, we do a down_read, as the call
doesn't contain a seqid. Those can run in parallel -- we just need to
serialize them when there is a concurrent OPEN_DOWNGRADE or CLOSE.

LOCK and LOCKU however always take the write lock as there is no
opportunity for parallelizing those.

Signed-off-by: Jeff Layton <jeff.layton@xxxxxxxxxxxxxxx>
---
 fs/nfsd/nfs4state.c | 33 ++++++++++++++++++++++++++++-----
 fs/nfsd/state.h     | 19 ++++++++++---------
 2 files changed, 38 insertions(+), 14 deletions(-)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 0f1d5691b795..1b39edf10b67 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -3360,6 +3360,7 @@ static void init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
 	stp->st_access_bmap = 0;
 	stp->st_deny_bmap = 0;
 	stp->st_openstp = NULL;
+	init_rwsem(&stp->st_rwsem);
 	spin_lock(&oo->oo_owner.so_client->cl_lock);
 	list_add(&stp->st_perstateowner, &oo->oo_owner.so_stateids);
 	spin_lock(&fp->fi_lock);
@@ -4187,15 +4188,20 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
 	 */
 	if (stp) {
 		/* Stateid was found, this is an OPEN upgrade */
+		down_read(&stp->st_rwsem);
 		status = nfs4_upgrade_open(rqstp, fp, current_fh, stp, open);
-		if (status)
+		if (status) {
+			up_read(&stp->st_rwsem);
 			goto out;
+		}
 	} else {
 		stp = open->op_stp;
 		open->op_stp = NULL;
 		init_open_stateid(stp, fp, open);
+		down_read(&stp->st_rwsem);
 		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
 		if (status) {
+			up_read(&stp->st_rwsem);
 			release_open_stateid(stp);
 			goto out;
 		}
@@ -4207,6 +4213,7 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
 	}
 	update_stateid(&stp->st_stid.sc_stateid);
 	memcpy(&open->op_stateid, &stp->st_stid.sc_stateid, sizeof(stateid_t));
+	up_read(&stp->st_rwsem);
 
 	if (nfsd4_has_session(&resp->cstate)) {
 		if (open->op_deleg_want & NFS4_SHARE_WANT_NO_DELEG) {
@@ -4819,10 +4826,13 @@ static __be32 nfs4_seqid_op_checks(struct nfsd4_compound_state *cstate, stateid_
 		 * revoked delegations are kept only for free_stateid.
 		 */
 		return nfserr_bad_stateid;
+	down_write(&stp->st_rwsem);
 	status = check_stateid_generation(stateid, &stp->st_stid.sc_stateid, nfsd4_has_session(cstate));
-	if (status)
-		return status;
-	return nfs4_check_fh(current_fh, &stp->st_stid);
+	if (status == nfs_ok)
+		status = nfs4_check_fh(current_fh, &stp->st_stid);
+	if (status != nfs_ok)
+		up_write(&stp->st_rwsem);
+	return status;
 }
 
 /*
@@ -4869,6 +4879,7 @@ static __be32 nfs4_preprocess_confirmed_seqid_op(struct nfsd4_compound_state *cs
 		return status;
 	oo = openowner(stp->st_stateowner);
 	if (!(oo->oo_flags & NFS4_OO_CONFIRMED)) {
+		up_write(&stp->st_rwsem);
 		nfs4_put_stid(&stp->st_stid);
 		return nfserr_bad_stateid;
 	}
@@ -4899,11 +4910,14 @@ nfsd4_open_confirm(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 		goto out;
 	oo = openowner(stp->st_stateowner);
 	status = nfserr_bad_stateid;
-	if (oo->oo_flags & NFS4_OO_CONFIRMED)
+	if (oo->oo_flags & NFS4_OO_CONFIRMED) {
+		up_write(&stp->st_rwsem);
 		goto put_stateid;
+	}
 	oo->oo_flags |= NFS4_OO_CONFIRMED;
 	update_stateid(&stp->st_stid.sc_stateid);
 	memcpy(&oc->oc_resp_stateid, &stp->st_stid.sc_stateid, sizeof(stateid_t));
+	up_write(&stp->st_rwsem);
 	dprintk("NFSD: %s: success, seqid=%d stateid=" STATEID_FMT "\n",
 		__func__, oc->oc_seqid, STATEID_VAL(&stp->st_stid.sc_stateid));
 
@@ -4982,6 +4996,7 @@ nfsd4_open_downgrade(struct svc_rqst *rqstp,
 	memcpy(&od->od_stateid, &stp->st_stid.sc_stateid, sizeof(stateid_t));
 	status = nfs_ok;
 put_stateid:
+	up_write(&stp->st_rwsem);
 	nfs4_put_stid(&stp->st_stid);
 out:
 	nfsd4_bump_seqid(cstate, status);
@@ -5035,6 +5050,7 @@ nfsd4_close(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 		goto out;
 	update_stateid(&stp->st_stid.sc_stateid);
 	memcpy(&close->cl_stateid, &stp->st_stid.sc_stateid, sizeof(stateid_t));
+	up_write(&stp->st_rwsem);
 
 	nfsd4_close_open_stateid(stp);
 
@@ -5260,6 +5276,7 @@ init_lock_stateid(struct nfs4_ol_stateid *stp, struct nfs4_lockowner *lo,
 	stp->st_access_bmap = 0;
 	stp->st_deny_bmap = open_stp->st_deny_bmap;
 	stp->st_openstp = open_stp;
+	init_rwsem(&stp->st_rwsem);
 	list_add(&stp->st_locks, &open_stp->st_locks);
 	list_add(&stp->st_perstateowner, &lo->lo_owner.so_stateids);
 	spin_lock(&fp->fi_lock);
@@ -5428,6 +5445,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 					&open_stp, nn);
 		if (status)
 			goto out;
+		up_write(&open_stp->st_rwsem);
 		open_sop = openowner(open_stp->st_stateowner);
 		status = nfserr_bad_stateid;
 		if (!same_clid(&open_sop->oo_owner.so_client->cl_clientid,
@@ -5435,6 +5453,8 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 			goto out;
 		status = lookup_or_create_lock_state(cstate, open_stp, lock,
 						&lock_stp, &new);
+		if (status == nfs_ok)
+			down_write(&lock_stp->st_rwsem);
 	} else {
 		status = nfs4_preprocess_seqid_op(cstate,
 				       lock->lk_old_lock_seqid,
@@ -5540,6 +5560,8 @@ out:
 	    seqid_mutating_err(ntohl(status)))
 		lock_sop->lo_owner.so_seqid++;
 
+	up_write(&lock_stp->st_rwsem);
+
 	/*
 	 * If this is a new, never-before-used stateid, and we are
 	 * returning an error, then just go ahead and release it.
@@ -5709,6 +5731,7 @@ nfsd4_locku(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 fput:
 	fput(filp);
 put_stateid:
+	up_write(&stp->st_rwsem);
 	nfs4_put_stid(&stp->st_stid);
 out:
 	nfsd4_bump_seqid(cstate, status);
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 583ffc13cae2..31bde12feefe 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -534,15 +534,16 @@ struct nfs4_file {
  * Better suggestions welcome.
  */
 struct nfs4_ol_stateid {
-	struct nfs4_stid st_stid; /* must be first field */
-	struct list_head st_perfile;
-	struct list_head st_perstateowner;
-	struct list_head st_locks;
-	struct nfs4_stateowner * st_stateowner;
-	struct nfs4_clnt_odstate * st_clnt_odstate;
-	unsigned char st_access_bmap;
-	unsigned char st_deny_bmap;
-	struct nfs4_ol_stateid * st_openstp;
+	struct nfs4_stid st_stid;
+	struct list_head st_perfile;
+	struct list_head st_perstateowner;
+	struct list_head st_locks;
+	struct nfs4_stateowner *st_stateowner;
+	struct nfs4_clnt_odstate *st_clnt_odstate;
+	unsigned char st_access_bmap;
+	unsigned char st_deny_bmap;
+	struct nfs4_ol_stateid *st_openstp;
+	struct rw_semaphore st_rwsem;
 };
 
 static inline struct nfs4_ol_stateid *openlockstateid(struct nfs4_stid *s)
-- 
2.4.3
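For anyone who wants to experiment with the locking discipline outside
the kernel, here is a condensed, standalone sketch of what the patch
does, using a POSIX rwlock and a C11 atomic in place of the kernel's
rw_semaphore and update_stateid(). All names are hypothetical; this is
not nfsd code:

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

struct ol_stateid {
        pthread_rwlock_t rwsem;         /* plays the role of st_rwsem */
        _Atomic uint32_t seqid;         /* plays the role of si_generation */
};

/* OPEN presents no seqid to check, so a shared lock is enough: parallel
 * OPENs may proceed, but they exclude any seqid-checking operation. The
 * seqid is atomic because parallel OPENs may bump it concurrently. */
static void do_open(struct ol_stateid *stp)
{
        pthread_rwlock_rdlock(&stp->rwsem);
        /* ... merge the new access mode into the stateid ... */
        atomic_fetch_add(&stp->seqid, 1);       /* the seqid bump */
        pthread_rwlock_unlock(&stp->rwsem);
}

/* OPEN_DOWNGRADE/CLOSE/LOCK/LOCKU present a seqid, so they hold the
 * lock exclusively: the seqid cannot change between the check and the
 * bump, which closes the race Andrew captured. */
static int do_downgrade(struct ol_stateid *stp, uint32_t in_seqid)
{
        int status = 0;

        pthread_rwlock_wrlock(&stp->rwsem);
        if (in_seqid != atomic_load(&stp->seqid)) {
                status = -1;    /* think NFS4ERR_OLD_STATEID */
                goto out;
        }
        /* ... apply the downgrade ... */
        atomic_fetch_add(&stp->seqid, 1);
out:
        pthread_rwlock_unlock(&stp->rwsem);
        return status;
}

Using a rwlock rather than a plain mutex preserves Trond's point above:
OPENs carry no seqid to check, so they don't need to serialize against
each other, only against the seqid-checking operations.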