Re: [PATCH] nfsd: serialize state seqid morphing operations

On Tue, 29 Sep 2015 19:14:27 -0400
"J. Bruce Fields" <bfields@xxxxxxxxxxxx> wrote:

> On Tue, Sep 29, 2015 at 05:26:38PM -0400, Jeff Layton wrote:
> > On Tue, 29 Sep 2015 17:11:55 -0400
> > "J. Bruce Fields" <bfields@xxxxxxxxxxxx> wrote:
> > 
> > > On Thu, Sep 17, 2015 at 07:47:08AM -0400, Jeff Layton wrote:
> > > > Andrew was seeing a race occur when an OPEN and OPEN_DOWNGRADE were
> > > > running in parallel. The server would receive the OPEN_DOWNGRADE first
> > > > and check its seqid, but then an OPEN would race in and bump it. The
> > > > OPEN_DOWNGRADE would then complete and bump the seqid again.  The result
> > > > was that the OPEN_DOWNGRADE would be applied after the OPEN, even though
> > > > it should have been rejected since the seqid changed.
> > > > 
> > > > The only recourse we have here I think is to serialize operations that
> > > > bump the seqid in a stateid, particularly when we're given a seqid in
> > > > the call. To address this, we add a new rw_semaphore to the
> > > > nfs4_ol_stateid struct. We do a down_write prior to checking the seqid
> > > > after looking up the stateid to ensure that nothing else is going to
> > > > bump it while we're operating on it.
> > > > 
> > > > In the case of OPEN, we do a down_read, as the call doesn't contain a
> > > > seqid. Those can run in parallel -- we just need to serialize them when
> > > > there is a concurrent OPEN_DOWNGRADE or CLOSE.
> > > > 
> > > > LOCK and LOCKU however always take the write lock as there is no
> > > > opportunity for parallelizing those.
> > > > 
> > > > Reported-and-Tested-by: Andrew W Elble <aweits@xxxxxxx>
> > > > Signed-off-by: Jeff Layton <jeff.layton@xxxxxxxxxxxxxxx>
> > > > ---
> > > >  fs/nfsd/nfs4state.c | 33 ++++++++++++++++++++++++++++-----
> > > >  fs/nfsd/state.h     | 19 ++++++++++---------
> > > >  2 files changed, 38 insertions(+), 14 deletions(-)
> > > > 
> > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > > index 0f1d5691b795..1b39edf10b67 100644
> > > > --- a/fs/nfsd/nfs4state.c
> > > > +++ b/fs/nfsd/nfs4state.c
> > > > @@ -3360,6 +3360,7 @@ static void init_open_stateid(struct nfs4_ol_stateid *stp, struct nfs4_file *fp,
> > > >  	stp->st_access_bmap = 0;
> > > >  	stp->st_deny_bmap = 0;
> > > >  	stp->st_openstp = NULL;
> > > > +	init_rwsem(&stp->st_rwsem);
> > > >  	spin_lock(&oo->oo_owner.so_client->cl_lock);
> > > >  	list_add(&stp->st_perstateowner, &oo->oo_owner.so_stateids);
> > > >  	spin_lock(&fp->fi_lock);
> > > > @@ -4187,15 +4188,20 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
> > > >  	 */
> > > >  	if (stp) {
> > > >  		/* Stateid was found, this is an OPEN upgrade */
> > > > +		down_read(&stp->st_rwsem);
> > > >  		status = nfs4_upgrade_open(rqstp, fp, current_fh, stp, open);
> > > > -		if (status)
> > > > +		if (status) {
> > > > +			up_read(&stp->st_rwsem);
> > > >  			goto out;
> > > > +		}
> > > >  	} else {
> > > >  		stp = open->op_stp;
> > > >  		open->op_stp = NULL;
> > > >  		init_open_stateid(stp, fp, open);
> > > > +		down_read(&stp->st_rwsem);
> > > >  		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
> > > >  		if (status) {
> > > > +			up_read(&stp->st_rwsem);
> > > >  			release_open_stateid(stp);
> > > >  			goto out;
> > > >  		}
> > > > @@ -4207,6 +4213,7 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
> > > >  	}
> > > >  	update_stateid(&stp->st_stid.sc_stateid);
> > > >  	memcpy(&open->op_stateid, &stp->st_stid.sc_stateid, sizeof(stateid_t));
> > > > +	up_read(&stp->st_rwsem);
> > > >  
> > > >  	if (nfsd4_has_session(&resp->cstate)) {
> > > >  		if (open->op_deleg_want & NFS4_SHARE_WANT_NO_DELEG) {
> > > 
> > > The patch looks good, but:
> > > 
> > > Does it matter that we don't have an exclusive lock over that
> > > update_stateid?
> > > 
> > > I think there's at least one small bug there:
> > > 
> > > 	static inline void update_stateid(stateid_t *stateid)
> > > 	{       
> > > 	        stateid->si_generation++;
> > > 	        /* Wraparound recommendation from 3530bis-13 9.1.3.2: */
> > > 	        if (stateid->si_generation == 0)
> > > 	                stateid->si_generation = 1;
> > > 	}
> > > 
> > > The si_generation increment isn't atomic, and even if it were, the
> > > wraparound handling definitely wouldn't be.  That's a pretty small race.
> > > 
> > 
> > Yeah, I eyeballed that some time ago and convinced myself it was ok,
> > but I think you're right that there is a potential race there. That
> > counter sort of seems like something that ought to use atomics in some
> > fashion. The wraparound is tricky, but could be done locklessly with
> > cmpxchg, I think...
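
Something like this, perhaps (completely untested, just to sketch the
cmpxchg idea):

	static inline void update_stateid(stateid_t *stateid)
	{
		u32 old, new;

		do {
			old = stateid->si_generation;
			new = old + 1;
			/* Wraparound recommendation from 3530bis-13 9.1.3.2: */
			if (new == 0)
				new = 1;
		} while (cmpxchg(&stateid->si_generation, old, new) != old);
	}
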
> > > Does it also matter that this si_generation update isn't atomic with respect
> > > to the actual open and upgrade of the share bits?
> > > 
> > 
> > I don't think so. If you race with a concurrent OPEN upgrade, the
> > server will either have what you requested or a superset of that. It's
> > only OPEN_DOWNGRADE and CLOSE that need full serialization.
> 
> IO also takes stateids, and I think it'd be interesting to think about
> scenarios that involve concurrent opens and reads or writes.  For
> example:
> 
> 	- process_open2 upgrades R to RW
> 
> 				- Write processed with si_generation=1
> 
> 	- process_open2 bumps si_generation
> 
> The write succeeds, but it was performed with a stateid that represented
> a read-only open.  I believe that write should have gotten either
> OPENMODE (if it happened before the open) or OLD_STATEID (if it happened
> after).
> 
> I don't know if that's actually a problem in practice.
> 

That's a different issue altogether. We're not serializing anything but
operations that morph the seqid in this patch, and WRITE doesn't do
that.

If we need to hold the seqid static for the duration of any operation
that presents that seqid, then that's a much larger effort, and probably
not suitable for a rw semaphore. You'd need to allow for a 3rd class of
"lockers" that can run in parallel with one another but that prevent any
changes to the seqid.
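
Just to sketch the idea (untested, made-up names; the seqid-morphing
ops would still need the existing rwsem-style exclusion among
themselves on top of this):

	/*
	 * IO ops ("users") can run in parallel with one another, but
	 * must exclude anything that bumps the seqid ("morphers").
	 */
	struct seqid_class_lock {
		spinlock_t		lock;
		unsigned int		morphers;
		unsigned int		users;
		wait_queue_head_t	wq;
	};

	static void seqid_user_lock(struct seqid_class_lock *scl)
	{
		spin_lock(&scl->lock);
		while (scl->morphers) {
			/* drop the lock while waiting for morphers to drain */
			spin_unlock(&scl->lock);
			wait_event(scl->wq, scl->morphers == 0);
			spin_lock(&scl->lock);
		}
		scl->users++;
		spin_unlock(&scl->lock);
	}

	static void seqid_user_unlock(struct seqid_class_lock *scl)
	{
		spin_lock(&scl->lock);
		if (--scl->users == 0)
			wake_up_all(&scl->wq);
		spin_unlock(&scl->lock);
	}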

> Hm, I also wonder about the window between the si_generation bump and
> memcpy.  Multiple concurrent opens could end up returning the same
> stateid:
> 
> 	- si_generation++
> 				- si_generation++
> 
> 	- memcpy new stateid
> 				- memcpy new stateid
> 
> Again, I'm not sure if that will actually cause application-visible
> problems.
> 

That could be a problem.

One thing to notice is that update_stateid is always followed by the
memcpy of that stateid. Perhaps we should roll the two together --
update the stateid and do the memcpy under a spinlock or something.
That would also make it trivial to fix the wraparound problem.
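
Something like this, say (rough sketch; whether the lock is global or
lives in the nfs4_stid is the question below):

	static DEFINE_SPINLOCK(stateid_lock);	/* hypothetical global lock */

	static void update_and_copy_stateid(stateid_t *dst, struct nfs4_stid *stid)
	{
		spin_lock(&stateid_lock);
		stid->sc_stateid.si_generation++;
		/* Wraparound recommendation from 3530bis-13 9.1.3.2: */
		if (stid->sc_stateid.si_generation == 0)
			stid->sc_stateid.si_generation = 1;
		memcpy(dst, &stid->sc_stateid, sizeof(stateid_t));
		spin_unlock(&stateid_lock);
	}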

Do you think we could get away with a global spinlock for that, or do
we need to consider adding one to the nfs4_stid?

-- 
Jeff Layton <jlayton@xxxxxxxxxxxxxxx>


