Re: [PATCH] nfsd: CB_RECALL can race with FREE_STATEID

Scott Mayhew <smayhew@xxxxxxxxxx> · Tue, 30 Apr 2019 14:46:14 -0400

On Thu, 18 Apr 2019, Trond Myklebust wrote:

> On Thu, 2019-04-18 at 16:50 -0400, Scott Mayhew wrote:
> > On Thu, 18 Apr 2019, J. Bruce Fields wrote:
> > 
> > > On Thu, Apr 18, 2019 at 09:24:00AM -0400, Scott Mayhew wrote:
> > > > While trying to track down some issues involving large numbers of
> > > > delegations being recalled/revoked, I caught the server setting
> > > > SEQ4_STATUS_CB_PATH_DOWN while the client was actively responding
> > > > to
> > > > CB_RECALLs.  It turns out that the client had already done a
> > > > TEST_STATEID and FREE_STATEID for a delegation being recalled by
> > > > the
> > > > time it received the CB_RECALL.
> > > 
> > > That's interesting, thanks!
> > > 
> > > This exception seems awfully narrow, though.
> > > 
> > > If we get back any NFS-level error at all, then I think the
> > > callback
> > > channel is working (am I wrong?)
> > 
> > Correct, if the client replies with either NFS4ERR_DELAY or
> > NFS4ERR_BAD_STATEID, the server will retry 1 time (see dl_retries).
> > After that, we fall thru and nfsd4_cb_recall_done() returns -1 which
> > causes the SEQ4_STATUS_CB_PATH_DOWN flag to be set.
> 
> There is no handling of NFS4ERR_DELAY in nfsd4_cb_recall_done().
> 
> As far as I can see, therefore, if the client returns NFS4ERR_DELAY
> (which it usually does if it is already in the process of returning the
> delegation) then the recall will fail immediately.

You're right, I had NFS4ERR_DELAY on the brain because I was seeing
those periodically in conjunction with the BIND_CONN_TO_SESSION calls
that were occurring while handling the bogus CB_PATH_DOWN flags from the
server.

-Scott
> 
> > > and telling the client to set up a new
> > > one is probably not going to help.  The best we can do is probably
> > > just
> > > give up
> > 
> > That's what the patch is essentially doing.  Or are you saying don't
> > even bother with the checks but still return 1 so we don't set the
> > SEQ4_STATUS_CB_PATH_DOWN flag?
> > 
> > > and let the client deal with the ensuing
> > > RECALLABLE_STATE_REVOKED flag.
> > 
> > The client's already dealing with the RECALLABLE_STATE_REVOKED flag,
> > that's why it sent a TEST_STATEID and FREE_STATEID before it got this
> > particular CB_RECALL.  The idea behind the patch is to not give the
> > state manager on the client additional work by setting CB_PATH_DOWN
> > when
> > the callback channel is clearly working...
> > 
> 
> Either way, the Linux client will ignore any further sequence flags
> until it is done with the recovery of the RECALLABLE_STATE_REVOKED
> flag. The reason is that the flags are edge triggered (i.e. they don't
> clear until the state changes), and so we need to be able to perform a
> full recovery before we can check them again.
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@xxxxxxxxxxxxxxx
> 
>