Re: handling error on RECLAIM_COMPLETE

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Tue, 24 Jan 2017 21:11:34 +0000

On Tue, 2017-01-24 at 15:48 -0500, Olga Kornievskaia wrote:
> On Tue, Jan 24, 2017 at 3:35 PM, Trond Myklebust
> <trondmy@xxxxxxxxxxxxxxx> wrote:
> > 
> > > On Jan 24, 2017, at 14:50, Olga Kornievskaia <aglo@xxxxxxxxx>
> > > wrote:
> > > 
> > > On Tue, Jan 24, 2017 at 2:44 PM, Trond Myklebust
> > > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > 
> > > > > On Jan 24, 2017, at 14:40, Olga Kornievskaia <aglo@xxxxxxxxx>
> > > > > wrote:
> > > > > 
> > > > > On Tue, Jan 24, 2017 at 2:12 PM, Trond Myklebust
> > > > > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > > 
> > > > > > > On Jan 24, 2017, at 14:06, Olga Kornievskaia <aglo@umich.
> > > > > > > edu> wrote:
> > > > > > > 
> > > > > > > Hi Trond,
> > > > > > > 
> > > > > > > Is there a reason that nfs4_proc_reclaim_complete()
> > > > > > > isn't  wrapped
> > > > > > > with a do while() to handle errors?
> > > > > > 
> > > > > > What do we not already handle correctly in
> > > > > > nfs4_reclaim_complete_done()?
> > > > > 
> > > > > Could this be because when an error occurs rpc_done isn't
> > > > > called
> > > > > (rpc_release is called)? What I see is that if
> > > > > RECLAIM_COMPLETE gets
> > > > > an error (BAD_SESSION) the client just ignores it.
> > > > > 
> > > > 
> > > > That’s correct. Why do we need to handle BAD_SESSION there?
> > > > We’re done with state recovery, so if the server rebooted, we
> > > > can catch that later.
> > > 
> > > (1) don't we want to handle session errors as soon as possible?
> > > (2) I ran into a problem (not sure yet if reproducible) where I
> > > had a
> > > client stuck in an infinite loop of RECLAIM_COMPLETE being sent
> > > with
> > > reply of BAD_SESSION.
> > > 
> > > yes I don't know why the client is looping but it made me look
> > > into
> > > the fact that we are not handling session errors on reclaim
> > > complete
> > > which I simulated by having the server return BAD_SESSION to
> > > RECLAIM_COMPLETE and I see that client simply ignores it.
> > 
> > It doesn’t ignore it:
> > 
> > static int nfs41_reclaim_complete_handle_errors(struct rpc_task
> > *task, struct nfs_client *clp)
> > {
> >         switch(task->tk_status) {
> >         case 0:
> >         case -NFS4ERR_COMPLETE_ALREADY:
> >         case -NFS4ERR_WRONG_CRED: /* What to do here? */
> >                 break;
> >         case -NFS4ERR_DELAY:
> >                 rpc_delay(task, NFS4_POLL_RETRY_MAX);
> >                 /* fall through */
> >         case -NFS4ERR_RETRY_UNCACHED_REP:
> >                 return -EAGAIN;
> >         default:
> >                 nfs4_schedule_lease_recovery(clp);
> >         }
> >         return 0;
> > }
> > 
> > IOW: what the code does is schedule another round of lease
> > recovery.
> 
> We already agreed that this doesn't get called because rpc_call_done
> isn't called on the error.

What am I missing? Why wouldn't it get called?

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx
��.n��������+%������w��{.n�����{��w���jg��������ݢj����G�������j:+v���w�m������w�������h�����٥