Re: handling error on RECLAIM_COMPLETE

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Tue, 24 Jan 2017 21:21:07 +0000

On Tue, 2017-01-24 at 16:11 -0500, Trond Myklebust wrote:
> On Tue, 2017-01-24 at 15:48 -0500, Olga Kornievskaia wrote:
> > On Tue, Jan 24, 2017 at 3:35 PM, Trond Myklebust
> > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > 
> > > > On Jan 24, 2017, at 14:50, Olga Kornievskaia <aglo@xxxxxxxxx>
> > > > wrote:
> > > > 
> > > > On Tue, Jan 24, 2017 at 2:44 PM, Trond Myklebust
> > > > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > 
> > > > > > On Jan 24, 2017, at 14:40, Olga Kornievskaia <aglo@xxxxxxxx
> > > > > > u>
> > > > > > wrote:
> > > > > > 
> > > > > > On Tue, Jan 24, 2017 at 2:12 PM, Trond Myklebust
> > > > > > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > > > 
> > > > > > > > On Jan 24, 2017, at 14:06, Olga Kornievskaia <aglo@umic
> > > > > > > > h.
> > > > > > > > edu> wrote:
> > > > > > > > 
> > > > > > > > Hi Trond,
> > > > > > > > 
> > > > > > > > Is there a reason that nfs4_proc_reclaim_complete()
> > > > > > > > isn't  wrapped
> > > > > > > > with a do while() to handle errors?
> > > > > > > 
> > > > > > > What do we not already handle correctly in
> > > > > > > nfs4_reclaim_complete_done()?
> > > > > > 
> > > > > > Could this be because when an error occurs rpc_done isn't
> > > > > > called
> > > > > > (rpc_release is called)? What I see is that if
> > > > > > RECLAIM_COMPLETE gets
> > > > > > an error (BAD_SESSION) the client just ignores it.
> > > > > > 
> > > > > 
> > > > > That’s correct. Why do we need to handle BAD_SESSION there?
> > > > > We’re done with state recovery, so if the server rebooted, we
> > > > > can catch that later.
> > > > 
> > > > (1) don't we want to handle session errors as soon as possible?
> > > > (2) I ran into a problem (not sure yet if reproducible) where I
> > > > had a
> > > > client stuck in an infinite loop of RECLAIM_COMPLETE being sent
> > > > with
> > > > reply of BAD_SESSION.
> > > > 
> > > > yes I don't know why the client is looping but it made me look
> > > > into
> > > > the fact that we are not handling session errors on reclaim
> > > > complete
> > > > which I simulated by having the server return BAD_SESSION to
> > > > RECLAIM_COMPLETE and I see that client simply ignores it.
> > > 
> > > It doesn’t ignore it:
> > > 
> > > static int nfs41_reclaim_complete_handle_errors(struct rpc_task
> > > *task, struct nfs_client *clp)
> > > {
> > >         switch(task->tk_status) {
> > >         case 0:
> > >         case -NFS4ERR_COMPLETE_ALREADY:
> > >         case -NFS4ERR_WRONG_CRED: /* What to do here? */
> > >                 break;
> > >         case -NFS4ERR_DELAY:
> > >                 rpc_delay(task, NFS4_POLL_RETRY_MAX);
> > >                 /* fall through */
> > >         case -NFS4ERR_RETRY_UNCACHED_REP:
> > >                 return -EAGAIN;
> > >         default:
> > >                 nfs4_schedule_lease_recovery(clp);
> > >         }
> > >         return 0;
> > > }
> > > 
> > > IOW: what the code does is schedule another round of lease
> > > recovery.
> > 
> > We already agreed that this doesn't get called because
> > rpc_call_done
> > isn't called on the error.
> 
> What am I missing? Why wouldn't it get called?

Sorry. I see I caused that confusion. When I said "that is correct", I
was referring to the fact that the client ignores the error. All it
does in the case of those errors is to schedule recovery and then exit.

The client will always call rpc_call_done() provided that the RPC call
was initialised successfully.

-- 

Trond Myklebust
Principal System Architect
4300 El Camino Real | Suite 100
Los Altos, CA  94022
W: 650-422-3800
C: 801-921-4583 
www.primarydata.com

��.n��������+%������w��{.n�����{��w���jg��������ݢj����G�������j:+v���w�m������w�������h�����٥