RE: NFS4ERR_STALE_CLIENTID loop

"Myklebust, Trond" <Trond.Myklebust@xxxxxxxxxx> · Sat, 29 Oct 2011 13:42:34 -0700

> -----Original Message-----
> From: David Flynn [mailto:davidf@xxxxxxxxxxxx]
> Sent: Saturday, October 29, 2011 9:53 PM
> To: Myklebust, Trond
> Cc: Chuck Lever; J. Bruce Fields; David Flynn; linux-nfs@xxxxxxxxxxxxxxx
> Subject: Re: NFS4ERR_STALE_CLIENTID loop
> 
> * Myklebust, Trond (Trond.Myklebust@xxxxxxxxxx) wrote:
> > > -----Original Message-----
> > > From: Chuck Lever [mailto:chuck.lever@xxxxxxxxxx]
> > > On Oct 29, 2011, at 2:47 PM, J. Bruce Fields wrote:
> > > > Yes, and it's not something I care that strongly about, really, my
> > > > only observation is that this sort of failure (an implementation
> > > > bug on one side or another resulting in a loop) seems to have been
> > > > common (based on no hard data, just my vague memories of list
> > > > threads), and the results fairly obnoxious (possibly even for
> > > > unrelated hosts on the network).
> > > > So if there's some simple way to fail more gracefully it might be
> > > > helpful.
> > >
> > > For what it's worth, I agree that client implementations should
> > > attempt to behave more gracefully in the face of server problems, be
> > > they the result of bugs or the result of other issues specific to
> > > that server.  Problems like this make NFSv4 as a protocol look bad.
> >
> > I can't see what a client can do in this situation except possibly
> > just give up after a while and throw a SERVER_BROKEN error (which
> > means data loss). That still won't make NFSv4 look good...
> 
> Indeed, it is quite the dilemma.
> 
> I agree that giving up and guaranteeing unattended data loss is bad (data
> loss at the behest of an operator is OK; after all, they can always fence
> a broken machine).
> 
> Looking at some of the logs again, even going back to the very original
> case, it appears to be about 600us between retries (RTT=400us).  Is there
> any way to make that less aggressive, e.g. 1s?  That would reduce the
> impact by three orders of magnitude.  What would be the downside?  How
> often do you expect to get a BAD_STATEID error?

BAD_STATEID is a different matter, and is one that we should have
resolved in the NFS client in the upstream kernel. At least on newer
clients, we should be trying to reopen the file and re-establish all
locks when we get a BAD_STATEID. Can you please remind us which kernel
you are using?

That said... Even on new clients, the recovery attempt may fail due to
the STALE_CLIENTID bug. That will still hit us when we call OPEN in
order to get a new stateid.
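
To put numbers on the retry-interval question quoted above, here is a
rough sketch, not what the kernel client actually does. The 600us and 1s
figures come from the thread; the doubling factor, the attempt count and
the function names are assumptions made up purely for illustration.

```python
def retry_delays(initial=0.0006, cap=1.0, factor=2.0, attempts=15):
    """Yield successive retry delays in seconds.

    Starts at `initial` (600us, the interval seen in the logs) and
    multiplies by `factor` after each attempt, never exceeding `cap`
    (the 1s interval suggested in the thread). The factor and the
    attempt count are illustrative assumptions.
    """
    delay = initial
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def load_reduction(current_interval=0.0006, proposed_interval=1.0):
    """Ratio of retry rates: how many times fewer retries per second
    the server sees if the interval grows from current to proposed."""
    return proposed_interval / current_interval


if __name__ == "__main__":
    for attempt, delay in enumerate(retry_delays(), start=1):
        print(f"attempt {attempt:2d}: wait {delay:.4f}s")
    print(f"fixed 1s interval cuts retry rate by ~{load_reduction():.0f}x")
```

A capped backoff like this keeps the first few retries fast, so a
transient error still recovers quickly, while a persistent loop settles
at roughly one request per second instead of well over a thousand.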

Cheers
  Trond