Re: Recovery after BAD_SEQID

Benjamin Coddington <bcodding@xxxxxxxxxx> · Mon, 23 Mar 2015 10:27:13 -0400 (EDT)

On Mon, 23 Mar 2015, Trond Myklebust wrote:

> On Mon, Mar 23, 2015 at 5:15 AM, Benjamin Coddington
> <bcodding@xxxxxxxxxx> wrote:
> > On Sun, 22 Mar 2015, Trond Myklebust wrote:
> >
> >> On Thu, Mar 19, 2015 at 6:48 AM, Benjamin Coddington
> >> <bcodding@xxxxxxxxxx> wrote:
> >> > I wrote yesterday about a RHEL6 bug, but I'd gotten some details wrong about
> >> > the problem, so I'm starting a new thread.
> >> >
> >> > It looks like getting BAD_SEQID back from an OPEN operation drops the state_owner,
> >> > which means that the state machine can't find or recover any other objects
> >> > for that state_owner.  That can get the client into unrecoverable loops.  I
> >> > can produce one of them with:
> >> >
> >> > 1) OPEN file1, OPEN file2
> >> > 2) break the network for longer than the lease period
> >> > 3) during recovery, have the server return BAD_SEQID for one of the OPENs
> >> > 4) break the network again for longer than the lease period
> >> > 5) WRITE to the file that recovered properly in #3
> >> >
> >> > The client then gets stuck in a loop of WRITE returning NFS4ERR_EXPIRED.
> >> >
> >> > It looks like some cleanup is needed if we have to drop the whole
> >> > state_owner.  Alternatively, does it make sense to just drop the objects in
> >> > that sequence?
> >> >
> >> >
> >>
> >> Ummm... Why are you seeing BAD_SEQID in the first place? That specific
> >> error means that the client and server disagree on the sequencing of
> >> the OPENs, which means there is a bug either on the client or on the
> >> server.
> >
> > It definitely needs a server bug to get here, and unfortunately that server
> > bug is out there.  I'd like to have the client not get stuck when
> > encountering this bug.  Recovery here would mean that we return
> > EIO instead of getting stuck endlessly trying to complete a write for
> > another open file.
>
> We do _not_ fix server bugs on the client.

Yes, I understand and agree.

> > I wonder now what should be the position of the client upon "discovering"
> > there's a bug somewhere.  That bug could be client or server.  Should the
> > client blacklist the server at that point, or can other sequences continue?
>
> It should do its best to report the server as being buggy, if that is
> the case, and then make a limited effort to continue (the key word
> here being: "limited").

Then it sounds like failing any IO that depends upon unrecoverable state
might fall into that limited effort.  I'll see what I can do about that.
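
As a rough illustration of the idea, here is a toy userspace model in
Python. It is not kernel code: StateOwner, OpenState, recover_owner, write
and the reclaim_open callback are all made-up names for this sketch and do
not correspond to the client's real data structures. It assumes that once
an OPEN reclaim returns BAD_SEQID the owner is given up on, and that every
state it held is flagged so later IO fails with EIO instead of retrying.

```python
import errno

NFS4_OK = 0
NFS4ERR_BAD_SEQID = 10026  # numeric value as defined for NFSv4 (RFC 7530)


class OpenState:
    """Hypothetical stand-in for the state of one open file."""

    def __init__(self, name):
        self.name = name
        self.recoverable = True


class StateOwner:
    """Hypothetical stand-in for a state_owner holding several opens."""

    def __init__(self, names):
        self.states = {n: OpenState(n) for n in names}


def recover_owner(owner, reclaim_open):
    """Try to reclaim every open held by this owner.

    reclaim_open(name) is an assumed callback returning an NFSv4 status.
    If any reclaim returns BAD_SEQID, the owner is treated as dropped and
    ALL of its states are flagged unrecoverable, including ones that were
    reclaimed fine. Returns True if the owner was dropped.
    """
    for state in owner.states.values():
        if reclaim_open(state.name) == NFS4ERR_BAD_SEQID:
            for s in owner.states.values():
                s.recoverable = False
            return True
    return False


def write(owner, name, data):
    """Fail with EIO instead of retrying forever against dead state."""
    state = owner.states[name]
    if not state.recoverable:
        raise OSError(errno.EIO, "open state for %s is unrecoverable" % name)
    return len(data)  # pretend the WRITE succeeded
```

With two opens under one owner and BAD_SEQID returned for file1, a later
write to file2 raises OSError with errno EIO in this model, rather than
looping on NFS4ERR_EXPIRED.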

Thanks Trond.

Ben