Recovery after BAD_SEQID

Benjamin Coddington <bcodding@xxxxxxxxxx> · Thu, 19 Mar 2015 06:48:47 -0400 (EDT)

I wrote yesterday about a RHEL6 bug, but I'd gotten some details wrong about
the problem, so I'm starting a new thread.

It looks like getting BAD_SEQID back from an OPEN operation drops the
state_owner, which means that the state machine can't find or recover any
other objects for that state_owner.  That can get the client into
unrecoverable loops.  I can reproduce one of them with:

1) OPEN file1, OPEN file2
2) break the network for longer than the lease period
3) during recovery, have the server return BAD_SEQID for one of the OPENs
4) break the network again for longer than the lease period
5) WRITE to the file that recovered properly in #3

The client then gets stuck in a loop of WRITE returning NFS4ERR_EXPIRED.
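
The sequence above can be modelled with a toy sketch.  This is not kernel
code: the Client class, its methods, and the "so1" owner name are all made up
for illustration, and it assumes that BAD_SEQID on one OPEN discards the whole
state_owner along with every open state hanging off it.

```python
# Toy model only; none of these names are the kernel's.
class Client:
    def __init__(self):
        # state_owner name -> set of files with open state under that owner
        self.owners = {}

    def open(self, owner, path):
        self.owners.setdefault(owner, set()).add(path)

    def drop_owner(self, owner):
        # Assumed reaction to BAD_SEQID on an OPEN: the whole owner goes
        # away, so state for every file under it becomes unreachable.
        return self.owners.pop(owner, set())

    def recoverable(self):
        # Lease recovery can only walk state that still hangs off an owner.
        return {p for files in self.owners.values() for p in files}


def reproduce():
    c = Client()
    c.open("so1", "file1")   # step 1
    c.open("so1", "file2")
    # Step 3: file1 is recovered, but the OPEN for file2 gets BAD_SEQID.
    orphaned = c.drop_owner("so1")
    # Step 4: the lease expires again; recovery sees only what it can find.
    return orphaned, c.recoverable()
```

In this model reproduce() leaves file1 in the orphaned set and nothing in the
recoverable set, which would match a WRITE to file1 never getting past
NFS4ERR_EXPIRED because nothing re-establishes its state.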

It looks like some cleanup is needed if we have to drop the whole
state_owner.  Alternatively, does it make sense to just drop the objects in
that sequence?

Ben