Re: once again problems with interrupted slots

Olga Kornievskaia <aglo@xxxxxxxxx> · Fri, 5 Jun 2020 09:24:11 -0400

On Fri, Jun 5, 2020 at 8:06 AM Tom Talpey <tom@xxxxxxxxxx> wrote:
>
> On 6/4/2020 5:21 PM, Olga Kornievskaia wrote:
> > Hi Trond,
> >
> > There is a problem with interrupted slots (yet again).
> >
> > We send an operation to the server and it gets interrupted by a signal.
> >
> > We used to send a sole SEQUENCE to avoid having a real operation get
> > a reply out of the cache and fail. We no longer do that (since
> > 3453d5708 "NFSv4.1: Avoid false retries when RPC calls are
> > interrupted"). So the problem is:
> >
> > We bump the sequence on the next use of the slot, and get SEQ_MISORDERED.
>
> Misordered? It sounds like the client isn't managing the sequence
> number, or perhaps the server never saw the original request, and
> is being overly strict.

Well, both the client and the server are acting appropriately. I'm
not arguing against bumping the sequence. Say the client sent a REMOVE
with slot=1 seq=5 which got interrupted, so the client doesn't know what
state the slot was left in. It then sends the next operation, say a
READ, with slot=1 seq=6. The server acts appropriately too: as its
version of the slot has seq=4, this request with seq=6 gets
SEQ_MISORDERED.
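
To make the slot check concrete, here is a rough sketch in Python of
the comparison the server does on each request. It is only a model,
not kernel or server code: the Slot class and check_sequence() are
names made up for illustration, and a real server may additionally
detect a mismatched retry and return NFS4ERR_SEQ_FALSE_RETRY.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    """Hypothetical server-side view of one session slot."""
    seq: int                          # sequence ID of the last executed request
    cached_op: Optional[str] = None   # operation whose reply sits in the cache


def check_sequence(slot: Slot, seq: int, op: str) -> str:
    """Classify an incoming request against the slot's sequence ID."""
    if seq == slot.seq + 1:
        # Next expected sequence ID: a new request. Execute and cache it.
        slot.seq = seq
        slot.cached_op = op
        return "NFS4_OK"
    if seq == slot.seq:
        # Same sequence ID as the last request: treated as a retransmission
        # and answered from the reply cache.
        return "REPLAY"
    # Anything else is neither the next request nor a retry.
    return "NFS4ERR_SEQ_MISORDERED"


# The case above: the server's slot is at seq=4, the READ arrives with seq=6.
slot = Slot(seq=4)
print(check_sequence(slot, 6, "READ"))
```

The rejected request leaves the slot untouched, so the server is still
waiting for seq=5 afterwards.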

> > We decrement the number back to the interrupted operation. This gets
> > us a reply out of the cache. We again fail with REMOTE EIO error.
>
> Ew. The client *decrements* the sequence?

Yes, as the client then decides that the server never received the
seq=5 operation, so it re-sends the READ with seq=5. But in reality
the seq=5 REMOVE also reached the server, so it now has 2 requests,
REMOVE and READ, both with seq=5 for slot=1. This leads to the READ
failing with some error.

We used to send a sole SEQUENCE when we had an interrupted slot, to
sync up the seq numbers. But commit 3453d5708 changed that and I would
like to understand why, as I think we need to go back to sending a
sole SEQUENCE.
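
As a rough illustration of why I think the sole SEQUENCE helps, here
is a toy model in Python comparing the two behaviours. Everything in
it is an assumption made for the example (the Slot class, handle(),
client_call(), and using "EREMOTEIO" to stand for the failure we see);
it is not the client code, and it ignores the server possibly
returning NFS4ERR_SEQ_FALSE_RETRY.

```python
class Slot:
    """Hypothetical server-side slot: last sequence ID plus the cached op."""

    def __init__(self, seq):
        self.seq = seq
        self.cached_op = None

    def handle(self, seq, op):
        if seq == self.seq + 1:
            # New request: execute it and cache its reply.
            self.seq, self.cached_op = seq, op
            return ("NFS4_OK", op)
        if seq == self.seq:
            # Same sequence ID: answer from the reply cache.
            return ("NFS4_OK", self.cached_op)
        return ("NFS4ERR_SEQ_MISORDERED", None)


def client_call(slot, seq, op):
    """Send op; fail if the reply belongs to a different operation."""
    status, replied_op = slot.handle(seq, op)
    if status == "NFS4_OK" and replied_op != op:
        return "EREMOTEIO"   # e.g. a cached REMOVE reply handed to a READ
    return status


# No resync: the interrupted REMOVE (seq=5) did reach the server, and the
# client re-uses seq=5 for the READ.
server = Slot(seq=4)
server.handle(5, "REMOVE")
no_resync = client_call(server, 5, "READ")

# Sole SEQUENCE first: the probe takes the cached reply (which the client
# throws away), and the READ then goes out with the next sequence ID.
server = Slot(seq=4)
server.handle(5, "REMOVE")
server.handle(5, "SEQUENCE")
with_resync = client_call(server, 6, "READ")

print(no_resync, with_resync)
```

In this model the probe is also harmless when the REMOVE never arrived:
the SEQUENCE is then executed as a new request with seq=5 and the READ
with seq=6 still succeeds.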

> Tom.
>
> > Going back to the commit's message. I don't see the logic that the
> > server can't tell if this is a new call or the old one. We used to
> > send a lone SEQUENCE as a way to protect reuse of slot by a normal
> > operation. An interrupted slot couldn't have been another SEQUENCE. So
> > I don't see how the server can't tell the difference between SEQUENCE
> > and any other operation.
> >
> >