Re: once again problems with interrupted slots

On 6/5/2020 9:24 AM, Olga Kornievskaia wrote:
On Fri, Jun 5, 2020 at 8:06 AM Tom Talpey <tom@xxxxxxxxxx> wrote:

On 6/4/2020 5:21 PM, Olga Kornievskaia wrote:
Hi Trond,

There is a problem with interrupted slots (yet again).

We send an operation to the server and it gets interrupted by a signal.

We used to send a sole SEQUENCE to avoid the problem of a real
operation getting a reply out of the cache and failing. We no longer
do that (since 3453d5708 "NFSv4.1: Avoid false retries when RPC
calls are interrupted"). So the problem is:

We bump the sequence on the next use of the slot, and get SEQ_MISORDERED.

Misordered? It sounds like the client isn't managing the sequence
number, or perhaps the server never saw the original request, and
is being overly strict.

Well, both the client and the server are acting appropriately. I'm
not arguing against bumping the sequence. The client sent, say, REMOVE
with slot=1 seq=5, which got interrupted. So the client doesn't know
in what state the slot was left. So it sends the next operation, say
READ, with slot=1 seq=6. The server acts appropriately too: as its
version of the slot has seq=4, this request with seq=6 gets
SEQ_MISORDERED.
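
The slot check described above can be sketched as a toy model. This is a minimal illustration, assuming the usual NFSv4.1 rule that a slot accepts only its last sequence ID (a replay) or that ID plus one (a new request). The names `Slot` and `check_slot` are hypothetical and are not taken from any client or server code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    """Toy server-side slot state: last accepted seqid and its cached reply."""
    seqid: int
    cached_reply: Optional[str] = None


def check_slot(slot: Slot, seqid: int) -> str:
    """Classify an incoming seqid against the slot's current state."""
    if seqid == slot.seqid + 1:
        return "NEW"             # next in order: execute and cache the reply
    if seqid == slot.seqid:
        return "REPLAY"          # same as last: answer from the reply cache
    return "SEQ_MISORDERED"      # anything else is rejected


# The server's slot is still at seq=4: it has not seen the interrupted REMOVE.
slot = Slot(seqid=4)
result_read = check_slot(slot, 6)     # READ sent with the bumped seqid
result_remove = check_slot(slot, 5)   # what the interrupted REMOVE would get
```

With the server at seq=4, the bumped seq=6 is rejected as misordered, while seq=5 would still be accepted as a new request.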

Wait, if the client sent slot=1 seq=5, then unless the connection
breaks, that slot is at seq=5, simple as that. If the operation was
interrupted before sending the request, then the sequence should
not be bumped.

We decrement the number back to that of the interrupted operation.
This gets us a reply out of the cache. We again fail, with an
EREMOTEIO error.

Ew. The client *decrements* the sequence?

Yes, as the client then decides that the server never received the
seq=5 operation, so it re-sends with seq=5. But in reality the seq=5
operation also reached the server, so it has two requests, REMOVE and
READ, both with seq=5 for slot=1. This leads to READ failing with
some error.
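
The sequence of events above can be replayed against a toy single-slot server. This is only a sketch under two assumptions that the thread does not state outright: the interrupted REMOVE reaches the server after the READ with seq=6 was rejected, and the server answers a matching slot+seq from its reply cache without comparing the operation. `SlotServer` and its methods are hypothetical names.

```python
class SlotServer:
    """Toy single-slot server with a one-entry reply cache (hypothetical)."""

    def __init__(self, seqid):
        self.seqid = seqid      # last sequence ID accepted on this slot
        self.cached = None      # reply to the last executed request

    def call(self, seqid, op):
        if seqid == self.seqid + 1:
            # New request: execute it and remember the reply.
            self.seqid = seqid
            self.cached = f"{op} done"
            return self.cached
        if seqid == self.seqid and self.cached is not None:
            # Same slot+seq as before: treated as a retry, served from cache.
            return self.cached
        return "SEQ_MISORDERED"


server = SlotServer(seqid=4)
first = server.call(6, "READ")     # bumped seqid overtakes the REMOVE
late = server.call(5, "REMOVE")    # the interrupted REMOVE arrives after all
retry = server.call(5, "READ")     # client decremented and re-sent
```

In this model the re-sent READ receives the cached REMOVE reply, which is the mismatch the client then fails on.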

But if the connection didn't break, it's reliable, and therefore the
"resend" must not be performed. This is a new operation, not a retry.
It cannot use the same slot+seq pair. And decrementing the slot's seq
is even sillier: it's reusing *two* seq's at that point.

We used to send a sole SEQUENCE when we had an interrupted slot, to
sync up the seq numbers. But commit 3453d5708 changed that and I
would like to understand why, as I think we need to go back to
sending a sole SEQUENCE.

Sounds like a hack, frankly. What if the server responds the same
way? The client will start burning the wire.

Closing the connection, or never using that slot again, seems to
me the only correct option, given the other behavior described.

Tom.


Going back to the commit's message: I don't see the logic that the
server can't tell if this is a new call or the old one. We used to
send a lone SEQUENCE as a way to protect against reuse of the slot by
a normal operation. The operation on an interrupted slot couldn't
have been another SEQUENCE, so I don't see how the server can't tell
the difference between a SEQUENCE and any other operation.
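
The argument for the lone SEQUENCE can be sketched as follows. This is a toy model of the reasoning in this thread, not the kernel's code: it assumes the probe is sent with the interrupted seqid and that its reply is always discarded, so it does not matter whether the server executes it fresh or answers it from the reply cache. `resync_slot` is a hypothetical helper.

```python
from typing import Tuple


def resync_slot(server_seqid: int, interrupted_seqid: int) -> Tuple[int, int]:
    """Return (server seqid after the probe, seqid the client uses next)."""
    if interrupted_seqid == server_seqid + 1:
        # Server never saw the interrupted call: the probe counts as new.
        server_seqid = interrupted_seqid
    elif interrupted_seqid != server_seqid:
        # More than one call out of step: a single probe cannot fix this.
        raise ValueError("slot is out of sync by more than one call")
    # Otherwise the server already executed the interrupted call and the
    # probe is answered from the cache; the reply is thrown away either way.
    return server_seqid, interrupted_seqid + 1


never_seen = resync_slot(4, 5)     # interrupted REMOVE never arrived
already_seen = resync_slot(5, 5)   # interrupted REMOVE was executed
```

In both cases the client and server agree on seq=5 afterwards, and the next real operation goes out with seq=6.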






