Re: reuse of slot and seq# when RPC was interrupted

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Fri, 23 Sep 2016 19:57:35 +0000



> On Sep 23, 2016, at 15:27, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
> 
> On Fri, Sep 23, 2016 at 3:07 PM, Trond Myklebust
> <trondmy@xxxxxxxxxxxxxxx> wrote:
>> 
>>> On Sep 23, 2016, at 14:41, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>> 
>>> On Fri, Sep 23, 2016 at 2:34 PM, Trond Myklebust
>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>> 
>>>>> On Sep 23, 2016, at 14:25, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>> 
>>>>> On Fri, Sep 23, 2016 at 2:08 PM, Trond Myklebust
>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>> 
>>>>>>> On Sep 23, 2016, at 13:59, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>>>> 
>>>>>>> On Fri, Sep 23, 2016 at 1:45 PM, Trond Myklebust
>>>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>>>> 
>>>>>>>>> On Sep 23, 2016, at 13:40, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>>>>>> 
>>>>>>>>> If we instead bump the sequence number in the case of interrupted and do:
>>>>>>>> 
>>>>>>>> You have no guarantees that the server has seen and processed the operation.
>>>>>>> 
>>>>>>> That is correct, i have tested the patch and made server never to
>>>>>>> receive the operation and client have an interrupted slot. On the next
>>>>>>> operation the server will complain back with SEQ_MISORDERED. Client
>>>>>>> can recover from this operation. Client can not recover from "Remote
>>>>>>> EIO”.
>>>>>>> 
>>>>>> 
>>>>>> Why not?
>>>>> 
>>>>> When XDR layer returns EREMOTEIO it's not handled by the NFS error
>>>>> recovery (are you suggesting we should?)  and returns that to the
>>>>> application.
>>>>> 
>>>> 
>>>> I’m saying that if we get a SEQ_MISORDERED due to a previous interrupt on that slot, then we should ignore the error in task->tk_status, and just retry after bumping the slot seqid.
>>>> 
>>> 
>>> I'm confused where your objection lies. Are you ok with bumping the
>>> sequence # when task->tk_status = 1 and saying that we should still
>>> keep the code that I deleted in the 2nd chunk of the patch that bumped
>>> the seqid on getting SEQ_MISORDERED due to a previously interrupted
>>> slot?
>>> Wouldn't that create a difference of 2 slots for the server that has
>>> received the original request?
>>> 
>> 
>> I’m saying I’d prefer to keep the current code, but fix the retry that is apparently broken. If we’re not ignoring the task->tk_error when we decide to retry, then that’s a bug in my opinion.
> 
> I'm not understand what you are suggestion. I do better with example
> so allow me:
> 
> REMOVE used slot 0 seq=00000036 received ctrl-c
> nfs41_sequence_done() gets called task->tk_status = 1:
> slot->interrupted is set to 1. slot is freed.
> 
> next operation comes in, in my case it's ACCESS. initialization of the
> sequence uses slot 0 seq=00000036
> server replies with REMOVE
> 
> client code xdr in decode_op_hrs() returns EREMOTEIO. decode_access()
> returns EREMOTEIO. handle error just returns that error.
> 
> where do we retry?
> 

The retry should be happening when we exit from nfs41_sequence_done() by restarting the RPC.��.n��������+%������w��{.n�����{��w���jg��������ݢj����G�������j:+v���w�m������w�������h�����٥