On Sat, Mar 16, 2019 at 10:11 AM Jiufei Xue <jiufei.xue@xxxxxxxxxxxxxxxxx> wrote:
>
>
> Hi Olga,
> On 2019/3/16 4:33 AM, Olga Kornievskaia wrote:
> > On Fri, Mar 15, 2019 at 2:31 AM Jiufei Xue <jiufei.xue@xxxxxxxxxxxxxxxxx> wrote:
> >>
> >> Hi Olga,
> >>
> >> On 2019/3/11 11:13 PM, Olga Kornievskaia wrote:
> >>> Let me double check that. I have reproduced the "infinite loop" of
> >>> CLOSE on the upstream (I'm looking thru the trace points from Friday).
> >>
> >> Did you try to capture the packets when you reproduced this issue on
> >> the upstream? I still lose packets in the kernel after some adjustment
> >> according to bfield's suggestion :(
> >
> > Hi Jiufei,
> >
> > Yes I have network trace captures but they are too big to post to the
> > mailing list. I have reproduced the problem on the latest upstream
> > origin/testing branch commit "SUNRPC: Take the transport send lock
> > before binding+connecting". As you have noted before, the infinite loop
> > is due to the client "losing" an update to the seqid.
> >
> > One packet would send out a (recovery) OPEN with slot=0 seqid=Y. The
> > tracepoint (nfs4_open_file) would log that status=ERESTARTSYS. The rpc
> > task would be sent and the rpc task would receive a reply but there is
> > nobody there to receive it... This open that got a reply has an
> > updated stateid seqid which the client never picks up. When CLOSE is
> > sent, it's sent with the "old" stateid and puts the client in an
> > infinite loop. Btw, CLOSE is sent on the interrupted slot which should
> > get FALSE_RETRY which causes the client to terminate the session. But
> > it would still keep sending the CLOSE with the old stateid.
> >
> > One thing I've noticed is that the TEST_STATE op (as a part of
> > nfs41_test_and_free_expired_stateid()) for some reason always has a
> > signal set even before issuing an RPC task, so the task never
> > completes (ever).
> >
> > I always thought that OPENs can't be interrupted but I guess they are
> > since they call rpc_wait_for_completion_task() and that's a killable
> > event. But I don't know how to find out what's sending a signal to the
> > process. I'm rather stuck here trying to figure out where to go from
> > there. So I'm still trying to figure out what's causing the signal, and
> > also how to recover from it so that the client doesn't lose that seqid.
> >
> >>
>
> Thank you for your quick reply.
>
> I have also noticed the ERESTARTSYS status for the OPEN op, but I think
> it is returned by the open process which is woken in nfs4_run_open_task().
> I found that the decode routine nfs4_xdr_dec_open returned -121, which
> I thought is the root cause of this problem. Could you please post the
> content of the last OPEN message?

Hi Jiufei,

Yes, I think that's why the update isn't happening: the rpc_status
isn't 0.

Trond,

The rpc tasks that were interrupted but are still finishing are not able
to succeed, because when they try to decode_sequence the res->st_slot is
NULL. The SEQUENCE op is not decoded, and then when the task tries to
decode the PUTFH it throws "unexpected op" (expecting PUTFH but the SEQ
is still there instead).

res->st_slot is going away because after the open(s) were interrupted
and _nfs4_proc_open() returned an error (interrupted), the caller goes
and frees the slot.

Is it perhaps appropriate to free the slot there only if
(!data->cancelled)? Otherwise, let the async RPC task continue, and once
it's done it'll free the slot.

How's this for a proposed fix:

Subject: [PATCH 1/1] NFSv4.1 don't free interrupted slot on open

Allow the async rpc task to finish and update the open state if needed,
then free the slot. Otherwise, the async rpc task is unable to decode
the reply.

Signed-off-by: Olga Kornievskaia <kolga@xxxxxxxxxx>
---
 fs/nfs/nfs4proc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 4dbb0ee..96c2499 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -2933,7 +2933,8 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
 	}
 out:
-	nfs4_sequence_free_slot(&opendata->o_res.seq_res);
+	if (!opendata->cancelled)
+		nfs4_sequence_free_slot(&opendata->o_res.seq_res);
 	return ret;
 }

>
> Thanks,
> Jiufei.
>
> >
> >> Thanks,
> >> Jiufei
> >
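
For anyone following the thread, here is a small userspace sketch of the
slot lifetime problem described above. It is only a toy model under
stated assumptions, not kernel code: every toy_* name is hypothetical,
the "decode" step merely checks whether a slot is still attached, and
-121 is simply the value Jiufei reported from nfs4_xdr_dec_open. It
shows an interrupted open followed by a late reply, with and without the
proposed !cancelled guard.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the kernel's structures. */
struct toy_slot { unsigned int seq_nr; };
struct toy_seq_res { struct toy_slot *slot; };

struct toy_opendata {
	struct toy_seq_res seq_res;
	bool cancelled;             /* the waiting caller was interrupted */
	unsigned int stateid_seqid; /* open stateid seqid known to the client */
};

/* Value reported in the thread for the failed decode. */
#define TOY_DECODE_ERR (-121)

enum toy_op { TOY_OP_SEQUENCE, TOY_OP_PUTFH };

/* Assumed model of decoding: SEQUENCE is consumed only if a slot is attached. */
static int toy_decode_reply(struct toy_opendata *d, unsigned int reply_seqid)
{
	enum toy_op next = TOY_OP_SEQUENCE; /* first op in the reply */

	if (d->seq_res.slot != NULL)
		next = TOY_OP_PUTFH;        /* SEQUENCE was decoded */

	if (next != TOY_OP_PUTFH)
		return TOY_DECODE_ERR;      /* "unexpected op": SEQ still pending */

	d->stateid_seqid = reply_seqid;     /* the update the client must not lose */
	return 0;
}

static void toy_free_slot(struct toy_seq_res *res)
{
	res->slot = NULL;
}

/* Caller-side exit path; guard == true models the proposed patch. */
static void toy_open_exit(struct toy_opendata *d, bool guard)
{
	if (!guard || !d->cancelled)
		toy_free_slot(&d->seq_res);
}

/* Interrupted open followed by a late reply; returns the decode status. */
static int toy_interrupted_open(bool guard, unsigned int *seqid_out)
{
	struct toy_slot slot = { .seq_nr = 1 };
	struct toy_opendata d = {
		.seq_res = { .slot = &slot },
		.cancelled = true,
		.stateid_seqid = 1,
	};
	int status;

	assert(seqid_out != NULL);

	toy_open_exit(&d, guard);           /* caller returns after the signal */
	status = toy_decode_reply(&d, 2);   /* async task handles the reply */
	toy_free_slot(&d.seq_res);          /* async completion releases the slot */

	*seqid_out = d.stateid_seqid;
	return status;
}
```

In this model, toy_interrupted_open(false, &seqid) returns -121 and
leaves seqid at 1 (the update is lost), while
toy_interrupted_open(true, &seqid) returns 0 and sets seqid to 2. That
matches the reasoning in the message above, but the real decode and slot
handling in fs/nfs are considerably more involved.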