Re: [PATCH v5 00/25] Fix delegation behaviour when server revokes some state


 



> On Sep 18, 2016, at 12:23, Oleg Drokin <green@xxxxxxxxxxxxxx> wrote:
> 
> 
> On Sep 17, 2016, at 9:19 PM, Oleg Drokin wrote:
> 
>> 
>> On Sep 17, 2016, at 5:59 PM, Trond Myklebust wrote:
>> 
>>> 
>>>> On Sep 17, 2016, at 17:55, Oleg Drokin <green@xxxxxxxxxxxxxx> wrote:
>>>> 
>>>> 
>>>> On Sep 17, 2016, at 5:45 PM, Trond Myklebust wrote:
>>>> 
>>>>> On Sat, 2016-09-17 at 15:32 -0400, Trond Myklebust wrote:
>>>>>> On Sat, 2016-09-17 at 15:16 -0400, Oleg Drokin wrote:
>>>>>>> 
>>>>>>> On Sep 17, 2016, at 2:18 PM, Trond Myklebust wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Sep 17, 2016, at 14:04, Oleg Drokin <green@xxxxxxxxxxxxxx>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Sep 17, 2016, at 1:13 AM, Trond Myklebust wrote:
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> According to RFC5661, if any of the SEQUENCE status bits
>>>>>>>>>> SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED,
>>>>>>>>>> SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED,
>>>>>>>>>> SEQ4_STATUS_ADMIN_STATE_REVOKED, or
>>>>>>>>>> SEQ4_STATUS_RECALLABLE_STATE_REVOKED are set, then we need to use
>>>>>>>>>> TEST_STATEID to figure out which stateids have been revoked, so we
>>>>>>>>>> can acknowledge the loss of state using FREE_STATEID.
>>>>>>>>>> 
>>>>>>>>>> While we already do this for open and lock state, we have not been
>>>>>>>>>> doing so for all the delegations.
>>>>>>>>>> 
>>>>>>>>>> v2: nfs_v4_2_minor_ops needs to set .test_and_free_expired too
>>>>>>>>>> v3: Now with added lock revoke fixes and close/delegreturn/locku fixes
>>>>>>>>>> v4: Close a bunch of corner cases
>>>>>>>>>> v5: Report revoked delegations as invalid in nfs_have_delegation()
>>>>>>>>>>     Fix an infinite loop in nfs_reap_expired_delegations.
>>>>>>>>>>     Fixes for other looping behaviour
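
(For anyone following the thread, the recovery flow described in the patch
text above amounts to roughly the sketch below. This is only an illustration,
not the kernel code from the series: test_stateid(), free_stateid() and
forget_delegation() are hypothetical stand-ins for the TEST_STATEID and
FREE_STATEID RPCs and for the client's delegation bookkeeping, and the flat
stateid array stands in for the real per-server delegation lists. The
SEQ4_STATUS_* values are the ones defined in RFC 5661, section 18.46.)

/*
 * Illustrative sketch only, not the patch itself.  Hypothetical helpers:
 * test_stateid(), free_stateid(), forget_delegation().
 */
#include <stddef.h>
#include <stdint.h>

/* Bit values as defined in RFC 5661, section 18.46. */
#define SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED   0x00000008
#define SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED  0x00000010
#define SEQ4_STATUS_ADMIN_STATE_REVOKED         0x00000020
#define SEQ4_STATUS_RECALLABLE_STATE_REVOKED    0x00000040

#define SEQ4_ANY_STATE_REVOKED \
	(SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED | \
	 SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED | \
	 SEQ4_STATUS_ADMIN_STATE_REVOKED | \
	 SEQ4_STATUS_RECALLABLE_STATE_REVOKED)

struct stateid4 {
	uint32_t seqid;
	uint8_t  other[12];
};

/* Hypothetical RPC wrappers: return 0 if the server says the stateid is
 * still valid, non-zero if it has been revoked or has expired. */
static int test_stateid(const struct stateid4 *sid) { (void)sid; return -1; }
static int free_stateid(const struct stateid4 *sid) { (void)sid; return 0; }
/* Hypothetical local bookkeeping: stop using this delegation. */
static void forget_delegation(struct stateid4 *sid) { (void)sid; }

/*
 * Called after a SEQUENCE reply.  If the server flagged revoked state,
 * probe every delegation stateid with TEST_STATEID and acknowledge the
 * dead ones with FREE_STATEID so the server can release them.
 */
static void reap_revoked_delegations(uint32_t sr_status_flags,
				     struct stateid4 *delegs, size_t ndelegs)
{
	size_t i;

	if (!(sr_status_flags & SEQ4_ANY_STATE_REVOKED))
		return;

	for (i = 0; i < ndelegs; i++) {
		if (test_stateid(&delegs[i]) == 0)
			continue;			/* still good */
		forget_delegation(&delegs[i]);		/* drop it locally */
		free_stateid(&delegs[i]);		/* tell the server */
	}
}

int main(void)
{
	struct stateid4 deleg = { .seqid = 1 };

	/* Pretend the server reported recallable state as revoked. */
	reap_revoked_delegations(SEQ4_STATUS_RECALLABLE_STATE_REVOKED,
				 &deleg, 1);
	return 0;
}

The delicate part, and what several of the v5 fixes above appear to be
chasing, is making sure this probe makes forward progress: a stateid the
server keeps flagging has to be freed and forgotten rather than retested
forever, otherwise the state manager loops.
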
>>>>>>>>> 
>>>>>>>>> This time around the loop seems to be tighter, in a userspace process:
>>>>>>>>> 
>>>>>>>>> [ 9197.256571] --> nfs41_call_sync_prepare data->seq_server ffff8800a73ce000
>>>>>>>>> [ 9197.256572] --> nfs41_setup_sequence
>>>>>>>>> [ 9197.256573] --> nfs4_alloc_slot used_slots=0000 highest_used=4294967295 max_slots=31
>>>>>>>>> [ 9197.256574] <-- nfs4_alloc_slot used_slots=0001 highest_used=0 slotid=0
>>>>>>>>> [ 9197.256574] <-- nfs41_setup_sequence slotid=0 seqid=14013800
>>>>>>>>> [ 9197.256582] encode_sequence: sessionid=1474126170:1:2:0 seqid=14013800 slotid=0 max_slotid=0 cache_this=1
>>>>>>>>> [ 9197.256755] --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=31
>>>>>>>>> [ 9197.256756] <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
>>>>>>>>> [ 9197.256757] nfs4_free_slot: slotid 1 highest_used_slotid 0
>>>>>>>>> [ 9197.256758] nfs41_sequence_process: Error 0 free the slot
>>>>>>>>> [ 9197.256760] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
>>>>>>>>> [ 9197.256779] --> nfs_put_client({2})
>>>>>>>> 
>>>>>>>> What operation is the userspace process hanging on? Do you have a
>>>>>>>> stack trace for it?
>>>>>>> 
>>>>>>> seems to be open_create->truncate->setattr coming from:
>>>>>>> cp /bin/sleep /mnt/nfs2/racer/12
>>>>>>> 
>>>>>>> (gdb) bt
>>>>>>> #0  nfs41_setup_sequence (session=0xffff88005a853800, args=0xffff8800a7253b80, res=0xffff8800a7253b48, task=0xffff8800b0eb0f00)
>>>>>>>     at /home/green/bk/linux-test/fs/nfs/nfs4proc.c:876
>>>>>>> #1  0xffffffff813a751c in nfs41_call_sync_prepare (task=<optimized out>, calldata=0xffff8800a7253b80)
>>>>>>>     at /home/green/bk/linux-test/fs/nfs/nfs4proc.c:966
>>>>>>> #2  0xffffffff8185c639 in rpc_prepare_task (task=<optimized out>) at /home/green/bk/linux-test/net/sunrpc/sched.c:683
>>>>>>> #3  0xffffffff8185f12b in __rpc_execute (task=0xffff88005a853800) at /home/green/bk/linux-test/net/sunrpc/sched.c:775
>>>>>>> #4  0xffffffff818617b4 in rpc_execute (task=0xffff88005a853800) at /home/green/bk/linux-test/net/sunrpc/sched.c:843
>>>>>>> #5  0xffffffff818539b9 in rpc_run_task (task_setup_data=0xffff8800a7253a50) at /home/green/bk/linux-test/net/sunrpc/clnt.c:1052
>>>>>>> #6  0xffffffff813a75e3 in nfs4_call_sync_sequence (clnt=<optimized out>, server=<optimized out>, msg=<optimized out>, args=<optimized out>, res=<optimized out>)
>>>>>>>     at /home/green/bk/linux-test/fs/nfs/nfs4proc.c:1051
>>>>>>> #7  0xffffffff813b4645 in nfs4_call_sync (cache_reply=<optimized out>, res=<optimized out>, args=<optimized out>, msg=<optimized out>, server=<optimized out>, clnt=<optimized out>)
>>>>>>>     at /home/green/bk/linux-test/fs/nfs/nfs4proc.c:1069
>>>>>>> #8  _nfs4_do_setattr (state=<optimized out>, cred=<optimized out>, res=<optimized out>, arg=<optimized out>, inode=<optimized out>)
>>>>>>>     at /home/green/bk/linux-test/fs/nfs/nfs4proc.c:2916
>>>>>>> #9  nfs4_do_setattr (inode=0xffff880079b152a8, cred=<optimized out>, fattr=<optimized out>, sattr=<optimized out>, state=0xffff880060588e00, ilabel=<optimized out>, olabel=0x0 <irq_stack_union>)
>>>>>>>     at /home/green/bk/linux-test/fs/nfs/nfs4proc.c:2955
>>>>>>> #10 0xffffffff813b4a16 in nfs4_proc_setattr (dentry=<optimized out>, fattr=0xffff8800a7253b80, sattr=0xffff8800a7253b48)
>>>>>>>     at /home/green/bk/linux-test/fs/nfs/nfs4proc.c:3684
>>>>>>> #11 0xffffffff8138f1cb in nfs_setattr (dentry=0xffff8800740c1000, 
>>>>>>> 
>>>>>> 
>>>>>> Cool! Does the following help?
>>>>> 
>>>>> Grrr... There is another bug there…
>>>> 
>>>> Is this in addition to the previous patch or instead of?
>>> 
>>> It can apply on top of it. The two patches fix different bugs.
>> 
>> Ok, it does not seem to have helped:
> 
> Of the 10 nodes where I run this, I had some more failures overnight,
> some of them different.
> 
> The other one I got was (I have 3 nodes in this state,
> only one in the one reported yesterday.):


I’ve been trying desperately to reproduce the issue, but I can’t make knfsd return or revoke the delegations with the “racer” script that you pointed me to (I’m using the stock knfsd from Linux 4.7.3). Is there anything else special about your setup that I should know about?





