Re: Reoccurring 5 second delays during NFS calls

Benjamin Coddington <bcodding@xxxxxxxxxx> · Wed, 22 Feb 2023 14:43:34 -0500

On 22 Feb 2023, at 7:48, Florian Möller wrote:

> Am 22.02.23 um 13:22 schrieb Jeff Layton:
>> On Wed, 2023-02-22 at 06:54 -0500, Benjamin Coddington wrote:
>>> On 22 Feb 2023, at 3:19, Florian Möller wrote:
>>>
>>>> Am 21.02.23 um 19:58 schrieb Benjamin Coddington:
>>>>> On 21 Feb 2023, at 11:52, Florian Möller wrote:
>>>>>
>>>>>> Hi Benjamin,
>>>>>>
>>>>>> here are the trace and a listing of the corresponding network packages. If the listing is not detailed enough, I can send you a full package dump tomorrow.
>>>>>>
>>>>>> The command I used was
>>>>>>
>>>>>> touch test.txt && sleep 2 && touch test.txt
>>>>>>
>>>>>> test.txt did not exist previously. So you have an example of a touch without and with delay.
>>>>>
>>>>> Thanks!  These are great - I can see from them that the client is indeed
>>>>> waiting in the stateid update mechanism because the server has returned
>>>>> NFS4ERR_STALE to the client's first CLOSE.
>>>>>
>>>>> That is unusual.  The server is signaling that the open file's stateid is old,
>>>>> so I am interested to see if the first CLOSE is sent with the stateid's
>>>>> sequence that was returned from the server.  I could probably see this if I
>>>>> had the server-side tracepoint data.
>>>>
>>>> Hi Benjamin,
>>>>
>>>> the server-side tracepoints
>>>>
>>>> nfsd:nfsd_preprocess
>>>> sunrpc:svc_process
>>>>
>>>> were enabled. It seems they did not produce any output.
>>>>
>>>> What I did now was:
>>>> - enable all nfsd tracepoints,
>>>> - enable all nfs4 tracepoints,
>>>> - enable all sunrpc tracepoints.
>>>>
>>>> The command I used was
>>>>
>>>> touch somefile && sleep 2 && touch somefile.
>>>>
>>>> Then I unmounted the NFS share - this also causes a delay.
>>>>
>>>> I changed the security type to krb5 and captured trace and network output for a version 4.0 and a version 4.2 mount. The delay does not occur when using version 4.0.
>>>
>>>
>>> In frame 9 of nfs-v4.2-krb5.pcap, the server responds to PUTFH with
>>> NFS4ERR_STALE, so nothing to do with the open stateid sequencing.  I also
>>> see:
>>>
>>> nfsd-1693    [000] .....  1951.353889: nfsd_exp_find_key: fsid=1::{0x0,0xe5fcf0,0xffffc900,0x811e87a3,0xffffffff,0xe5fd00} domain=gss/krb5i status=-2
>>> nfsd-1693    [000] .....  1951.353889: nfsd_set_fh_dentry_badexport: xid=0xe1511810 fh_hash=0x3f9e713a status=-2
>>> nfsd-1693    [000] .....  1951.353890: nfsd_compound_status: op=2/4 OP_PUTFH status=70
>>>
>>
>> This just means that the kernel called into the "cache" infrastructure
>> to find an export entry, and there wasn't one.
>>
>>
>> Looking back at the original email here, I'd say this is expected since
>> the export wasn't set up for krb5i.
>>
>> Output of exportfs -v:
>> /export
>> gss/krb5p(async,wdelay,hide,crossmnt,no_subtree_check,fsid=0,sec=sys,rw,secure,root_squash,no_all_squash)
>> /export
>> gss/krb5(async,wdelay,hide,crossmnt,no_subtree_check,fsid=0,sec=sys,rw,secure,root_squash,no_all_squash)
>
> I changed the export definitions to
>
> /export gss/krb5(async,wdelay,hide,crossmnt,no_subtree_check,fsid=0,sec=sys,rw,secure,root_squash,no_all_squash)
> /export gss/krb5i(async,wdelay,hide,crossmnt,no_subtree_check,fsid=0,sec=sys,rw,secure,root_squash,no_all_squash)
> /export gss/krb5p(async,wdelay,hide,crossmnt,no_subtree_check,fsid=0,sec=sys,rw,secure,root_squash,no_all_squash)
>
> such that all three kerberos security types are exported. With this setup the delay is gone. Without the krb5i export the delay occurs again.
> Mounting the NFS share with different kerberos security types does not change this behaviour - without the krb5i export there is a delay.

After getting the ol' KDC back in shape and trying to reproduce this, I
noticed you're using a client specifier that's been deprecated (the
gss/krb5), along with sec=sys in the same line.  That's a little weird, not
sure what the results of that might be.

I can't even get any mount to work with that:

[root@fedora ~]# exportfs -v
/exports      	gss/krb5(sync,wdelay,hide,no_subtree_check,nordirplus,fsid=0,sec=sys,rw,insecure,no_root_squash,no_all_squash)
/exports      	gss/krb5p(sync,wdelay,hide,no_subtree_check,nordirplus,fsid=0,sec=sys,rw,insecure,no_root_squash,no_all_squash)
[root@fedora ~]# mount -t nfs -ov4.1,sec=krb5 fedora.fh.bcodding.com:/exports /mnt/fedora
mount.nfs: mounting fedora.fh.bcodding.com:/exports failed, reason given by server: No such file or directory

However, if I use the "sec=" method, everything seems to work fine:

[root@fedora ~]# exportfs -v
/exports      	10.0.1.0/24(sync,wdelay,hide,no_subtree_check,nordirplus,sec=sys:krb5:krb5p,rw,insecure,no_root_squash,no_all_squash)
[root@fedora ~]# mount -t nfs -ov4.1,sec=krb5 fedora.fh.bcodding.com:/exports /mnt/fedora
[root@fedora ~]# touch /mnt/fedora/krbfoo
[root@fedora ~]# touch /mnt/fedora/krbfoo
[root@fedora ~]# umount /mnt/fedora
[root@fedora ~]# mount -t nfs -ov4.1,sec=krb5p fedora.fh.bcodding.com:/exports /mnt/fedora
[root@fedora ~]# touch /mnt/fedora/krbfoo
[root@fedora ~]# touch /mnt/fedora/krbfoo
[root@fedora ~]# umount /mnt/fedora

.. so I can't seem to reproduce this because I can't even mount.  Did you
perhaps have a previous mount with that client with a different security
flavor?

I think your NFS client ended up having a machine credential with krb5i, but
then the export with that flavor was removed.  When the client tried to do a
CLOSE, it picked the machine cred, which caused the server to fail in the
lookup of the export, which caused the client to think its stateid was
wrong.

Ben