Hey all,

I would really appreciate some help here. I took a capture on the
client when this was happening and, as you can see from it, it churns
through the same requests dozens of times over and over again.

https://filebin.net/zs9hfipxbz2mn7i8

Thanks in advance,
Doma

On Fri, Apr 24, 2020 at 3:45 PM sea you <seayou@xxxxxxxxx> wrote:
>
> Hey all,
>
> I would really appreciate some help here. I took a capture on the
> client when this was happening and, as you can see from it, it churns
> through the same requests dozens of times over and over again.
>
> https://filebin.net/zs9hfipxbz2mn7i8
>
> Thanks in advance.
> Doma
>
> On Mon, Apr 20, 2020 at 4:32 PM sea you <seayou@xxxxxxxxx> wrote:
>>
>> Dear all,
>>
>> From time to time we're plagued with a lot of TEST_STATEID RPC calls
>> on a 4.15.0-88 (Ubuntu Bionic) kernel, where the clients (~310 VMs)
>> are using either 4.19.106 or 4.19.107 (Flatcar Linux). What we see
>> during these "storms" is that _some_ clients are testing the same
>> stateid for callback over and over, like:
>>
>> [Thu Apr 9 15:18:57 2020] NFS reply test_stateid: succeeded, 0
>> [Thu Apr 9 15:18:57 2020] NFS call test_stateid 00000000ec5d02eb
>> [Thu Apr 9 15:18:57 2020] --> nfs41_call_sync_prepare data->seq_server 000000006dfc86c9
>> [Thu Apr 9 15:18:57 2020] --> nfs4_alloc_slot used_slots=0000 highest_used=4294967295 max_slots=31
>> [Thu Apr 9 15:18:57 2020] <-- nfs4_alloc_slot used_slots=0001 highest_used=0 slotid=0
>> [Thu Apr 9 15:18:57 2020] encode_sequence: sessionid=1585584999:2538115180:5741:0 seqid=13899229 slotid=0 max_slotid=0 cache_this=0
>> [Thu Apr 9 15:18:57 2020] nfs41_handle_sequence_flag_errors: "10.1.4.65" (client ID 671b825e6c904897) flags=0x00000040
>> [Thu Apr 9 15:18:57 2020] --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=31
>> [Thu Apr 9 15:18:57 2020] <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
>> [Thu Apr 9 15:18:57 2020] nfs4_free_slot: slotid 1 highest_used_slotid 0
>> [Thu Apr 9 15:18:57 2020] nfs41_sequence_process: Error 0 free the slot
>> [Thu Apr 9 15:18:57 2020] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
>> [Thu Apr 9 15:18:57 2020] NFS reply test_stateid: succeeded, 0
>> [Thu Apr 9 15:18:57 2020] NFS call test_stateid 00000000ec5d02eb
>> [Thu Apr 9 15:18:57 2020] --> nfs41_call_sync_prepare data->seq_server 000000006dfc86c9
>> [Thu Apr 9 15:18:57 2020] --> nfs4_alloc_slot used_slots=0000 highest_used=4294967295 max_slots=31
>> [Thu Apr 9 15:18:57 2020] <-- nfs4_alloc_slot used_slots=0001 highest_used=0 slotid=0
>> [Thu Apr 9 15:18:57 2020] encode_sequence: sessionid=1585584999:2538115180:5741:0 seqid=13899230 slotid=0 max_slotid=0 cache_this=0
>> [Thu Apr 9 15:18:57 2020] nfs41_handle_sequence_flag_errors: "10.1.4.65" (client ID 671b825e6c904897) flags=0x00000040
>> [Thu Apr 9 15:18:57 2020] --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=31
>> [Thu Apr 9 15:18:57 2020] <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
>> [Thu Apr 9 15:18:57 2020] nfs4_free_slot: slotid 1 highest_used_slotid 0
>> [Thu Apr 9 15:18:57 2020] nfs41_sequence_process: Error 0 free the slot
>> [Thu Apr 9 15:18:57 2020] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
>> [Thu Apr 9 15:18:57 2020] NFS reply test_stateid: succeeded, 0
>>
>> Due to this, some processes on some clients get stuck and those nodes
>> need to be rebooted. Initially we thought we were facing the issue
>> that was fixed in 44f411c353bf, but as far as I can see we're already
>> running a kernel where that fix was backported via 90d73c1cadb8.
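>>
>> For reference, this kind of client-side trace and capture can be
>> gathered with something like the commands below; the interface and
>> server names are placeholders and the exact flags are only a sketch:
>>
>>   rpcdebug -m nfs -s all     # enable NFS client debug output in dmesg
>>   rpcdebug -m rpc -s all     # optionally RPC-level debugging as well
>>   tcpdump -i <iface> -s 0 -w nfs-client.pcap host <server> and port 2049
>>   rpcdebug -m nfs -c all     # switch the NFS debug output off again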
>>
>> Clients are mounting with
>> "rw,nosuid,nodev,noexec,noatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=600,acregmax=600,acdirmin=600,acdirmax=600,hard,proto=tcp,timeo=600,retrans=2,sec=sys".
>>
>> The export options are
>> "<world>(rw,async,wdelay,crossmnt,no_root_squash,no_subtree_check,fsid=762,sec=sys,rw,secure,no_root_squash,no_all_squash)"
>> (fsid varies from export to export, of course).
>>
>> Our workload is super metadata-heavy (PHP), and the data being served
>> changes a lot as clients are uploading files etc.
>>
>> We have a similar setup where the clients are 4.19.(6|7)8 (CoreOS) and
>> the server is 4.15.0-76, and there we rarely see these TEST_STATEID
>> RPC calls.
>>
>> It's worth mentioning that the main difference between the setup that
>> is okay and the one that is not is the block size of the backing
>> filesystem (ZFS): the one with 512-byte blocks is fine, the one with
>> 4k blocks isn't. I'm unsure how that would affect NFS at all, though.
>>
>> The issue manifests at least once every day.
>>
>> Can you please point me in a direction that I should check further?
>>
>> Doma
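
For completeness, here is roughly what the mount and export options
quoted above look like as concrete /etc/fstab and /etc/exports entries.
The server name, paths and client network below are placeholders, not
our real values, and I've left namlen out of the fstab line since, as
far as I know, that value is reported by the kernel rather than passed
as a mount option:

  # client-side /etc/fstab entry (placeholder server and paths)
  nfs-server:/export/www  /var/www  nfs4  rw,nosuid,nodev,noexec,noatime,vers=4.2,rsize=1048576,wsize=1048576,acregmin=600,acregmax=600,acdirmin=600,acdirmax=600,hard,proto=tcp,timeo=600,retrans=2,sec=sys  0  0

  # server-side /etc/exports entry (placeholder path and network; fsid varies per export)
  /export/www  10.1.0.0/16(rw,async,wdelay,crossmnt,no_root_squash,no_subtree_check,fsid=762,sec=sys,secure,no_all_squash)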
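
And in case it is relevant for the block size point above: the
sector/record sizes of the backing ZFS storage can be checked with
something like the commands below (pool and dataset names are
placeholders, and whether ashift or recordsize is the property that
actually matters here is an assumption on my part):

  zpool get ashift tank              # 9 = 512-byte sectors, 12 = 4k sectors
  zfs get recordsize tank/nfs-data   # per-dataset record size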