Hey all,

I would really appreciate some help here. I took a capture on the
client when this was happening and, as you can see from it, it churns
through the same requests dozens of times over and over again.

https://filebin.net/zs9hfipxbz2mn7i8

Thanks in advance,
Doma

On Fri, Apr 24, 2020 at 3:45 PM sea you <seayou@xxxxxxxxx> wrote:
>
> Hey all,
>
> I would really appreciate some help here. I took a capture on the
> client when this was happening and, as you can see from it, it churns
> through the same requests dozens of times over and over again.
>
> https://filebin.net/zs9hfipxbz2mn7i8
>
> Thanks in advance.
> Doma
>
> On Mon, Apr 20, 2020 at 4:32 PM sea you <seayou@xxxxxxxxx> wrote:
>>
>> Dear all,
>>
>> From time to time we're plagued with a lot of TEST_STATEID RPC calls
>> on a 4.15.0-88 (Ubuntu Bionic) kernel, where the clients (~310 VMs)
>> are using either 4.19.106 or 4.19.107 (Flatcar Linux). What we see
>> during these "storms" is that _some_ clients are testing the same
>> stateid for callback over and over, like:
>>
>> [Thu Apr 9 15:18:57 2020] NFS reply test_stateid: succeeded, 0
>> [Thu Apr 9 15:18:57 2020] NFS call test_stateid 00000000ec5d02eb
>> [Thu Apr 9 15:18:57 2020] --> nfs41_call_sync_prepare data->seq_server 000000006dfc86c9
>> [Thu Apr 9 15:18:57 2020] --> nfs4_alloc_slot used_slots=0000 highest_used=4294967295 max_slots=31
>> [Thu Apr 9 15:18:57 2020] <-- nfs4_alloc_slot used_slots=0001 highest_used=0 slotid=0
>> [Thu Apr 9 15:18:57 2020] encode_sequence: sessionid=1585584999:2538115180:5741:0 seqid=13899229 slotid=0 max_slotid=0 cache_this=0
>> [Thu Apr 9 15:18:57 2020] nfs41_handle_sequence_flag_errors: "10.1.4.65" (client ID 671b825e6c904897) flags=0x00000040
>> [Thu Apr 9 15:18:57 2020] --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=31
>> [Thu Apr 9 15:18:57 2020] <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
>> [Thu Apr 9 15:18:57 2020] nfs4_free_slot: slotid 1 highest_used_slotid 0
>> [Thu Apr 9 15:18:57 2020] nfs41_sequence_process: Error 0 free the slot
>> [Thu Apr 9 15:18:57 2020] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
>> [Thu Apr 9 15:18:57 2020] NFS reply test_stateid: succeeded, 0
>> [Thu Apr 9 15:18:57 2020] NFS call test_stateid 00000000ec5d02eb
>> [Thu Apr 9 15:18:57 2020] --> nfs41_call_sync_prepare data->seq_server 000000006dfc86c9
>> [Thu Apr 9 15:18:57 2020] --> nfs4_alloc_slot used_slots=0000 highest_used=4294967295 max_slots=31
>> [Thu Apr 9 15:18:57 2020] <-- nfs4_alloc_slot used_slots=0001 highest_used=0 slotid=0
>> [Thu Apr 9 15:18:57 2020] encode_sequence: sessionid=1585584999:2538115180:5741:0 seqid=13899230 slotid=0 max_slotid=0 cache_this=0
>> [Thu Apr 9 15:18:57 2020] nfs41_handle_sequence_flag_errors: "10.1.4.65" (client ID 671b825e6c904897) flags=0x00000040
>> [Thu Apr 9 15:18:57 2020] --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=31
>> [Thu Apr 9 15:18:57 2020] <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
>> [Thu Apr 9 15:18:57 2020] nfs4_free_slot: slotid 1 highest_used_slotid 0
>> [Thu Apr 9 15:18:57 2020] nfs41_sequence_process: Error 0 free the slot
>> [Thu Apr 9 15:18:57 2020] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
>> [Thu Apr 9 15:18:57 2020] NFS reply test_stateid: succeeded, 0
>>
>> Due to this, some processes on some clients get stuck and those nodes
>> need to be rebooted. Initially we thought we were facing the issue
>> that was fixed in 44f411c353bf, but as far as I can see we're already
>> running a kernel where that fix was backported via 90d73c1cadb8.
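>>
>> For reference, this kind of client-side trace and capture can be
>> gathered with something like the commands below; the interface and
>> server names are placeholders and the exact flags are only a sketch:
>>
>>   rpcdebug -m nfs -s all     # enable NFS client debug output in dmesg
>>   rpcdebug -m rpc -s all     # optionally RPC-level debugging as well
>>   tcpdump -i <iface> -s 0 -w nfs-client.pcap host <server> and port 2049
>>   rpcdebug -m nfs -c all     # switch the NFS debug output off again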
>>
>> Clients are mounting with
>> "rw,nosuid,nodev,noexec,noatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=600,acregmax=600,acdirmin=600,acdirmax=600,hard,proto=tcp,timeo=600,retrans=2,sec=sys".
>>
>> The export options are
>> "<world>(rw,async,wdelay,crossmnt,no_root_squash,no_subtree_check,fsid=762,sec=sys,rw,secure,no_root_squash,no_all_squash)"
>> (fsid varies from export to export, of course).
>>
>> Our workload is super metadata-heavy (PHP), and the data being served
>> changes a lot as clients are uploading files etc.
>>
>> We have a similar setup where the clients are 4.19.(6|7)8 (CoreOS) and
>> the server is 4.15.0-76, and there we rarely see these TEST_STATEID
>> RPC calls.
>>
>> It's worth mentioning that the main difference between the setup that
>> is okay and the one that is not is the block size of the backing
>> filesystem (ZFS): the one with 512-byte blocks is fine, the one with
>> 4k blocks isn't. I'm unsure how that would affect NFS at all, though.
>>
>> The issue manifests at least once every day.
>>
>> Can you please point me in a direction that I should check further?
>>
>> Doma
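
For completeness, here is roughly what the mount and export options
quoted above look like as concrete /etc/fstab and /etc/exports entries.
The server name, paths and client network below are placeholders, not
our real values, and I've left namlen out of the fstab line since, as
far as I know, that value is reported by the kernel rather than passed
as a mount option:

  # client-side /etc/fstab entry (placeholder server and paths)
  nfs-server:/export/www  /var/www  nfs4  rw,nosuid,nodev,noexec,noatime,vers=4.2,rsize=1048576,wsize=1048576,acregmin=600,acregmax=600,acdirmin=600,acdirmax=600,hard,proto=tcp,timeo=600,retrans=2,sec=sys  0  0

  # server-side /etc/exports entry (placeholder path and network; fsid varies per export)
  /export/www  10.1.0.0/16(rw,async,wdelay,crossmnt,no_root_squash,no_subtree_check,fsid=762,sec=sys,secure,no_all_squash)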
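
And in case it is relevant for the block size point above: the
sector/record sizes of the backing ZFS storage can be checked with
something like the commands below (pool and dataset names are
placeholders, and whether ashift or recordsize is the property that
actually matters here is an assumption on my part):

  zpool get ashift tank              # 9 = 512-byte sectors, 12 = 4k sectors
  zfs get recordsize tank/nfs-data   # per-dataset record size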