> On Oct 26, 2020, at 11:02 AM, Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>
> On Mon, Oct 26, 2020 at 10:46:05AM -0400, Chuck Lever wrote:
>>
>>
>>> On Oct 26, 2020, at 10:43 AM, Alberto Gonzalez Iniesta <alberto.gonzalez@xxxxxxxx> wrote:
>>>
>>> On Mon, Oct 26, 2020 at 09:58:17AM -0400, Chuck Lever wrote:
>>>>>> So all I notice from this one is the readdir EIO came from call_decode.
>>>>>> I suspect that means it failed in the xdr decoding. Looks like xdr
>>>>>> decoding of the actual directory data (which is the complicated part) is
>>>>>> done later, so this means it failed decoding the header or verifier,
>>>>>> which is a little odd:
>>>>>>
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016276] RPC: 3284 call_decode result -5
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016281] nfs41_sequence_process: Error 1 free the slot
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016286] RPC: wake_up_first(00000000d3f50f4d "ForeChannel Slot table")
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016288] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016290] RPC: 3284 return 0, status -5
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016291] RPC: 3284 release task
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016295] RPC: freeing buffer of size 4144 at 00000000a3649daf
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016298] RPC: 3284 release request 0000000079df89b2
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016300] RPC: wake_up_first(00000000c5ee49ee "xprt_backlog")
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016302] RPC: rpc_release_client(00000000b930c343)
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016304] RPC: 3284 freeing task
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016309] _nfs4_proc_readdir: returns -5
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016318] NFS: readdir(departamentos/innovacion) returns -5
>>>>>
>>>>> Hi, Bruce et al.
>>>>>
>>>>> Is there anything we can do to help debugging/fixing this? It's still
>>>>> biting our users with a +4.20.x kernel.
>>>>
>>>> Alberto, can you share a snippet of a raw network capture that shows
>>>> the READDIR Reply that fails to decode?
>>>
>>> Hi, Chuck.
>>>
>>> Thanks for your reply. We're using "sec=krb5p", which makes the network
>>> capture useless :-(
>>
>> You can plug keytabs into Wireshark to enable it to decrypt the traffic.
>
> Just skimming that range of history, there's some changes to the
> handling of gss sequence numbers, I wonder if there's a chance he could
> be hitting that? You had a workload that would lead to calls dropping
> out of the sequence number window, didn't you, Chuck? Is there a quick
> way to check whether that's happening?

The server is supposed to drop the connection when that happens, though
I'm not sure 4.20's NFSD was perfect in that regard. Connection loss in
itself wouldn't result in EIO.

--
Chuck Lever
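
For anyone following along, here is a rough sketch of what "dropping
out of the sequence number window" means, assuming the RPCSEC_GSS
window rules described in RFC 2203. It is not NFSD's code: the class
name, method name, and return strings are hypothetical, and what the
server does with a rejected call (discard it, or drop the connection
as mentioned above) is deliberately not modeled.

```python
class SeqWindow:
    """Illustrative RPCSEC_GSS-style sequence window (hypothetical names)."""

    def __init__(self, size):
        self.size = size      # window size advertised by the server
        self.highest = 0      # highest sequence number seen so far
        self.seen = set()     # sequence numbers seen inside the window

    def check(self, seq):
        if seq > self.highest:
            # A newer sequence number moves the window forward.
            self.highest = seq
            low = self.highest - self.size + 1
            self.seen = {s for s in self.seen if s >= low}
            self.seen.add(seq)
            return "accept"
        if seq <= self.highest - self.size:
            # Older than the lowest number the window still covers.
            return "below-window"
        if seq in self.seen:
            # Inside the window but already used.
            return "replay"
        self.seen.add(seq)
        return "accept"


if __name__ == "__main__":
    w = SeqWindow(size=4)
    for seq in (1, 10, 6, 7, 7):
        print(seq, w.check(seq))
```

With a window of 4 and a highest-seen value of 10, the window covers 7
through 10, so a late call carrying 6 falls below it. A workload that
keeps many calls in flight can hit that case whenever one of them is
delayed long enough.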