On Mon, Oct 26, 2020 at 11:06:24AM -0400, Chuck Lever wrote:
>
>
> > On Oct 26, 2020, at 11:02 AM, Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
> >
> > On Mon, Oct 26, 2020 at 10:46:05AM -0400, Chuck Lever wrote:
> >>
> >>
> >>> On Oct 26, 2020, at 10:43 AM, Alberto Gonzalez Iniesta <alberto.gonzalez@xxxxxxxx> wrote:
> >>>
> >>> On Mon, Oct 26, 2020 at 09:58:17AM -0400, Chuck Lever wrote:
> >>>>>> So all I notice from this one is the readdir EIO came from call_decode.
> >>>>>> I suspect that means it failed in the xdr decoding. Looks like xdr
> >>>>>> decoding of the actual directory data (which is the complicated part) is
> >>>>>> done later, so this means it failed decoding the header or verifier,
> >>>>>> which is a little odd:
> >>>>>>
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016276] RPC: 3284 call_decode result -5
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016281] nfs41_sequence_process: Error 1 free the slot
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016286] RPC: wake_up_first(00000000d3f50f4d "ForeChannel Slot table")
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016288] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016290] RPC: 3284 return 0, status -5
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016291] RPC: 3284 release task
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016295] RPC: freeing buffer of size 4144 at 00000000a3649daf
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016298] RPC: 3284 release request 0000000079df89b2
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016300] RPC: wake_up_first(00000000c5ee49ee "xprt_backlog")
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016302] RPC: rpc_release_client(00000000b930c343)
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016304] RPC: 3284 freeing task
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016309] _nfs4_proc_readdir: returns -5
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016318] NFS: readdir(departamentos/innovacion) returns -5
> >>>>>
> >>>>> Hi, Bruce et al.
> >>>>>
> >>>>> Is there anything we can do to help debugging/fixing this? It's still
> >>>>> biting our users with a +4.20.x kernel.
> >>>>
> >>>> Alberto, can you share a snippet of a raw network capture that shows
> >>>> the READDIR Reply that fails to decode?
> >>>
> >>> Hi, Chuck.
> >>>
> >>> Thanks for your reply. We're using "sec=krb5p", which makes the network
> >>> capture useless :-(
> >>
> >> You can plug keytabs into Wireshark to enable it to decrypt the traffic.
> >
> > Just skimming that range of history, there's some changes to the
> > handling of gss sequence numbers, I wonder if there's a chance he could
> > be hitting that? You had a workload that would lead to calls dropping
> > out of the sequence number window, didn't you, Chuck? Is there a quick
> > way to check whether that's happening?
>
> The server is supposed to drop the connection when that happens, though
> I'm not sure 4.20's NFSD was perfect in that regard. Connection loss in
> itself wouldn't result in EIO.

In case this is relevant, server is running 3.16.0. Clients (with issues) +4.20.

-- 
Alberto González Iniesta | Universidad a Distancia
alberto.gonzalez@xxxxxxxx | de Madrid