Re: Random IO errors on nfs clients running linux > 4.20

> On Oct 26, 2020, at 11:26 AM, Alberto Gonzalez Iniesta <alberto.gonzalez@xxxxxxxx> wrote:
> 
> On Mon, Oct 26, 2020 at 11:06:24AM -0400, Chuck Lever wrote:
>> 
>> 
>>> On Oct 26, 2020, at 11:02 AM, Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>>> 
>>> On Mon, Oct 26, 2020 at 10:46:05AM -0400, Chuck Lever wrote:
>>>> 
>>>> 
>>>>> On Oct 26, 2020, at 10:43 AM, Alberto Gonzalez Iniesta <alberto.gonzalez@xxxxxxxx> wrote:
>>>>> 
>>>>> On Mon, Oct 26, 2020 at 09:58:17AM -0400, Chuck Lever wrote:
>>>>>>>> So all I notice from this one is the readdir EIO came from call_decode.
>>>>>>>> I suspect that means it failed in the xdr decoding.  Looks like xdr
>>>>>>>> decoding of the actual directory data (which is the complicated part) is
>>>>>>>> done later, so this means it failed decoding the header or verifier,
>>>>>>>> which is a little odd:
>>>>>>>> 
>>>>>>>>> Sep  8 16:03:23 portatil264 kernel: [15033.016276] RPC:  3284 call_decode result -5
>>>>>>>>> Sep  8 16:03:23 portatil264 kernel: [15033.016281] nfs41_sequence_process: Error 1 free the slot 
>>>>>>>>> Sep  8 16:03:23 portatil264 kernel: [15033.016286] RPC:       wake_up_first(00000000d3f50f4d "ForeChannel Slot table")
>>>>>>>>> Sep  8 16:03:23 portatil264 kernel: [15033.016288] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
>>>>>>>>> Sep  8 16:03:23 portatil264 kernel: [15033.016290] RPC:  3284 return 0, status -5
>>>>>>>>> Sep  8 16:03:23 portatil264 kernel: [15033.016291] RPC:  3284 release task
>>>>>>>>> Sep  8 16:03:23 portatil264 kernel: [15033.016295] RPC:       freeing buffer of size 4144 at 00000000a3649daf
>>>>>>>>> Sep  8 16:03:23 portatil264 kernel: [15033.016298] RPC:  3284 release request 0000000079df89b2
>>>>>>>>> Sep  8 16:03:23 portatil264 kernel: [15033.016300] RPC:       wake_up_first(00000000c5ee49ee "xprt_backlog")
>>>>>>>>> Sep  8 16:03:23 portatil264 kernel: [15033.016302] RPC:       rpc_release_client(00000000b930c343)
>>>>>>>>> Sep  8 16:03:23 portatil264 kernel: [15033.016304] RPC:  3284 freeing task
>>>>>>>>> Sep  8 16:03:23 portatil264 kernel: [15033.016309] _nfs4_proc_readdir: returns -5
>>>>>>>>> Sep  8 16:03:23 portatil264 kernel: [15033.016318] NFS: readdir(departamentos/innovacion) returns -5
>>>>>>> 
>>>>>>> Hi, Bruce et al.
>>>>>>> 
>>>>>>> Is there anything we can do to help debugging/fixing this? It's still
>>>>>>> biting our users with a +4.20.x kernel.
>>>>>> 
>>>>>> Alberto, can you share a snippet of a raw network capture that shows
>>>>>> the READDIR Reply that fails to decode?
>>>>> 
>>>>> Hi, Chuck.
>>>>> 
>>>>> Thanks for your reply. We're using "sec=krb5p", which makes the network
>>>>> capture useless :-(
>>>> 
>>>> You can plug keytabs into Wireshark to enable it to decrypt the traffic.
>>> 
>>> Just skimming that range of history, there's some changes to the
>>> handling of gss sequence numbers, I wonder if there's a chance he could
>>> be hitting that?  You had a workload that would lead to calls dropping
>>> out of the sequence number window, didn't you, Chuck?  Is there a quick
>>> way to check whether that's happening?
>> 
>> The server is supposed to drop the connection when that happens, though
>> I'm not sure 4.20's NFSD was perfect in that regard. Connection loss in
>> itself wouldn't result in EIO.
> 
> In case this is relevant, server is running 3.16.0. Clients (with
> issues) +4.20.

Ah, I see. Well, that's an old kernel. Have you engaged your distributor?
They might be able to provide builds with debugging instrumentation, for
example, if we can give them some instructions or a patch.

My experience tells me that this is probably an issue with either the
server's GSS wrap function, or the client's GSS unwrap function, if
you don't ever see this failure without krb5p in the picture.

If you don't see this problem on older clients, then I would start
looking at the client's GSS unwrap function, fwiw.
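
To line up the failing replies with a decrypted capture, it may help to
pull the failing RPC task IDs and kernel timestamps out of the rpcdebug
output first. Below is a minimal sketch in Python. It assumes the log
lines look exactly like the "call_decode result -5" line quoted above;
the helper name find_decode_failures is made up for illustration and
is not part of any existing tool.

```python
import re
import sys

# Assumed line shape, taken from the rpcdebug output quoted in this thread:
#   kernel: [15033.016276] RPC:  3284 call_decode result -5
DECODE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+RPC:\s+(?P<task>\d+)\s+"
    r"call_decode result\s+(?P<result>-?\d+)"
)


def find_decode_failures(lines):
    """Return (kernel timestamp, RPC task id, result) for every
    call_decode line whose result is negative (for example -5, EIO)."""
    failures = []
    for line in lines:
        match = DECODE_RE.search(line)
        if match and int(match.group("result")) < 0:
            failures.append(
                (
                    float(match.group("ts")),
                    int(match.group("task")),
                    int(match.group("result")),
                )
            )
    return failures


if __name__ == "__main__":
    # Hypothetical usage: pipe the kernel log into the script on stdin.
    for ts, task, result in find_decode_failures(sys.stdin):
        print(f"{ts:.6f} task {task} call_decode result {result}")
```

Feeding it the kernel log on stdin prints one line per failed decode,
which gives a timestamp to search for in the capture.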


--
Chuck Lever






