On Tue, Aug 04, 2020 at 11:34:05AM -0400, Chuck Lever wrote:
>
> > On Aug 4, 2020, at 9:53 AM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> >
> >> On Aug 4, 2020, at 9:46 AM, Leon Romanovsky <leon@xxxxxxxxxx> wrote:
> >>
> >> On Tue, Aug 04, 2020 at 09:12:55AM -0400, Chuck Lever wrote:
> >>>
> >>>> On Aug 4, 2020, at 9:08 AM, Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>> On 04.08.2020 14:49, Chuck Lever wrote:
> >>>>> Timo, I tend to think this is not a configuration issue.
> >>>>> Do you know of a known working kernel?
> >>>>
> >>>> This is a brand new system, it's never been running with any kernel older than 5.4, and downgrading it to 4.19 or something else while in operation is unfortunately not easily possible. For a client it would definitely not be out of the question, but the main nfs server I cannot easily downgrade.
> >>>>
> >>>> Also keep in mind that the dmesg spam happens on both server and client simultaneously.
> >>>
> >>> Let's start with the client only, since restarting it seems to clear the problem.
> >>
> >> It is client because according to the server CQE errors, it is Remote_Invalid_Request_Error
> >> with "9.7.5.2.2 NAK CODES" from IBTA.
> >
> > Thanks! OK, then let's use ftrace.
> >
> > Timo, can you install trace-cmd on your client? Then:
> >
> > 1. # trace-cmd record -e rpcrdma -e sunrpc
> >
> > 2. Trigger the problem
> >
> > 3. Control-C the trace-cmd, and copy the trace.dat file to another system
> >
> > 4. reboot your client
> >
> > Then send me your trace.dat. You don't have to cc the mailing lists.
>
> I see a LOC_LEN_ERR on a Receive. Leon, doesn't that mean the server's
> Send was too large?

1. We have a resp_local_length_error counter. It would help to check it on
both the server and the clients.

[leonro@vm ~]$ cat /sys/class/infiniband/ibp0s9/ports/1/hw_counters/resp_local_length_error
0

resp_local_length_error - "Number of times responder detected local length errors."

2. LOC_LEN_ERR is consistent with what is written in the CQE error on the
client. This is what is written in our HW document:

IB compliant completion with error syndrome 0x1: Local_Length_Error

3. From IBTA, 11.6.2 COMPLETION RETURN STATUS

Local Length Error - Generated for a Work Request posted to the local
Send Queue when the sum of the Data Segment lengths exceeds the message
length for the channel adapter port. Generated for a Work Request posted
to the local Receive Queue when the sum of the Data Segment lengths is
too small to receive a valid incoming message or the length of the
incoming message is greater than the maximum message size supported by
the HCA port that received the message.

So if "1" works :), we will be able to distinguish whether the client sends
a WR that is too large or receives a message that is too large.

Thanks

>
> Timo, what filesystem are you sharing on your NFS server? The thing that
> comes to mind is https://bugzilla.kernel.org/show_bug.cgi?id=198053
>
> --
> Chuck Lever
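
For "1" above, here is a rough sketch of how the counter could be collected on every device and port at once, rather than one cat per port. It assumes the same sysfs layout as in my example (hw_counters/resp_local_length_error under each port); drivers that do not expose this counter are simply skipped. The function name print_len_errors and the optional root-directory argument are made up for illustration and are not part of any tool.

```shell
#!/bin/sh
# Sketch only: print resp_local_length_error for every RDMA device/port.
# Assumes the sysfs layout shown in the example above.
# $1 (optional): sysfs root, overridable so the loop can be tried on a fake tree.
# $2 (optional): counter file name under hw_counters/.
print_len_errors() {
    root=${1:-/sys/class/infiniband}
    counter=${2:-resp_local_length_error}
    for f in "$root"/*/ports/*/hw_counters/"$counter"; do
        # Skip when the glob matched nothing or the counter is not exposed.
        [ -r "$f" ] || continue
        port_dir=${f%/hw_counters/*}    # <root>/<dev>/ports/<port>
        port=${port_dir##*/}
        dev_dir=${port_dir%/ports/*}    # <root>/<dev>
        dev=${dev_dir##*/}
        printf '%s port %s %s=%s\n' "$dev" "$port" "$counter" "$(cat "$f")"
    done
}

print_len_errors
```

Running it on the server and on the client before and after triggering the problem, then comparing the values, should show on which side the responder detected the length error.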