> On Apr 29, 2016, at 12:44 PM, Santosh Shilimkar <santosh.shilimkar@xxxxxxxxxx> wrote:
>
> On 4/29/2016 9:24 AM, Chuck Lever wrote:
>> I've found some new behavior, recently, while testing the
>> v4.6-rc Linux NFS/RDMA client and server.
>>
>> When certain kernel memory debugging CONFIG options are
>> enabled, 1MB NFS WRITEs can sometimes result in a
>> IB_WC_LOC_PROT_ERR. I usually turn on most of them because
>> I want to see any problems, so I'm not sure which option
>> in particular is exposing the issue.
>>
>> When debugging is enabled on the server, and the underlying
>> device is using FRWR to register the sink buffer, an RDMA
>> Read occasionally completes with LOC_PROT_ERR.
>>
>> When debugging is enabled on the client, and the underlying
>> device uses FRWR to register the target of an RDMA Read, an
>> ingress RDMA Read request sometimes gets a Syndrome 99
>> (REM_OP_ERR) acknowledgement, and a subsequent RDMA Receive
>> on the client completes with LOC_PROT_ERR.
>>
>> I do not see this problem when kernel memory debugging is
>> disabled, or when the client is using FMR, or when the
>> server is using physical addresses to post its RDMA Read WRs,
>> or when wsize is 512KB or smaller.
>>
>> I have not found any obvious problems with the client logic
>> that registers NFS WRITE buffers, nor the server logic that
>> constructs and posts RDMA Read WRs.
>>
> One possibility here could be the mismatch in posted WR for
> send/receive. Can you check if for certain cases you are
> posting receive WRs which can't handle whats send is putting
> on the wire.

I've confirmed that the client is posting only 1024-byte
Receive buffers, and that the ib_sge for each Receive
operation is the same before and after the Receive is posted
(ie, the Receive ib_sge is valid and is not getting
overwritten somehow). The wire traffic contains Send Only
requests of 230 or so bytes.

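To make the size comparison concrete, here is a minimal
userspace sketch of the check being described. It is not
the xprtrdma or svcrdma code: the enum, its values, and the
helper name model_recv_status() are made up for illustration,
and only the idea (a Send larger than the posted Receive
buffer is reported as a length error) comes from the
discussion above.

```c
#include <stddef.h>

/*
 * Illustrative stand-ins only. These are NOT the kernel's
 * enum ib_wc_status values; they just name the two outcomes
 * relevant to the size check.
 */
enum model_wc_status {
	MODEL_WC_SUCCESS,
	MODEL_WC_LOC_LEN_ERR,	/* models IB_WC_LOC_LEN_ERR */
};

/*
 * Hypothetical helper: the completion status one would expect
 * for a Receive when an ingress Send of send_len bytes arrives
 * for a posted Receive buffer of recv_len bytes.
 */
static enum model_wc_status
model_recv_status(size_t recv_len, size_t send_len)
{
	if (send_len > recv_len)
		return MODEL_WC_LOC_LEN_ERR;
	return MODEL_WC_SUCCESS;
}
```

With the numbers above, model_recv_status(1024, 230) yields
MODEL_WC_SUCCESS, so a buffer-size mismatch would not account
for the failures seen here.
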
If an ingress Send is too large, the Receive should complete
with IB_WC_LOC_LEN_ERR, not IB_WC_LOC_PROT_ERR.

The server disconnects due to the REM_OP_ERR. The
LOC_PROT_ERR completion appears to be the first Receive
completion after the QP is reconnected.

The client-side error completion on the Receive WR seems to
be a latent report of an earlier problem with an ingress
RDMA Read request.

>> My next step is to bisect. But first, I was wondering if
>> this behavior might be related to the recent problems with
>> s/g lists seen with iSER/SRP? ie, is this a recognized
>> issue?
>>
>>
>> --
>> Chuck Lever
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>

--
Chuck Lever