Re: RDMA Read: Local protection error

> On Apr 29, 2016, at 12:44 PM, Santosh Shilimkar <santosh.shilimkar@xxxxxxxxxx> wrote:
> 
> 
> 
> On 4/29/2016 9:24 AM, Chuck Lever wrote:
>> I've found some new behavior, recently, while testing the
>> v4.6-rc Linux NFS/RDMA client and server.
>> 
>> When certain kernel memory debugging CONFIG options are
>> enabled, 1MB NFS WRITEs can sometimes result in an
>> IB_WC_LOC_PROT_ERR. I usually turn on most of them because
>> I want to see any problems, so I'm not sure which option
>> in particular is exposing the issue.
>> 
>> When debugging is enabled on the server, and the underlying
>> device is using FRWR to register the sink buffer, an RDMA
>> Read occasionally completes with LOC_PROT_ERR.
>> 
>> When debugging is enabled on the client, and the underlying
>> device uses FRWR to register the target of an RDMA Read, an
>> ingress RDMA Read request sometimes gets a Syndrome 99
>> (REM_OP_ERR) acknowledgement, and a subsequent RDMA Receive
>> on the client completes with LOC_PROT_ERR.
>> 
>> I do not see this problem when kernel memory debugging is
>> disabled, or when the client is using FMR, or when the
>> server is using physical addresses to post its RDMA Read WRs,
>> or when wsize is 512KB or smaller.
>> 
>> I have not found any obvious problems with the client logic
>> that registers NFS WRITE buffers, nor the server logic that
>> constructs and posts RDMA Read WRs.
>> 
> One possibility here could be a mismatch in the posted WRs for
> send/receive. Can you check whether, in certain cases, you are
> posting receive WRs that can't handle what the send is putting
> on the wire?

I've confirmed that the client is posting only 1024-byte
Receive buffers, and that the ib_sge for each Receive
operation is the same before and after the Receive is
posted (i.e., the Receive ib_sge is valid and is not
getting overwritten somehow).
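
A before/after check of this kind can be sketched as a snapshot taken
when the Receive is posted and compared at completion time. The
following is a minimal userspace C sketch, not the xprtrdma code: the
struct is a stand-in that assumes the same addr/length/lkey fields as
the kernel's struct ib_sge, and sge_snapshot, sge_save, and
sge_unchanged are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's struct ib_sge (assumption: same three fields). */
struct sge_snapshot {
	uint64_t addr;    /* DMA address of the Receive buffer */
	uint32_t length;  /* buffer length in bytes */
	uint32_t lkey;    /* local key covering the buffer */
};

/* Record the SGE as it looked when the Receive was posted (hypothetical helper). */
static struct sge_snapshot sge_save(uint64_t addr, uint32_t length, uint32_t lkey)
{
	struct sge_snapshot s = { .addr = addr, .length = length, .lkey = lkey };
	return s;
}

/* True if the SGE observed at completion time matches the saved copy. */
static bool sge_unchanged(const struct sge_snapshot *saved,
			  const struct sge_snapshot *now)
{
	return saved->addr == now->addr &&
	       saved->length == now->length &&
	       saved->lkey == now->lkey;
}
```

If sge_unchanged() returns false for any Receive, something modified
the entry between posting and completion.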

The wire traffic contains Send Only requests of 230 or so
bytes. If an ingress Send is too large, the Receive should
complete with IB_WC_LOC_LEN_ERR, not IB_WC_LOC_PROT_ERR.
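
The size reasoning above can be written down as a small check. This is
only a sketch in userspace C: the enum and the function
expected_recv_outcome are hypothetical stand-ins for illustration, not
the kernel's enum ib_wc_status or any verbs API call.

```c
#include <stdint.h>

/* Hypothetical outcomes; RECV_LEN_ERR corresponds to IB_WC_LOC_LEN_ERR. */
enum recv_outcome {
	RECV_OK,       /* the ingress Send fits in the posted Receive buffer */
	RECV_LEN_ERR,  /* the ingress Send is larger than the Receive buffer */
};

/*
 * Given the posted Receive buffer length and the size of an ingress
 * Send, return the outcome a size mismatch alone would produce.
 */
static enum recv_outcome expected_recv_outcome(uint32_t recv_buf_len,
					       uint32_t send_len)
{
	return send_len > recv_buf_len ? RECV_LEN_ERR : RECV_OK;
}
```

With the numbers from this thread, expected_recv_outcome(1024, 230)
yields RECV_OK, so a Send/Receive size mismatch does not account for
the LOC_PROT_ERR.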

The server disconnects due to the REM_OP_ERR. The
LOC_PROT_ERR completion appears to be the first Receive
completion after the QP is reconnected.

The client-side error completion on the Receive WR seems
to be a latent report of an earlier problem with an ingress
RDMA Read request.


>> My next step is to bisect. But first, I was wondering if
>> this behavior might be related to the recent problems with
>> s/g lists seen with iSER/SRP? i.e., is this a recognized
>> issue?
>> 
>> 
>> --
>> Chuck Lever
>> 
>> 
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 

--
Chuck Lever





