> On Aug 8, 2017, at 11:45 AM, Max Gurtovoy <maxg@xxxxxxxxxxxx> wrote:
> 
> Hi all,
> sorry for the late response.
> 
> On 6/27/2017 5:56 PM, Chuck Lever wrote:
>> Hi Sagi-
>> 
>>> On Jun 27, 2017, at 5:28 AM, Sagi Grimberg <sagi@xxxxxxxxxxx> wrote:
>>> 
>>>> While running xfstests on an NFS/RDMA mount, I see this in
>>>> the client's /var/log/messages multiple times:
>>>> 
>>>> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
>>>> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
>>>> 
>>>> As far as I can tell the client is able to recover and continue
>>>> the test. However, this error is not supposed to happen in normal
>>>> operation.
>>>> 
>>>> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
>>> 
>>> Is this a regression?
>> 
>> I can't answer that question with authority, because I just
>> started trying out NFS/RDMA on RoCE with mlx5. But Robert has
>> reported very similar symptoms with iSER on v4.9. It appears
>> to have been around for a while, if these are the same.
>> 
>>> What kernel version are you running?
>> 
>> v4.12-rc2.
>> 
>>> FW revision?
>> 
>> 12.18.2000
>> 
>>> Is the below commit applied?
>> 
>> This commit does not appear to be applied to my kernel.
>> 
>>> commit 6e8484c5cf07c7ee632587e98c1a12d319dacb7c
>>> Author: Max Gurtovoy <maxg@xxxxxxxxxxxx>
>>> Date:   Sun May 28 10:53:11 2017 +0300
>>> 
>>>     RDMA/mlx5: set UMR wqe fence according to HCA cap
>>> 
>>>     Cache the needed umr_fence and set the wqe ctrl segment
>>>     accordingly.
>>> 
>>>     Signed-off-by: Max Gurtovoy <maxg@xxxxxxxxxxxx>
>>>     Acked-by: Leon Romanovsky <leon@xxxxxxxxxx>
>>>     Reviewed-by: Sagi Grimberg <sagi@xxxxxxxxxxx>
>>>     Signed-off-by: Doug Ledford <dledford@xxxxxxxxxx>
>>> 
>>> This is the only thing that changed in that area
>>> lately...
>>> 
>>> Can you try without it?
>> 
>> I haven't tried with it. I can pull it and see if it helps.
> 
> Chuck,
> any updates using my patch above (you actually need commit
> 1410a90ae449061b7e1ae19d275148f36948801b as a precondition)?

My client is at v4.13-rc3 now, and I haven't seen this issue
recur recently.

>> I have tried:
>> 
>> - with and without IOMMU enabled
>> - with RoCE v1 and v2
>> - with instrumentation:
>> 
>> This can happen to any MR at any time after any number of
>> uses. It does not appear to be "sticky" (ie, xprtrdma
>> recovery from a memory management error clears the problem
>> successfully by releasing the MR and allocating a new one).
> 
> I'm not so familiar with the NFS/RDMA I/O path yet, but are you
> using remote invalidation from the server side, or do you run local
> invalidation? Which side initiates the RDMA_READ/WRITE operations?

Remote Invalidation should be in use, but I haven't confirmed
that. The storage target (the NFS server) issues the RDMA Read
and RDMA Write operations.

>> So it feels like a f/w or driver problem to me, at this
>> point.

--
Chuck Lever