Hi Sagi-

> On Jun 27, 2017, at 5:28 AM, Sagi Grimberg <sagi@xxxxxxxxxxx> wrote:
>
>> While running xfstests on an NFS/RDMA mount, I see this in
>> the client's /var/log/messages multiple times:
>>
>> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
>> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
>>
>> As far as I can tell the client is able to recover and continue
>> the test. However, this error is not supposed to happen in normal
>> operation.
>>
>> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
>
> Is this a regression?

I can't answer that question with authority, because I just
started trying out NFS/RDMA on RoCE with mlx5. But Robert has
reported very similar symptoms with iSER on v4.9. It appears
to have been around for a while, if these are the same.

> What kernel version are you running?

v4.12-rc2.

> FW revision?

12.18.2000

> Is the below commit applied?

This commit does not appear to be applied to my kernel.

> commit 6e8484c5cf07c7ee632587e98c1a12d319dacb7c
> Author: Max Gurtovoy <maxg@xxxxxxxxxxxx>
> Date: Sun May 28 10:53:11 2017 +0300
>
>     RDMA/mlx5: set UMR wqe fence according to HCA cap
>
>     Cache the needed umr_fence and set the wqe ctrl segmennt
>     accordingly.
>
>     Signed-off-by: Max Gurtovoy <maxg@xxxxxxxxxxxx>
>     Acked-by: Leon Romanovsky <leon@xxxxxxxxxx>
>     Reviewed-by: Sagi Grimberg <sagi@xxxxxxxxxxx>
>     Signed-off-by: Doug Ledford <dledford@xxxxxxxxxx>
>
> This is the only thing that changed in that area
> lately...
>
> Can you try without it?

I haven't tried with it. I can pull it and see if it helps.

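
For reference, the (6/0x78) pair in the rpcrdma message can be read
directly out of the error CQE dump quoted above. The following is a
minimal sketch, not driver code. It assumes the dump is the raw 64-byte
CQE printed as big-endian 32-bit words, and that the layout matches
struct mlx5_err_cqe as I understand it from include/linux/mlx5/device.h
(32 reserved bytes, srqn, 18 reserved bytes, vendor_err_synd, syndrome,
s_wqe_opcode_qpn, wqe_counter, signature, op_own). The function name
decode_err_cqe is hypothetical.

```python
import struct

# The four dump lines from the log: 16 big-endian 32-bit words (64 bytes).
DUMP = """
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 08007806 250000cd 024027d3
"""


def decode_err_cqe(dump):
    """Split a hex-dumped error CQE into its fields (assumed layout)."""
    raw = bytes.fromhex("".join(dump.split()))
    if len(raw) != 64:
        raise ValueError("expected a 64-byte CQE, got %d bytes" % len(raw))

    # Assumed layout, big-endian, no padding:
    #   32x  reserved
    #   I    srqn
    #   18x  reserved
    #   B    vendor_err_synd
    #   B    syndrome
    #   I    s_wqe_opcode_qpn (top byte: WQE opcode, low 24 bits: QP number)
    #   H    wqe_counter
    #   B    signature
    #   B    op_own (top nibble: CQE opcode)
    (srqn, vendor_err_synd, syndrome, opcode_qpn,
     wqe_counter, signature, op_own) = struct.unpack(">32xI18xBBIHBB", raw)

    return {
        "srqn": srqn,
        "vendor_err_synd": vendor_err_synd,
        "syndrome": syndrome,
        "wqe_opcode": opcode_qpn >> 24,
        "qpn": opcode_qpn & 0xFFFFFF,
        "wqe_counter": wqe_counter,
        "signature": signature,
        "cqe_opcode": op_own >> 4,
    }


if __name__ == "__main__":
    for name, value in decode_err_cqe(DUMP).items():
        print("%-16s 0x%x" % (name, value))
```

Under those assumptions, the dump decodes to syndrome 0x06 and vendor
syndrome 0x78, which lines up with the "(6/0x78)" in the log. The WQE
opcode byte reads 0x25, which I believe is the UMR opcode in mlx5 and
would be consistent with a failed fast registration, but treat that
mapping as unverified.
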
I have tried:

 - with and without IOMMU enabled
 - with RoCE v1 and v2
 - with instrumentation: This can happen to any MR at any
   time after any number of uses. It does not appear to be
   "sticky" (i.e., xprtrdma recovery from a memory management
   error clears the problem successfully by releasing the
   MR and allocating a new one).

So it feels like a f/w or driver problem to me, at this point.

--
Chuck Lever