Re: "memory management error" with NFS/RDMA on RoCE

Max Gurtovoy <maxg@xxxxxxxxxxxx> · Tue, 8 Aug 2017 18:45:09 +0300

Hi all,
sorry for the late response.

On 6/27/2017 5:56 PM, Chuck Lever wrote:
Hi Sagi-

On Jun 27, 2017, at 5:28 AM, Sagi Grimberg <sagi@xxxxxxxxxxx> wrote:

While running xfstests on an NFS/RDMA mount, I see this in
the client's /var/log/messages multiple times:
Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
As far as I can tell the client is able to recover and continue
the test. However, this error is not supposed to happen in normal
operation.
This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
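
For reference, the error CQE dump above can be decoded by hand. The
following is a minimal Python sketch, not driver code. It assumes the
64-byte layout of struct mlx5_err_cqe from the mlx5 headers (32 reserved
bytes, srqn, 18 reserved bytes, vendor_err_synd, syndrome,
s_wqe_opcode_qpn, wqe_counter, signature, op_own) and assumes that
dump_cqe prints each word in big-endian (wire) order. The function name
decode_err_cqe and the dictionary keys are made up for illustration.

```python
import struct

# The four rows printed after "dump error cqe" in the log (64 bytes).
DUMP = """
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 08007806 250000cd 024027d3
"""


def decode_err_cqe(dump: str) -> dict:
    """Split a dumped error CQE into its fields (assumed layout, see above)."""
    raw = bytes.fromhex("".join(dump.split()))
    if len(raw) != 64:
        raise ValueError(f"expected 64 bytes, got {len(raw)}")

    # ">" = big-endian, no padding; "x" bytes are the reserved areas.
    (srqn, vendor_err_synd, syndrome, s_wqe_opcode_qpn,
     wqe_counter, signature, op_own) = struct.unpack(">32xI18xBBIHBB", raw)

    return {
        "srqn": srqn,
        "vendor_err_synd": vendor_err_synd,
        "syndrome": syndrome,
        # Assumption: top byte is the send WQE opcode, low 24 bits the QPN.
        "wqe_opcode": s_wqe_opcode_qpn >> 24,
        "qpn": s_wqe_opcode_qpn & 0xFFFFFF,
        "wqe_counter": wqe_counter,
        "signature": signature,
        # Assumption: high nibble is the CQE opcode, bit 0 the owner bit.
        "cqe_opcode": op_own >> 4,
        "owner": op_own & 1,
    }


if __name__ == "__main__":
    for name, value in decode_err_cqe(DUMP).items():
        print(f"{name:16s} 0x{value:x}")
```

Under those assumptions the syndrome and vendor syndrome bytes come out
as 0x06 and 0x78, which lines up with the "(6/0x78)" in the rpcrdma
message, and the WQE opcode byte is 0x25, which should be the UMR opcode
(MLX5_OPCODE_UMR) if the header values are read correctly. That would
point at the fast-registration work request itself as the failing WQE.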

Is this a regression?

I can't answer that question with authority, because I just
started trying out NFS/RDMA on RoCE with mlx5. But Robert has
reported very similar symptoms with iSER on v4.9. It appears
to have been around for a while, if these are the same.

What kernel version are you running?

v4.12-rc2.

FW revision?

12.18.2000

Is the below commit applied?

This commit does not appear to be applied to my kernel.

commit 6e8484c5cf07c7ee632587e98c1a12d319dacb7c
Author: Max Gurtovoy <maxg@xxxxxxxxxxxx>
Date:   Sun May 28 10:53:11 2017 +0300

   RDMA/mlx5: set UMR wqe fence according to HCA cap

   Cache the needed umr_fence and set the wqe ctrl segmennt
   accordingly.

   Signed-off-by: Max Gurtovoy <maxg@xxxxxxxxxxxx>
   Acked-by: Leon Romanovsky <leon@xxxxxxxxxx>
   Reviewed-by: Sagi Grimberg <sagi@xxxxxxxxxxx>
   Signed-off-by: Doug Ledford <dledford@xxxxxxxxxx>

This is the only thing that changed in that area
lately...

Can you try without it?

I haven't tried with it. I can pull it and see if it helps.

Chuck,
any updates from testing with my patch above? (You actually need commit
1410a90ae449061b7e1ae19d275148f36948801b as a precondition.)

I have tried:

- with and without IOMMU enabled
- with RoCE v1 and v2
- with instrumentation:

This can happen to any MR at any time after any number of
uses. It does not appear to be "sticky" (ie, xprtrdma
recovery from a memory management error clears the problem
successfully by releasing the MR and allocating a new one).

I'm not that familiar with the NFS/RDMA I/O path yet, but are you using
remote invalidation from the server side, or do you run local invalidation?
Which side initiates the RDMA_READ/WRITE operations?

So it feels like a f/w or driver problem to me, at this
point.

--
Chuck Lever

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
