Re: "memory management error" with NFS/RDMA on RoCE

Leon Romanovsky <leon@xxxxxxxxxx> · Wed, 5 Jul 2017 18:29:27 +0300

On Wed, Jul 05, 2017 at 10:40:41AM -0400, Chuck Lever wrote:
>
> > On Jun 27, 2017, at 1:36 PM, Leon Romanovsky <leon@xxxxxxxxxx> wrote:
> >
> > On Tue, Jun 27, 2017 at 10:56:29AM -0400, Chuck Lever wrote:
> >> Hi Sagi-
> >>
> >>> On Jun 27, 2017, at 5:28 AM, Sagi Grimberg <sagi@xxxxxxxxxxx> wrote:
> >>>
> >>>
> >>>> While running xfstests on an NFS/RDMA mount, I see this in
> >>>> the client's /var/log/messages multiple times:
> >>>> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
> >>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> >>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> >>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> >>>> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
> >>>> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
> >>>> As far as I can tell the client is able to recover and continue
> >>>> the test. However, this error is not supposed to happen in normal
> >>>> operation.
> >>>> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
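
For reference, the error CQE dump quoted above can be decoded by hand.
The sketch below is only an illustration: it assumes the 64-byte layout
of struct mlx5_err_cqe as I read it in the mlx5 headers (32 reserved
bytes, srqn, 18 reserved bytes, vendor_err_synd, syndrome,
s_wqe_opcode_qpn, wqe_counter, signature, op_own) and that the dump
prints big-endian 32-bit words. The names DUMP and decode_err_cqe are
hypothetical helpers, not part of any driver or tool.

```python
import struct

# The four dump lines from the log: 16 big-endian 32-bit words (64 bytes).
DUMP = """
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 08007806 250000cd 024027d3
"""


def decode_err_cqe(dump: str) -> dict:
    """Decode an mlx5 error CQE dump (layout is an assumption, see above)."""
    words = [int(w, 16) for w in dump.split()]
    if len(words) != 16:
        raise ValueError("expected 16 32-bit words, got %d" % len(words))
    raw = struct.pack(">16I", *words)

    # Assumed field order: rsvd0[32], srqn (be32), rsvd1[18],
    # vendor_err_synd (u8), syndrome (u8), s_wqe_opcode_qpn (be32),
    # wqe_counter (be16), signature (u8), op_own (u8).
    (srqn, vendor_err_synd, syndrome, s_wqe_opcode_qpn,
     wqe_counter, signature, op_own) = struct.unpack(">32xI18xBBIHBB", raw)

    return {
        "srqn": srqn,
        "vendor_err_synd": vendor_err_synd,
        "syndrome": syndrome,
        "wqe_opcode": s_wqe_opcode_qpn >> 24,   # top byte: opcode of the failed WQE
        "qpn": s_wqe_opcode_qpn & 0xFFFFFF,     # low 24 bits: QP number
        "wqe_counter": wqe_counter,
        "signature": signature,
        "cqe_opcode": op_own >> 4,              # top nibble of op_own
        "owner": op_own & 0x1,                  # lowest bit of op_own
    }


if __name__ == "__main__":
    for name, value in decode_err_cqe(DUMP).items():
        print("%-16s 0x%x" % (name, value))
```

Under those assumptions, syndrome 0x06 and vendor_err_synd 0x78 line up
with the "(6/0x78)" in the rpcrdma message, and the failed WQE opcode
comes out as 0x25, which as far as I recall is the UMR opcode in the
mlx5 headers. That would be consistent with the fastreg failing, but
treat the field names as my reading, not as an authoritative decode.
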
> >>>
> >>> Is this a regression?
> >>
> >> I can't answer that question with authority, because I just
> >> started trying out NFS/RDMA on RoCE with mlx5. But Robert has
> >> reported very similar symptoms with iSER on v4.9. It appears
> >> to have been around for a while, if these are the same.
> >>
> >>
> >>> What kernel version are you running?
> >>
> >> v4.12-rc2.
> >>
> >>
> >>> FW revision?
> >>
> >> 12.18.2000
> >>
> >>
> >>> Is the below commit applied?
> >>
> >> This commit does not appear to be applied to my kernel.
> >>
> >>
> >>> commit 6e8484c5cf07c7ee632587e98c1a12d319dacb7c
> >>> Author: Max Gurtovoy <maxg@xxxxxxxxxxxx>
> >>> Date:   Sun May 28 10:53:11 2017 +0300
> >>>
> >>>   RDMA/mlx5: set UMR wqe fence according to HCA cap
> >>>
> >>>   Cache the needed umr_fence and set the wqe ctrl segment
> >>>   accordingly.
> >>>
> >>>   Signed-off-by: Max Gurtovoy <maxg@xxxxxxxxxxxx>
> >>>   Acked-by: Leon Romanovsky <leon@xxxxxxxxxx>
> >>>   Reviewed-by: Sagi Grimberg <sagi@xxxxxxxxxxx>
> >>>   Signed-off-by: Doug Ledford <dledford@xxxxxxxxxx>
> >>>
> >>> This is the only thing that changed in that area
> >>> lately...
> >>>
> >>> Can you try without it?
> >>
> >> I haven't tried with it. I can pull it and see if it helps.
> >>
> >> I have tried:
> >>
> >> - with and without IOMMU enabled
> >> - with RoCE v1 and v2
> >> - with instrumentation:
> >>
> >> This can happen to any MR at any time after any number of
> >> uses. It does not appear to be "sticky" (ie, xprtrdma
> >> recovery from a memory management error clears the problem
> >> successfully by releasing the MR and allocating a new one).
> >>
> >> So it feels like a f/w or driver problem to me, at this
> >> point.
> >
> > Jack and I discussed your issue and we have a strong
> > feeling that it is FW.
>
> Hi Leon-
>
> Who is going to drive this issue to resolution? Do you need me
> to do something?

I don't think so, Jack was supposed to do it.

>
>
> > Thanks
> >
> >>
> >> --
> >> Chuck Lever
> >>
> >>
> >>
>
> --
> Chuck Lever
>
>
>