Re: "memory management error" with NFS/RDMA on RoCE

On Tue, Jun 27, 2017 at 10:56:29AM -0400, Chuck Lever wrote:
> Hi Sagi-
>
> > On Jun 27, 2017, at 5:28 AM, Sagi Grimberg <sagi@xxxxxxxxxxx> wrote:
> >
> >
> >> While running xfstests on an NFS/RDMA mount, I see this in
> >> the client's /var/log/messages multiple times:
> >> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
> >> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> >> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> >> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> >> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
> >> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
> >> As far as I can tell the client is able to recover and continue
> >> the test. However, this error is not supposed to happen in normal
> >> operation.
> >> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
> >
> > Is this a regression?
>
> I can't answer that question with authority, because I just
> started trying out NFS/RDMA on RoCE with mlx5. But Robert has
> reported very similar symptoms with iSER on v4.9. It appears
> to have been around for a while, if these are the same.
>
>
> > What kernel version are you running?
>
> v4.12-rc2.
>
>
> > FW revision?
>
> 12.18.2000
>
>
> > Is the below commit applied?
>
> This commit does not appear to be applied to my kernel.
>
>
> > commit 6e8484c5cf07c7ee632587e98c1a12d319dacb7c
> > Author: Max Gurtovoy <maxg@xxxxxxxxxxxx>
> > Date:   Sun May 28 10:53:11 2017 +0300
> >
> >    RDMA/mlx5: set UMR wqe fence according to HCA cap
> >
> >    Cache the needed umr_fence and set the wqe ctrl segment
> >    accordingly.
> >
> >    Signed-off-by: Max Gurtovoy <maxg@xxxxxxxxxxxx>
> >    Acked-by: Leon Romanovsky <leon@xxxxxxxxxx>
> >    Reviewed-by: Sagi Grimberg <sagi@xxxxxxxxxxx>
> >    Signed-off-by: Doug Ledford <dledford@xxxxxxxxxx>
> >
> > This is the only thing that changed in that area
> > lately...
> >
> > Can you try without it?
>
> I haven't tried with it. I can pull it and see if it helps.
>
> I have tried:
>
> - with and without IOMMU enabled
> - with RoCE v1 and v2
> - with instrumentation:
>
> This can happen to any MR at any time after any number of
> uses. It does not appear to be "sticky" (ie, xprtrdma
> recovery from a memory management error clears the problem
> successfully by releasing the MR and allocating a new one).
>
> So it feels like a f/w or driver problem to me, at this
> point.

Jack and I discussed your issue, and we have a strong feeling
that it is a FW problem.
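
For reference, here is a minimal sketch of how the last row of the
quoted "dump error cqe" output can be decoded by hand. It assumes the
64-byte error CQE layout as I recall it from struct mlx5_err_cqe
(vendor_err_synd at byte 54, syndrome at byte 55, s_wqe_opcode_qpn at
bytes 56-59, wqe_counter at bytes 60-61, op_own at byte 63) and that
each printed word is a big-endian 32-bit value. The function name and
the dictionary keys are only illustrative; check the offsets against
the headers of the kernel in use.

```python
import struct

# The four rows printed after "dump error cqe" in the log quoted above.
DUMP = """\
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 08007806 250000cd 024027d3
"""


def decode_err_cqe(dump):
    """Decode a 64-byte mlx5 error CQE dump (hypothetical helper)."""
    # Rebuild the raw CQE bytes from the sixteen printed 32-bit words.
    words = [int(w, 16) for w in dump.split()]
    raw = struct.pack(">16I", *words)

    # Offsets below are an assumption about struct mlx5_err_cqe.
    (s_wqe_opcode_qpn,) = struct.unpack_from(">I", raw, 56)
    (wqe_counter,) = struct.unpack_from(">H", raw, 60)
    return {
        "vendor_err_synd": raw[54],
        "syndrome": raw[55],
        "wqe_opcode": s_wqe_opcode_qpn >> 24,   # top byte of the word
        "qpn": s_wqe_opcode_qpn & 0xFFFFFF,     # low 24 bits
        "wqe_counter": wqe_counter,
        "cqe_opcode": raw[63] >> 4,             # high nibble of op_own
    }


cqe = decode_err_cqe(DUMP)
print("status %d/0x%02x" % (cqe["syndrome"], cqe["vendor_err_synd"]))
print("wqe opcode 0x%02x, qpn 0x%x" % (cqe["wqe_opcode"], cqe["qpn"]))
```

Under these assumptions the decoded syndrome and vendor syndrome agree
with the "(6/0x78)" that rpcrdma prints. The WQE opcode field comes
out as 0x25, which I believe is the UMR opcode in mlx5 and would be
consistent with a fast registration failing.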

Thanks

>
> --
> Chuck Lever
