On 2/15/2017 5:38 PM, Sagi Grimberg wrote:
Tests have shown that the following error message is reported when
using SG-GAPS registration with an mlx5 adapter:
scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff880bd4270eb0
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 0f007806 2500002a ad9fafd1
scsi host1: ib_srp: reconnect succeeded
mlx5_0:dump_cqe:262:(pid 7369): dump error cqe
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 0f007806 25000032 00105dd0
scsi host1: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff880b92860138
Hence avoid using SG-GAPS memory registrations. Additionally,
always configure blk_queue_virt_boundary() to avoid triggering
a mapping failure when using adapters that support SG-GAPS (e.g.
mlx5).
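For readers unfamiliar with the constraint that a virt boundary imposes, here is a minimal userspace sketch of the idea: with a boundary mask set, every segment except the first should start on the boundary and every segment except the last should end on it. The struct seg type and the has_sg_gap() helper are hypothetical illustrations, not kernel APIs, and the block layer's real check operates on bio vectors rather than on plain address arrays.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one scatterlist element. */
struct seg {
    uint64_t addr;
    uint32_t len;
};

/*
 * Return true if the segments contain a gap with respect to the
 * boundary mask (e.g. 4095 for a 4 KiB boundary): a segment other
 * than the first that does not start on the boundary, or a segment
 * other than the last that does not end on it.
 */
static bool has_sg_gap(const struct seg *s, size_t n, uint64_t mask)
{
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && (s[i].addr & mask))
            return true;
        if (i + 1 < n && ((s[i].addr + s[i].len) & mask))
            return true;
    }
    return false;
}
```

A list such as {0x1200, 0xe00}, {0x2000, 0x800} has no gap under a 4 KiB mask, while {0x1000, 0x800}, {0x2000, 0x1000} does, because the first segment ends at 0x1800.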
Hi Guys,
Sorry for addressing this late, but has this failure been investigated?
Max, Israel, what does this error syndrome map to?
Sagi,
this syndrome indicates that the number of KLMs to write is larger than
the number of MTTs.
Artemy started investigating it and proposed a solution that was tested
by Laurence.
Let's see if your fix will help.
Looking at mlx5_ib_sg_to_klms(), I think mr->ibmr.length is incremented
incorrectly: the bcount of the first entry subtracts sg_offset, but the
length does not. Does the following change fix the problem?
--
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 8f608debe141..c21c9eee37f6 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1832,7 +1832,7 @@ mlx5_ib_sg_to_klms(struct mlx5_ib_mr *mr,
 		klms[i].va = cpu_to_be64(sg_dma_address(sg) + sg_offset);
 		klms[i].bcount = cpu_to_be32(sg_dma_len(sg) - sg_offset);
 		klms[i].key = cpu_to_be32(lkey);
-		mr->ibmr.length += sg_dma_len(sg);
+		mr->ibmr.length += sg_dma_len(sg) - sg_offset;
 		sg_offset = 0;
 	}
--
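To make the accounting difference in the proposed change concrete, here is a small userspace model of the loop. It is only a sketch: struct sg_entry, struct klm_entry, fill_klms() and sum_bcount() are hypothetical stand-ins for the kernel structures, byte-order conversion is omitted, and the fixed flag simply selects between the pre-patch and patched length computation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one DMA-mapped scatterlist entry. */
struct sg_entry {
    uint64_t dma_addr;
    uint32_t dma_len;
};

/* Hypothetical stand-in for one KLM descriptor (host byte order). */
struct klm_entry {
    uint64_t va;
    uint32_t bcount;
};

/*
 * Fill klms[] from sg[] and return the accumulated length.
 * sg_offset applies to the first entry only. With fixed == 0 the
 * pre-patch accounting is used (full entry length); otherwise the
 * patched accounting is used (entry length minus offset).
 */
static uint64_t fill_klms(const struct sg_entry *sg, size_t n,
                          uint32_t sg_offset, struct klm_entry *klms,
                          int fixed)
{
    uint64_t length = 0;

    assert(n == 0 || sg_offset <= sg[0].dma_len);
    for (size_t i = 0; i < n; i++) {
        klms[i].va = sg[i].dma_addr + sg_offset;
        klms[i].bcount = sg[i].dma_len - sg_offset;
        length += fixed ? sg[i].dma_len - sg_offset : sg[i].dma_len;
        sg_offset = 0;
    }
    return length;
}

/* Sum of the byte counts actually described by the descriptors. */
static uint64_t sum_bcount(const struct klm_entry *klms, size_t n)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += klms[i].bcount;
    return sum;
}
```

With two 4096-byte entries and a 512-byte offset into the first, the pre-patch accounting yields 8192 bytes while the descriptors cover only 7680; the patched accounting yields 7680 and matches the descriptors.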