On 4/21/2020 3:20 PM, Christoph Hellwig wrote:
On Fri, Mar 27, 2020 at 08:15:36PM +0300, Max Gurtovoy wrote:
For capable HCAs (e.g. ConnectX-5/ConnectX-6) this will allow end-to-end
protection information passthrough and validation for NVMe over RDMA
transport. Metadata offload support was implemented over the new RDMA
signature verbs API and it is enabled per controller by using nvme-cli.
usage example:
nvme connect --pi_enable --transport=rdma --traddr=10.0.1.1 --nqn=test-nvme
Signed-off-by: Max Gurtovoy <maxg@xxxxxxxxxxxx>
Signed-off-by: Israel Rukshin <israelr@xxxxxxxxxxxx>
---
drivers/nvme/host/rdma.c | 330 ++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 296 insertions(+), 34 deletions(-)
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index e38f8f7..23cc77e 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -67,6 +67,9 @@ struct nvme_rdma_request {
struct ib_cqe reg_cqe;
struct nvme_rdma_queue *queue;
struct nvme_rdma_sgl data_sgl;
+ /* Metadata (T10-PI) support */
+ struct nvme_rdma_sgl *md_sgl;
+ bool use_md;
Do we need a use_md flag vs just using md_sgl as a boolean and/or
using blk_integrity_rq?
md_sgl will be used if we get a blk request with blk_integrity
(memory domain).
use_md is responsible for the wire domain.
So instead of this bool we could do the following check everywhere
(after the previous commit's changes):
"
if (queue->pi_support && nvme_ns_has_pi(ns))
req->use_md = c.common.opcode == nvme_cmd_write ||
c.common.opcode == nvme_cmd_read;
"
And this is less readable IMO.
enum nvme_rdma_queue_flags {
@@ -88,6 +91,7 @@ struct nvme_rdma_queue {
struct rdma_cm_id *cm_id;
int cm_error;
struct completion cm_done;
+ bool pi_support;
Why do we need this on a per-queue basis vs always checking the
controller?
To distinguish between admin and IO queues. I don't want to allocate PI
resources on admin queues, and I prefer not to check (idx &&
ctrl->pi_support) every time.
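I.e. set it once when the queue is created, roughly like below (just a
sketch, the exact condition may look a bit different in the patch):

	/* allocate PI/signature resources only for I/O queues (idx != 0) */
	queue->pi_support = idx && ctrl->pi_support;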
+ u32 max_page_list_len =
+ pi_support ? ibdev->attrs.max_pi_fast_reg_page_list_len :
+ ibdev->attrs.max_fast_reg_page_list_len;
+
+ return min_t(u32, NVME_RDMA_MAX_SEGMENTS, max_page_list_len - 1);
Can you use a good old if / else here?
sure.
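Something like this (same logic, just written out):

	u32 max_page_list_len;

	if (pi_support)
		max_page_list_len = ibdev->attrs.max_pi_fast_reg_page_list_len;
	else
		max_page_list_len = ibdev->attrs.max_fast_reg_page_list_len;

	return min_t(u32, NVME_RDMA_MAX_SEGMENTS, max_page_list_len - 1);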
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+static void nvme_rdma_set_sig_domain(struct blk_integrity *bi,
+ struct nvme_command *cmd, struct ib_sig_domain *domain,
+ u16 control)
{
+ domain->sig_type = IB_SIG_TYPE_T10_DIF;
+ domain->sig.dif.bg_type = IB_T10DIF_CRC;
+ domain->sig.dif.pi_interval = 1 << bi->interval_exp;
+ domain->sig.dif.ref_tag = le32_to_cpu(cmd->rw.reftag);
/*
+ * At the moment we hard code those, but in the future
+ * we will take them from cmd.
I don't understand this comment. Also it doesn't use up all 80 chars.
It's a copy&paste from iSER.
I'll remove it.
+static void nvme_rdma_set_sig_attrs(struct blk_integrity *bi,
+ struct nvme_command *cmd, struct ib_sig_attrs *sig_attrs)
+{
+ u16 control = le16_to_cpu(cmd->rw.control);
+
+ WARN_ON(bi == NULL);
I think this WARN_ON is pointless, as we'll get a NULL pointer dereference
a little later anyway.
I'll remove it.
+mr_put:
+ if (req->use_md)
+ ib_mr_pool_put(queue->qp, &queue->qp->sig_mrs, req->mr);
+ else
+ ib_mr_pool_put(queue->qp, &queue->qp->rdma_mrs, req->mr);
I've seen this pattern a few times, maybe a little helper to return
the right mr pool for a request?
Yes, I'll add:
static void nvme_rdma_mr_pool_put(struct ib_qp *qp,
struct nvme_rdma_request *req)
{
if (req->use_md)
ib_mr_pool_put(qp, &qp->sig_mrs, req->mr);
else
ib_mr_pool_put(qp, &qp->rdma_mrs, req->mr);
req->mr = NULL;
}
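I can factor the get path the same way, something like (just a sketch,
name and error value are a suggestion):

static int nvme_rdma_mr_pool_get(struct ib_qp *qp,
		struct nvme_rdma_request *req)
{
	/* pick the signature MR pool when the wire domain needs PI */
	if (req->use_md)
		req->mr = ib_mr_pool_get(qp, &qp->sig_mrs);
	else
		req->mr = ib_mr_pool_get(qp, &qp->rdma_mrs);

	return req->mr ? 0 : -EAGAIN;
}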
+ if (blk_integrity_rq(rq)) {
+ memset(req->md_sgl, 0, sizeof(struct nvme_rdma_sgl));
Why do we need this memset?
Just good practice we took from drivers/scsi/scsi_lib.c.
It's not a must, so I can remove it and retest without it if needed.