In some workloads, there is performance degradation when using a KLM mkey
instead of an MTT mkey. This is because KLM descriptor access goes through an
indirection that might require more HW resources and cycles. Using a KLM
descriptor is not necessary when there are no gaps in the data/metadata sg
lists. As an optimization, use an MTT mkey whenever possible. To that end,
allocate an internal MTT mkey and choose the effective pi_mr for the
transaction according to the required mapping scheme.
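To make the "no gaps" condition concrete, here is a rough sketch of what such a check looks like. This helper is hypothetical and not part of the patch (the patch simply attempts the MTT mapping and falls back on failure, as shown below); the name sg_list_is_gapless is invented for illustration.

#include <linux/scatterlist.h>
#include <linux/mm.h>

/*
 * Hypothetical helper, for illustration only: an sg list can be covered
 * by direct MTT (page) descriptors when it has no gaps, i.e. every
 * element except the first starts on a page boundary and every element
 * except the last ends on one.  Otherwise the HW needs the indirection
 * that KLM descriptors provide.
 */
static bool sg_list_is_gapless(struct scatterlist *sgl, int nents)
{
	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		/* interior element that does not start on a page boundary */
		if (i > 0 && (sg->offset & ~PAGE_MASK))
			return false;
		/* interior element that does not end on a page boundary */
		if (i < nents - 1 && ((sg->offset + sg->length) & ~PAGE_MASK))
			return false;
	}
	return true;
}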
You just doubled the number of resources (mrs + page_vectors) allocated for a performance optimization (25% in large writes). I'm asking myself whether that is acceptable. We tend to allocate a lot of those (not to mention the target side). I'm not sure what the correct answer is here; I'm just wondering if this is what we want to do. We have seen people bound by the max_mrs limitation before, and this makes it worse (at least for the PI case). Anyway, I just wanted to raise the concern. You guys are probably a lot more familiar than I am with the usage patterns here and whether this is a real problem or not...
+int mlx5_ib_map_mr_sg_pi(struct ib_mr *ibmr, struct scatterlist *data_sg,
+			 int data_sg_nents, unsigned int *data_sg_offset,
+			 struct scatterlist *meta_sg, int meta_sg_nents,
+			 unsigned int *meta_sg_offset)
+{
+	struct mlx5_ib_mr *mr = to_mmr(ibmr);
+	struct mlx5_ib_mr *pi_mr = mr->mtt_mr;
+	int n;
+
+	WARN_ON(ibmr->type != IB_MR_TYPE_INTEGRITY);
+
+	/*
+	 * As a performance optimization, if possible, there is no need to map
+	 * the sg lists to KLM descriptors. First try to map the sg lists to
+	 * MTT descriptors and fallback to KLM only in case of a failure.
+	 * It's more efficient for the HW to work with MTT descriptors
+	 * (especially in high load).
+	 * Use KLM (indirect access) only if it's mandatory.
+	 */
+	n = mlx5_ib_map_mtt_mr_sg_pi(ibmr, data_sg, data_sg_nents,
+				     data_sg_offset, meta_sg, meta_sg_nents,
+				     meta_sg_offset);
+	if (n == data_sg_nents + meta_sg_nents)
+		goto out;
+
+	pi_mr = mr->klm_mr;
+	n = mlx5_ib_map_klm_mr_sg_pi(ibmr, data_sg, data_sg_nents,
+				     data_sg_offset, meta_sg, meta_sg_nents,
+				     meta_sg_offset);
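The quoted hunk ends mid-function. For context, a continuation along the following lines would complete the flow the commit message describes: record which internal MR actually holds the mapping so the posting path can use it. This is a hypothetical sketch, not the patch itself; the error code, the out label handling, and the mr->pi_mr field name are assumptions.

	/*
	 * Hypothetical continuation, for illustration only: if even the
	 * KLM mapping could not cover all entries, report failure;
	 * otherwise remember which internal MR (MTT or KLM) is the
	 * effective pi_mr for this transaction.
	 */
	if (unlikely(n != data_sg_nents + meta_sg_nents))
		return -ENOMEM;

out:
	mr->pi_mr = pi_mr;	/* assumed field name */
	return n;
}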
Does this have any impact when all your I/O is gappy? IIRC it was fairly easy to simulate that by running small-block-size sequential I/O through an I/O scheduler (to a real device). It would be interesting to measure the impact of the fallback, although I don't have any better suggestion other than signaling to the application that you always want I/O without gaps (which poses a different limitation)...