On 6/6/2019 1:52 AM, Sagi Grimberg wrote:
In some loads, there is performance degradation when using a KLM mkey
instead of an MTT mkey. This is because KLM descriptor access is via
indirection that might require more HW resources and cycles.
Using KLM descriptors is not necessary when there are no gaps in the
data/metadata sg lists. As an optimization, use an MTT mkey whenever
possible. For that matter, allocate an internal MTT mkey and choose the
effective pi_mr for the transaction according to the required mapping
scheme.
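
A minimal sketch of the allocation side this describes, assuming an
up-front allocation of both internal mkeys; this is NOT the patch
itself. The klm_mr/mtt_mr names match the hunk quoted later in this
mail, but alloc_internal_pi_mr(), free_internal_pi_mr() and the
pi_mr_layout enum are hypothetical stand-ins for the driver's real
allocation path.

#include <linux/err.h>
#include <rdma/ib_verbs.h>

enum pi_mr_layout {
	PI_MR_LAYOUT_MTT,	/* direct page list, cheaper for the HW */
	PI_MR_LAYOUT_KLM,	/* indirect descriptors, handle sg gaps */
};

struct mlx5_ib_mr;	/* opaque for the purpose of this sketch */

/* hypothetical helpers, standing in for the real allocation code */
struct mlx5_ib_mr *alloc_internal_pi_mr(struct ib_pd *pd, int max_num_sg,
					int max_num_meta_sg,
					enum pi_mr_layout layout);
void free_internal_pi_mr(struct mlx5_ib_mr *mr);

/*
 * Allocate both internal mkeys up front so the per-I/O choice between
 * them costs nothing and needs no extra invalidation.
 */
static int alloc_integrity_mkeys(struct ib_pd *pd, int max_num_sg,
				 int max_num_meta_sg,
				 struct mlx5_ib_mr **mtt_mr,
				 struct mlx5_ib_mr **klm_mr)
{
	*mtt_mr = alloc_internal_pi_mr(pd, max_num_sg, max_num_meta_sg,
				       PI_MR_LAYOUT_MTT);
	if (IS_ERR(*mtt_mr))
		return PTR_ERR(*mtt_mr);

	*klm_mr = alloc_internal_pi_mr(pd, max_num_sg, max_num_meta_sg,
				       PI_MR_LAYOUT_KLM);
	if (IS_ERR(*klm_mr)) {
		free_internal_pi_mr(*mtt_mr);
		return PTR_ERR(*klm_mr);
	}
	return 0;
}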
You just doubled the number of resources (MRs + page_vectors) allocated
for a performance optimization (25% in large writes). I'm asking myself
if that is acceptable. We tend to allocate a lot of those (not to
mention the target side).
We're using the same number of mkeys as before, three per integrity MR
in both cases (sig + data + meta mkeys vs. sig + internal_klm +
internal_mtt mkeys). And we save the invalidations of the internal
mkeys.
I'm not sure what the correct answer is here; I'm just wondering if this
is what we want to do. We have seen people bound by the max_mrs
limitation before, and this is making it worse (at least for the PI
case).
It's not (see above). The mkey limitation is mostly with older HCAs
that are not signature capable.
Anyway, just wanted to raise the concern. You guys are probably a lot
more familiar than I am with the usage patterns here and whether this is
a real problem or not...
+int mlx5_ib_map_mr_sg_pi(struct ib_mr *ibmr, struct scatterlist *data_sg,
+			 int data_sg_nents, unsigned int *data_sg_offset,
+			 struct scatterlist *meta_sg, int meta_sg_nents,
+			 unsigned int *meta_sg_offset)
+{
+	struct mlx5_ib_mr *mr = to_mmr(ibmr);
+	struct mlx5_ib_mr *pi_mr = mr->mtt_mr;
+	int n;
+
+	WARN_ON(ibmr->type != IB_MR_TYPE_INTEGRITY);
+
+	/*
+	 * As a performance optimization, if possible, there is no need to map
+	 * the sg lists to KLM descriptors. First try to map the sg lists to
+	 * MTT descriptors and fall back to KLM only in case of a failure.
+	 * It's more efficient for the HW to work with MTT descriptors
+	 * (especially in high load).
+	 * Use KLM (indirect access) only if it's mandatory.
+	 */
+	n = mlx5_ib_map_mtt_mr_sg_pi(ibmr, data_sg, data_sg_nents,
+				     data_sg_offset, meta_sg, meta_sg_nents,
+				     meta_sg_offset);
+	if (n == data_sg_nents + meta_sg_nents)
+		goto out;
+
+	pi_mr = mr->klm_mr;
+	n = mlx5_ib_map_klm_mr_sg_pi(ibmr, data_sg, data_sg_nents,
+				     data_sg_offset, meta_sg, meta_sg_nents,
+				     meta_sg_offset);
Does this have any impact when all your I/O is gappy?
You mean trying to do an MTT mapping and failing? I think it's a
non-issue.
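
As a side note on what "gappy" means here, a hedged sketch
(sg_list_has_gaps() is a hypothetical helper, not from this series): an
sg list maps cleanly to MTT (page list) descriptors only if every entry
except the first starts on a page boundary and every entry except the
last ends on one; any interior hole forces the indirect KLM layout.

#include <linux/mm.h>
#include <linux/scatterlist.h>

static bool sg_list_has_gaps(struct scatterlist *sgl, int nents)
{
	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		/* any entry but the first starting mid-page is a gap */
		if (i && sg->offset)
			return true;
		/* any entry but the last ending mid-page is a gap */
		if (i != nents - 1 &&
		    ((sg->offset + sg->length) & ~PAGE_MASK))
			return true;
	}
	return false;
}

A pre-check like this could avoid the failed MTT attempt, at the cost of
an extra pass over the list; the patch instead just tries MTT first.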
IIRC it was fairly easy to simulate that by running small block size
sequential I/O with an I/O scheduler (to a real device).
It would be interesting to measure the impact of the fallback.
I hope I'll have some spare time to add a flag to fio that will always
issue gappy I/O...
Maybe when I add an optimization to the SG_GAP MR (and don't bind it
to KLM as it is now) and add it to the NVMeoF/RDMA initiator stack.
Although I don't have any better suggestion other than signaling to
the application that you always want I/O without gaps (which poses
a different limitation)...
We can deal with gappy and non-gappy I/O, so the application can send
whatever makes it run fast.
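
For completeness, a hedged sketch of the caller's view through the
generic verb: ib_map_mr_sg_pi() is the verbs entry point that dispatches
to the driver hook quoted above, while map_pi_request() and its error
policy are illustrative assumptions, not code from this series.

#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <rdma/ib_verbs.h>

/*
 * Map data + metadata sg lists onto one integrity MR. The caller never
 * sees whether the device picked the MTT or the KLM internal mkey.
 */
static int map_pi_request(struct ib_mr *mr,
			  struct scatterlist *data_sg, int data_nents,
			  struct scatterlist *meta_sg, int meta_nents)
{
	int n;

	n = ib_map_mr_sg_pi(mr, data_sg, data_nents, NULL,
			    meta_sg, meta_nents, NULL, PAGE_SIZE);
	if (n < data_nents + meta_nents)
		return n < 0 ? n : -EINVAL;
	return 0;
}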