In some workloads, there is performance degradation when using a KLM mkey
instead of an MTT mkey. This is because KLM descriptor access goes through an
indirection that might require more HW resources and cycles. Using a KLM
descriptor is not necessary when there are no gaps in the data/metadata sg
lists. As an optimization, use an MTT mkey whenever possible. To that end,
allocate an internal MTT mkey and choose the effective pi_mr for the
transaction according to the required mapping scheme.
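To make the "no gaps" condition concrete, here is a rough sketch of what such a check looks like. This helper is hypothetical and not part of the patch (the patch simply attempts the MTT mapping and falls back on failure, as shown below); the name sg_list_is_gapless is invented for illustration.

#include <linux/scatterlist.h>
#include <linux/mm.h>

/*
 * Hypothetical helper, for illustration only: an sg list can be covered
 * by direct MTT (page) descriptors when it has no gaps, i.e. every
 * element except the first starts on a page boundary and every element
 * except the last ends on one.  Otherwise the HW needs the indirection
 * that KLM descriptors provide.
 */
static bool sg_list_is_gapless(struct scatterlist *sgl, int nents)
{
	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		/* interior element that does not start on a page boundary */
		if (i > 0 && (sg->offset & ~PAGE_MASK))
			return false;
		/* interior element that does not end on a page boundary */
		if (i < nents - 1 && ((sg->offset + sg->length) & ~PAGE_MASK))
			return false;
	}
	return true;
}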
You just doubled the number of resources (mrs + page_vectors) allocated for a performance optimization (25% in large writes). I'm asking myself whether that is acceptable. We tend to allocate a lot of those (not to mention the target side). I'm not sure what the correct answer is here; I'm just wondering if this is what we want to do. We have seen people bound by the max_mrs limitation before, and this makes it worse (at least for the PI case). Anyway, I just wanted to raise the concern. You guys are probably a lot more familiar than I am with the usage patterns here and whether this is a real problem or not...
+int mlx5_ib_map_mr_sg_pi(struct ib_mr *ibmr, struct scatterlist *data_sg,
+			 int data_sg_nents, unsigned int *data_sg_offset,
+			 struct scatterlist *meta_sg, int meta_sg_nents,
+			 unsigned int *meta_sg_offset)
+{
+	struct mlx5_ib_mr *mr = to_mmr(ibmr);
+	struct mlx5_ib_mr *pi_mr = mr->mtt_mr;
+	int n;
+
+	WARN_ON(ibmr->type != IB_MR_TYPE_INTEGRITY);
+
+	/*
+	 * As a performance optimization, if possible, there is no need to map
+	 * the sg lists to KLM descriptors. First try to map the sg lists to
+	 * MTT descriptors and fallback to KLM only in case of a failure.
+	 * It's more efficient for the HW to work with MTT descriptors
+	 * (especially in high load).
+	 * Use KLM (indirect access) only if it's mandatory.
+	 */
+	n = mlx5_ib_map_mtt_mr_sg_pi(ibmr, data_sg, data_sg_nents,
+				     data_sg_offset, meta_sg, meta_sg_nents,
+				     meta_sg_offset);
+	if (n == data_sg_nents + meta_sg_nents)
+		goto out;
+
+	pi_mr = mr->klm_mr;
+	n = mlx5_ib_map_klm_mr_sg_pi(ibmr, data_sg, data_sg_nents,
+				     data_sg_offset, meta_sg, meta_sg_nents,
+				     meta_sg_offset);
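The quoted hunk ends mid-function. For context, a continuation along the following lines would complete the flow the commit message describes: record which internal MR actually holds the mapping so the posting path can use it. This is a hypothetical sketch, not the patch itself; the error code, the out label handling, and the mr->pi_mr field name are assumptions.

	/*
	 * Hypothetical continuation, for illustration only: if even the
	 * KLM mapping could not cover all entries, report failure;
	 * otherwise remember which internal MR (MTT or KLM) is the
	 * effective pi_mr for this transaction.
	 */
	if (unlikely(n != data_sg_nents + meta_sg_nents))
		return -ENOMEM;

out:
	mr->pi_mr = pi_mr;	/* assumed field name */
	return n;
}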
Does this have any impact when all your I/O is gappy? IIRC it was fairly easy to simulate that by running small-block-size sequential I/O through an I/O scheduler (to a real device). It would be interesting to measure the impact of the fallback, although I don't have any better suggestion other than signaling to the application that you always want I/O without gaps (which poses a different limitation)...