On Tue, May 12, 2020 at 10:00:59AM +0300, Yamin Friedman wrote: > > On 5/11/2020 3:08 PM, Yamin Friedman wrote: > > > > On 5/11/2020 8:07 AM, Leon Romanovsky wrote: > > > On Sun, May 10, 2020 at 05:55:55PM +0300, Yamin Friedman wrote: > > > > Allow a ULP to ask the core to provide a completion queue based on a > > > > least-used search on a per-device CQ pools. The device CQ pools > > > > grow in a > > > > lazy fashion when more CQs are requested. > > > > > > > > This feature reduces the amount of interrupts when using many QPs. > > > > Using shared CQs allows for more effcient completion handling. It also > > > > reduces the amount of overhead needed for CQ contexts. > > > > > > > > Test setup: > > > > Intel(R) Xeon(R) Platinum 8176M CPU @ 2.10GHz servers. > > > > Running NVMeoF 4KB read IOs over ConnectX-5EX across Spectrum switch. > > > > TX-depth = 32. Number of cores refers to the initiator side. > > > > Four disks are > > > > accessed from each core. In the current case we have four CQs > > > > per core and > > > > in the shared case we have a single CQ per core. Until 14 cores > > > > there is no > > > > significant change in performance and the number of interrupts > > > > per second > > > > is less than a million in the current case. > > > > ================================================== > > > > |Cores|Current KIOPs |Shared KIOPs |improvement| > > > > |-----|---------------|--------------|-----------| > > > > |14 |2188 |2620 |19.7% | > > > > |-----|---------------|--------------|-----------| > > > > |20 |2063 |2308 |11.8% | > > > > |-----|---------------|--------------|-----------| > > > > |28 |1933 |2235 |15.6% | > > > > |================================================= > > > > |Cores|Current avg lat|Shared avg lat|improvement| > > > > |-----|---------------|--------------|-----------| > > > > |14 |817us |683us |16.4% | > > > > |-----|---------------|--------------|-----------| > > > > |20 |1239us |1108us |10.6% | > > > > |-----|---------------|--------------|-----------| > > > > |28 |1852us |1601us |13.5% | > > > > ======================================================== > > > > |Cores|Current interrupts|Shared interrupts|improvement| > > > > |-----|------------------|-----------------|-----------| > > > > |14 |2131K/sec |425K/sec |80% | > > > > |-----|------------------|-----------------|-----------| > > > > |20 |2267K/sec |594K/sec |73.8% | > > > > |-----|------------------|-----------------|-----------| > > > > |28 |2370K/sec |1057K/sec |55.3% | > > > > ==================================================================== > > > > |Cores|Current 99.99th PCTL lat|Shared 99.99th PCTL lat|improvement| > > > > |-----|------------------------|-----------------------|-----------| > > > > |14 |85Kus |9Kus |88% | > > > > |-----|------------------------|-----------------------|-----------| > > > > |20 |6Kus |5.3Kus |14.6% | > > > > |-----|------------------------|-----------------------|-----------| > > > > |28 |11.6Kus |9.5Kus |18% | > > > > |=================================================================== > > > > > > > > Performance improvement with 16 disks (16 CQs per core) is comparable. > > > > > > > > Signed-off-by: Yamin Friedman <yaminf@xxxxxxxxxxxx> > > > > Reviewed-by: Or Gerlitz <ogerlitz@xxxxxxxxxxxx> > > > > --- > > > > drivers/infiniband/core/core_priv.h | 8 ++ > > > > drivers/infiniband/core/cq.c | 145 > > > > ++++++++++++++++++++++++++++++++++++ > > > > drivers/infiniband/core/device.c | 3 +- > > > > include/rdma/ib_verbs.h | 32 ++++++++ > > > > 4 files changed, 187 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/drivers/infiniband/core/core_priv.h > > > > b/drivers/infiniband/core/core_priv.h > > > > index cf42acc..7fe9c13 100644 > > > > --- a/drivers/infiniband/core/core_priv.h > > > > +++ b/drivers/infiniband/core/core_priv.h > > > > @@ -191,6 +191,14 @@ static inline bool > > > > rdma_is_upper_dev_rcu(struct net_device *dev, > > > > return netdev_has_upper_dev_all_rcu(dev, upper); > > > > } > > > > > > > > +struct ib_cq *ib_cq_pool_get(struct ib_device *dev, unsigned > > > > int nr_cqe, > > > > + int cpu_hint, enum ib_poll_context poll_ctx); > > > > +void ib_cq_pool_put(struct ib_cq *cq, unsigned int nr_cqe); > > > > + > > > > +void ib_init_cq_pools(struct ib_device *dev); > > > > + > > > > +void ib_purge_cq_pools(struct ib_device *dev); > > > I don't know how next patches compile to you, but "core_priv.h" is wrong > > > place to put function declarations. You also put them here and in > > > ib_verbs.h > > > below. > > > > > > Also, it will be nice if you will use same naming convention like in > > > mr_pool.h > > I will remove the use of core_priv and look into refactoring the names. > Init_cq_pools and purge_cq_pools are not exported functions they are for > internal core use, is ib_verbs really the place for them? I don't know, probably better to group them together because you will need to put ib_cq_pool_put/ib_cq_pool_get in the ib_verbs.h anyway. Thanks