On Tue, Jul 23, 2019 at 03:13:37PM -0400, Chuck Lever wrote:
> Send and Receive completion is handled on a single CPU selected at
> the time each Completion Queue is allocated. Typically this is when
> an initiator instantiates an RDMA transport, or when a target
> accepts an RDMA connection.
>
> Some ULPs cannot open a connection per CPU to spread completion
> workload across available CPUs. For these ULPs, allow the RDMA core
> to select a completion vector based on the device's complement of
> available comp_vecs.
>
> When a ULP elects to use RDMA_CORE_ANY_COMPVEC, if multiple CPUs are
> available, a different CPU will be selected for each Completion
> Queue. For the moment, a simple round-robin mechanism is used.
>
> Suggested-by: Håkon Bugge <haakon.bugge@xxxxxxxxxx>
> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>

It makes me wonder why we need comp_vector as an argument to
ib_alloc_cq at all. From what I see, callers either internally
implement logic similar to what is proposed here, or they don't
care (and set 0).

Can we enable this comp_vector selection for everyone and simplify
our API?

> ---
>  drivers/infiniband/core/cq.c             | 20 +++++++++++++++++++-
>  include/rdma/ib_verbs.h                  |  3 +++
>  net/sunrpc/xprtrdma/svc_rdma_transport.c |  6 ++++--
>  net/sunrpc/xprtrdma/verbs.c              |  5 ++---
>  4 files changed, 28 insertions(+), 6 deletions(-)
>
> Jason-
>
> If this patch is acceptable to all, then I would expect you to take
> it through the RDMA tree.
>
>
> diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
> index 7c599878ccf7..a89d549490c4 100644
> --- a/drivers/infiniband/core/cq.c
> +++ b/drivers/infiniband/core/cq.c
> @@ -165,12 +165,27 @@ static void ib_cq_completion_workqueue(struct ib_cq *cq, void *private)
>  	queue_work(cq->comp_wq, &cq->work);
>  }
>
> +/*
> + * Attempt to spread ULP completion queues over a device's completion
> + * vectors so that all available CPU cores can help service the device's
> + * interrupt workload. This mechanism may be improved at a later point
> + * to dynamically take into account the system's actual workload.
> + */
> +static int ib_get_comp_vector(struct ib_device *dev)
> +{
> +	static atomic_t cv;
> +
> +	if (dev->num_comp_vectors > 1)
> +		return atomic_inc_return(&cv) % dev->num_comp_vectors;

It is worth taking num_online_cpus() into account here as well.

Thanks