On Mon, Jul 29, 2019 at 10:24:12AM -0400, Chuck Lever wrote:
>
>
> > On Jul 29, 2019, at 1:49 AM, Leon Romanovsky <leon@xxxxxxxxxx> wrote:
> >
> > On Sun, Jul 28, 2019 at 12:30:27PM -0400, Chuck Lever wrote:
> >> Send and Receive completion is handled on a single CPU selected at
> >> the time each Completion Queue is allocated. Typically this is when
> >> an initiator instantiates an RDMA transport, or when a target
> >> accepts an RDMA connection.
> >>
> >> Some ULPs cannot open a connection per CPU to spread completion
> >> workload across available CPUs and MSI vectors. For such ULPs,
> >> provide an API that allows the RDMA core to select a completion
> >> vector based on the device's complement of available comp_vecs.
> >>
> >> ULPs that invoke ib_alloc_cq() with only comp_vector 0 are converted
> >> to use the new API so that their completion workloads interfere less
> >> with each other.
> >>
> >> Suggested-by: Håkon Bugge <haakon.bugge@xxxxxxxxxx>
> >> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
> >> Cc: <linux-cifs@xxxxxxxxxxxxxxx>
> >> Cc: <v9fs-developer@xxxxxxxxxxxxxxxxxxxxx>
> >> ---
> >>  drivers/infiniband/core/cq.c             | 29 +++++++++++++++++++++++++++++
> >>  drivers/infiniband/ulp/srpt/ib_srpt.c    |  4 ++--
> >>  fs/cifs/smbdirect.c                      | 10 ++++++----
> >>  include/rdma/ib_verbs.h                  | 19 +++++++++++++++++++
> >>  net/9p/trans_rdma.c                      |  6 +++---
> >>  net/sunrpc/xprtrdma/svc_rdma_transport.c |  8 ++++----
> >>  net/sunrpc/xprtrdma/verbs.c              | 13 ++++++-------
> >>  7 files changed, 69 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
> >> index 7c59987..ea3bb0d 100644
> >> --- a/drivers/infiniband/core/cq.c
> >> +++ b/drivers/infiniband/core/cq.c
> >> @@ -253,6 +253,35 @@ struct ib_cq *__ib_alloc_cq_user(struct ib_device *dev, void *private,
> >>  EXPORT_SYMBOL(__ib_alloc_cq_user);
> >>
> >>  /**
> >> + * __ib_alloc_cq_any - allocate a completion queue
> >> + * @dev: device to allocate the CQ for
> >> + * @private: driver private data, accessible from cq->cq_context
> >> + * @nr_cqe: number of CQEs to allocate
> >> + * @poll_ctx: context to poll the CQ from.
> >> + * @caller: module owner name.
> >> + *
> >> + * Attempt to spread ULP Completion Queues over each device's interrupt
> >> + * vectors.
> >> + */
> >> +struct ib_cq *__ib_alloc_cq_any(struct ib_device *dev, void *private,
> >> +				int nr_cqe, enum ib_poll_context poll_ctx,
> >> +				const char *caller)
> >> +{
> >> +	static atomic_t counter;
> >> +	int comp_vector;
> >
> > int comp_vector = 0;
> >
> >> +
> >> +	comp_vector = 0;
> >
> > This assignment would be better made part of the initialization.
> >
> >> +	if (dev->num_comp_vectors > 1)
> >> +		comp_vector =
> >> +			atomic_inc_return(&counter) %
> >
> > Don't we need to manage a "free list" of comp_vectors? Otherwise we can
> > find ourselves providing an already "taken" comp_vector.
>
> Many ULPs use only comp_vector 0 today. It is obviously harmless
> to have more than one ULP using the same comp_vector.
>
> The point of this patch is best-effort spreading. This algorithm
> has been proposed repeatedly for several years on this list, and
> each time the consensus has been that it is simple and good enough.

Agreed, it is better than the current implementation.

Reviewed-by: Leon Romanovsky <leonro@xxxxxxxxxxxx>

Thanks
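
For context, the spreading policy under discussion reduces to a shared
atomic counter taken modulo the device's vector count. The standalone C
sketch below models that round-robin selection in userspace;
pick_comp_vector() and NUM_COMP_VECTORS are hypothetical names used only
for illustration and are not part of the kernel API:

#include <stdatomic.h>
#include <stdio.h>

#define NUM_COMP_VECTORS 4	/* hypothetical stand-in for dev->num_comp_vectors */

/*
 * Best-effort round-robin selection modeled on __ib_alloc_cq_any():
 * one counter shared by every allocation, taken modulo the number of
 * completion vectors. Two CQs landing on the same vector is harmless,
 * merely suboptimal, so no "free list" of vectors is tracked.
 */
static int pick_comp_vector(int num_comp_vectors)
{
	static atomic_int counter;

	if (num_comp_vectors <= 1)
		return 0;
	/* atomic_fetch_add() returns the old value; adding 1 mirrors
	 * the kernel's atomic_inc_return() semantics. */
	return (atomic_fetch_add(&counter, 1) + 1) % num_comp_vectors;
}

int main(void)
{
	for (int i = 0; i < 8; i++)
		printf("CQ %d -> comp_vector %d\n",
		       i, pick_comp_vector(NUM_COMP_VECTORS));
	return 0;
}

Successive allocations walk the vectors 1, 2, 3, 0, 1, ... so each new
CQ lands on a different MSI vector without any bookkeeping of which
vectors are in use, which is the trade-off Chuck describes as simple
and good enough.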