Re: Upcoming libibverbs release

Yishai Hadas <yishaih@xxxxxxxxxxxxxxxxxx> · Sun, 26 Jun 2016 19:54:21 +0300

On 6/23/2016 7:51 PM, Doug Ledford wrote:
I'm preparing to make an official release of libibverbs soonish
(thinking by the end of next week).  Even though I took the timestamp v5
patchset, I know there were still comments from Jason on the v4 set that
weren't addressed in v5.  Those can be addressed (and should be) with
incremental patches prior to the official release.  Yishai, can we
please prioritize getting that done?

Listed below the main openings that were raised by Jason and our input 
on, considering their need and priority.

Most of already answered as part of the discussion, will summarize it 
all together to help evaluating the status/next steps.

> I'm still very much of the opinion that extended CQs should only allow
> compatible QP's to be added.
--------------------------------------

Application may create a CQ with large subset of flags that may fit some 
of its attached QP types. Semantically, it is similar to what we have 
today, it allows to put different QP types on the same CQ and by thus 
allow application to have ordered completion events.

We don't see a good reason to change from current behavior and hurt 
applications that expect that valid behavior.

> Add a compatibility layer
-------------------------------

A real application wants to achieve performance, to use a compatibility 
layer over a vendor that doesn’t support the new API, might be a very 
bad idea as of below.

1) The legacy API doesn't support the batching API that was introduced 
by the new API, this will lead to have a Door-Bell/PCI write per CQE as 
internally the compatibility layer will need to call ibv_poll_cq with 
size of 1.

2) There will 2 copies, one from the HW CQE to ibv_wc as part of the 
internal legacy API, second, from the ibv_wc to the read_xxx output, 
comparing the new API.

3) The *non-required* fields will be copied anyway as part of the 
internal legacy ibv_poll_cq from the HW CQE to ibv_wc and hits the 
performance.

In addition,
There is *no* one to one mapping in all the cases between the new API to 
the legacy one, it will lead to a *non* transparent usage of the 
compatibility layer and ends by returning an error to the application in 
few cases.

Specifically,
One of major benefits in the new API is the ability to say as part of CQ 
creation that it's going to be used by only one thread and have lock 
less flow as part of polling the CQ. This is not supported in the legacy 
API which assumes that multi-threaded access might be used.

It makes sense to return an error in the compatibility layer API letting 
application knows that this service is not supported when it was asked 
by the application.

Applications that look for performance expect to work in that single 
threaded mode and might fail as part of using the compatibility layer or 
alternatively have no idea that they didn't get the service which 
assumes to be bad behavior.

In addition, at the moment timestamping can’t be read via the legacy API 
means that the compatibility layer can’t handle it and should supply an 
error for. In the future there may be other fields/flows that can’t be 
supported by the compatibility layer so application that will be written 
over this layer will fail and won’t get the required benefit from using it.

To sum it up:
We don't see a real usage of such a layer, it has many limitations.
In case will be a real request from users, it can always be added in the 
future, minor priority for now.

> Exact set of access flags
------------------------------

Application can ask for subset of the available bits, in addition, it 
has an option to ask for all legacy ones (i.e. IBV_WC_STANDARD_FLAGS) 
for a case that a CQ needs to support all legacy attribute and could be 
attached to both RC and UD QPs.

We didn't find it useful at the moment to add other enums per QP type as 
application can say explicitly what it wants by using the ORed bits and 
even per QP type application may not be always interested in all the 
fields. (e.g. CQ that used for TX only completions in RC is not 
interested in few fields that are relevant for the RX).

To sum it up:
Low priority issue that can be added in the future based on users demand.

> Exact set of accessors
--------------------------------

At the moment the API introduces an option to get each field by a 
separate accessor (i.e ibv_read_wc_xxx), in addition, wr_id and status 
have direct access as application expect to use them in most of their flows.

During the discussion come up some ideas to add some extra accessors per 
QP type (e.g. ibv_wc_read_ud, ibv_wc_read_rc, etc.).

This option makes sense when application really plans to read and use 
all of the potential UD/RC fields, which *can't* be assumed as true in 
the general case.

Even in a UD/RC QP the application may be interested in few of the CQE 
fields while others are not needed.

From bench mark testing, we found that gathering few fields in one 
ibv_wc_read_xxx has low benefit comparing the case that each field is 
read by itself.

In a case that *not* all fields are really needed the extra copy to 
return the redundant fields found to be more expensive comparing the 
ibv_wc_read_xxx when needed.

We can think of an application that may be interested in subset of 
fields per QP type or per send/receive flows to be gathered in one call, 
this ability can always be added in the future as the API cleanly
enables extensions.

If you still think that adding some extra ibv_wc_read_xxx groups from 
day one is really needed, let's define it and add now by an incremental 
patch. Till now didn't see any user specific request for.

> Minor coding issues and micro optimizations
------------------------------------------------

Not sure what exactly that point refers to from API point of view, 
assuming that it relates to few comments given on libmlx5 as of below.

> +static inline uint32_t mlx5_cq_read_wc_qp_num(struct ibv_cq_ex *ibcq)
> +{
> +	struct mlx5_cq *cq = to_mcq(ibv_cq_ex_to_cq(ibcq));

> You might want to look at having the caller pass in 'cq' from a member
> in ibv_cq_ex. That way the caller can hold the cq in a register
> instead of constantly reloading it like this - I'm assuming to_mcq is
> not just an container_of??

The assumption is *incorrect*, the internal access from ibv_cq_ex to 
driver CQ is done via offsetof/container_of which is done in compile 
time, no need to have some change from the clean usage of the API to 
that mode.

> +static inline uint32_t mlx5_cq_read_wc_qp_num(struct ibv_cq_ex
> *ibcq)
> You need to go through everything you've written and make sure it is
> const-correct.
> You should also annotate with restrict.
> Doing this properly can increase performance by allowing the caller to
> avoid reloads.

From performance point of view we don't expect a change here, we 
followed the created assembly code and saw no difference even when it 
was annotated with 'restrict'.

In addition, please note the 'restrict' notation was only became 
standard as part of C99, in previous compilers it may not work at all or 
requires different key word to be defined as part of the CONFIG step.

We can go ahead and define the ibv_wc_read_xxx with const qualifier on 
its input struct ibv_cq_ex *ibcq in some extra incremental patch if 
makes sense.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html