Re: [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb

On 02/03/2016 21:51, Jason Gunthorpe wrote:
On Wed, Mar 02, 2016 at 01:08:30PM -0600, Christoph Lameter wrote:
On Wed, 2 Mar 2016, Jason Gunthorpe wrote:

So all this ugly API to minimize cache line usage has no measured
performance gain?

We have seen an increased cacheline footprint adding ~100-200ns to receive
latencies during loops. This does not show up in synthetic loads that do
not do much processing since their cache footprint is minimal.

I know you've seen effects..

But apparently the patch authors haven't.

Does the opaque pointer guarantee an aligned access? Who allocates the
space for the vendor's CQE? Any concrete example?
One of the problems here is that CQEs could be formatted as -
"if QP type is y, then copy field z from o". Doing it this way may
result in evaluating the same "if" multiple times. The current approach can
still avoid the memcpy and write straight to the user's buffer.

No, none of that...

struct ibv_cq
{
     int (*read_next_cq)(struct ibv_cq *cq, struct common_wr *res);
     int (*read_address)(struct ibv_cq *cq, struct wr_address *res);
     uint64_t (*read_timestamp)(struct ibv_cq *cq);
     uint32_t (*read_immediate_data)(struct ibv_cq *cq);
     void (*read_something_else)(struct ibv_cq *cq, struct something_else *res);
};

Argh. You are requiring multiple indirect function calls
to retrieve the same information, and therefore significantly increase
latency.

It depends. These are unconditional branches, and they elide a lot of
other code and conditional branches by their design.


Our performance team and I measured the new ibv_poll_cq_ex against the old poll_cq implementation, counting the number of cycles around the poll_cq call. We saw an increase of ~15 cycles for every CQE, which is about 2% over the regular poll_cq (which is not extensible). We used ib_write_lat for these measurements.
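For reference, a minimal sketch of this kind of measurement (illustrative only - it assumes an already-created CQ with traffic flowing and an x86 TSC for cycle counts; this is not our actual benchmark code):

#include <stdint.h>
#include <x86intrin.h>
#include <infiniband/verbs.h>

/* Average cycles per successfully polled CQE. */
static uint64_t avg_poll_cycles(struct ibv_cq *cq, int wanted)
{
        struct ibv_wc wc;
        uint64_t total = 0;
        int polled = 0;

        while (polled < wanted) {
                uint64_t start = __rdtsc();
                int n = ibv_poll_cq(cq, 1, &wc);
                uint64_t delta = __rdtsc() - start;

                if (n > 0) {            /* count only non-empty polls */
                        total += delta;
                        polled++;
                }
        }
        return total / wanted;
}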

Pinpointing the source of this increase, we found that it comes from the function pointer to our internal poll_one routine. Our poll_cq_ex implementation uses a per-CQ poll_one_ex function, which is *almost* tailored to the fields the user requested. Using a single static poll_one would introduce a lot of conditional branches and decrease performance. In other words, calling through a function pointer instead of a static poll_one function (which the compiler is free to inline) causes this effect, but a monolithic poll_one function would incur substantial computation overhead of its own.
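To make the shape of this concrete (hypothetical names and a trimmed struct, not the actual libmlx5 code; we assume the ibv_wc_ex and IBV_WC_EX_WITH_IMM definitions from this series): at CQ creation time the driver picks a poll_one variant matching the requested wc_flags, so the hot path contains no per-field conditionals:

struct mlx5_cq {
        /* hypothetical, trimmed: per-CQ specialized poll_one pointer */
        int (*poll_one)(struct mlx5_cq *cq, struct ibv_wc_ex *wc);
        /* ... hardware CQ state ... */
};

static int poll_one_basic(struct mlx5_cq *cq, struct ibv_wc_ex *wc)
{
        /* ... parse only the always-present fields ... */
        return 0;
}

static int poll_one_imm(struct mlx5_cq *cq, struct ibv_wc_ex *wc)
{
        /* ... also copy the immediate data ... */
        return 0;
}

static void select_poll_one(struct mlx5_cq *cq, uint64_t wc_flags)
{
        /* the checks run once here, not once per polled CQE */
        if (wc_flags & IBV_WC_EX_WITH_IMM)
                cq->poll_one = poll_one_imm;
        else
                cq->poll_one = poll_one_basic;
}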

Using a per-field getter introduces such a call (an unconditional jump) for every field. If we were using static linking (plus whole-program optimization), I agree this route would be better. However, based on the results we saw earlier, we are worried this might incur tremendous overhead. In addition, the user has to call read_next_cq for every completion entry, which is yet another unconditional jump.
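For comparison, the consumer loop under the getter scheme would look roughly like this (a fragment sketched against the struct above; we assume read_next_cq returns non-zero while entries remain, and process() stands in for application logic):

struct common_wr wr;
struct wr_address addr;

while (cq->read_next_cq(cq, &wr)) {           /* indirect call per CQE */
        cq->read_address(cq, &addr);          /* indirect call per field */
        uint64_t ts = cq->read_timestamp(cq); /* indirect call per field */
        process(&wr, &addr, ts);
}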

The mlx4 driver does something like this on every CQE to parse the
immediate data:

+	if (is_send) {
+		switch (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) {
+		case MLX4_RECV_OPCODE_RDMA_WRITE_IMM:
+			if (wc_flags & IBV_WC_EX_WITH_IMM) {
+				*wc_buffer.b32++ = cqe->immed_rss_invalid;

With this additional overhead for all the parse paths that have no
immediate:

+			if (wc_flags & IBV_WC_EX_WITH_IMM)
+				wc_buffer.b32++; // No imm to set

Whereas my suggestion would look more like:

uint32_t mlx4_read_immediate_data(struct ibv_cq *cq) { return cq->cur_cqe->immed_rss_invalid; }

[and even that can be micro-optimized further]

And the setting of WITH_IMM during the common parse is much less branchy:

opcode = cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK;
wc->flags |= WITH_IMM_BIT * (unsigned int)(opcode == MLX4_RECV_OPCODE_RDMA_WRITE_IMM ||
                                           opcode == MLX4_RECV_OPCODE_SEND_IMM ...);

I bet that is a lot less work going on, even including the indirect
jump overhead.
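Spelled out as a self-contained example of the branchless pattern (generic names; the bit position and parameters are illustrative, not mlx4 definitions):

#include <stdint.h>

#define WITH_IMM_BIT (1u << 2)          /* illustrative bit position */

/* The boolean comparison result (0 or 1) scales the mask, so the
 * compiler can typically emit setcc/or instead of a conditional jump. */
static inline uint32_t imm_flag(uint32_t opcode,
                                uint32_t op_write_imm,
                                uint32_t op_send_imm)
{
        return WITH_IMM_BIT *
               (uint32_t)(opcode == op_write_imm || opcode == op_send_imm);
}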


So we would either have to read the owner_sr_opcode several times (once for almost every related wc field) and parse it accordingly, or store the pre-processed owner_sr_opcode in the CQ itself and incur the overhead of "copying" it. In addition, we would need another API (end_cq) to indicate that the CQ element is returned to hardware ownership (and to release the vendor's per-poll_cq locks, if any exist). So, as far as we understand, you propose something like:

struct ibv_cq
{
    /* 'more' argument is new */
    int (*read_next_cq)(struct ibv_cq *cq, struct common_wr *res, bool more);
    /* function is new */
    int (*end_cq)(struct ibv_cq *cq);
    int (*read_address)(struct ibv_cq *cq, struct wr_address *res);
    uint64_t (*read_timestamp)(struct ibv_cq *cq);
    uint32_t (*read_immediate_data)(struct ibv_cq *cq);
    void (*read_something_else)(struct ibv_cq *cq, struct something_else *res);
};

Again, every call here introduces an unconditional jump, and the extra pointers make the CQ structure larger, so the CQ struct itself might no longer fit in a single cache line.
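A hypothetical consumer loop for this variant, to make the extra jumps visible (we assume end_cq returns all consumed entries to hardware ownership and releases any per-poll locks, as described above):

struct common_wr wr;

while (cq->read_next_cq(cq, &wr, true)) {       /* one jump per CQE */
        uint32_t imm = cq->read_immediate_data(cq); /* one jump per field */
        /* ... process wr and imm ... */
}
cq->end_cq(cq);  /* return CQEs to HW, release per-poll_cq locks */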

In addition, ibv_cq isn't extensible as it is today. Adding a new ibv_cq_ex would change all APIs that use it.

This is going to cause lots of problems for processing at high
speed, where we have to use the processor caches as carefully as possible
to squeeze out all we can get.

The driver should be built so the branch targets are all in close
cache lines, with function prologues deliberately elided so the branch
targets are small. There may even be further techniques to
micro-optimize the jump, if one wanted to spend the effort.

Further, since we are only running the code the caller actually needs,
we are not wasting icache lines trundling through a driver CQE parse
that has to handle all cases, even ones the caller doesn't care about.


The current optimization we implement in our user-space drivers is to provide multiple poll_one_ex functions, where each function assigns only some of the WC fields. Each function is tailored to a different use case and common scenario. This way, we don't assign fields the user doesn't care about and we avoid all conditional branches.
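One way to get these variants without duplicating source (a sketch of the technique, not the actual libmlx5 code; struct and flag names as in the earlier sketch) is a single always-inline worker called with constant wc_flags, so the compiler folds the flag checks away in each specialization:

static inline __attribute__((always_inline)) int
poll_one_ex(struct mlx5_cq *cq, struct ibv_wc_ex *wc, uint64_t wc_flags)
{
        /* ... common CQE parsing ... */
        if (wc_flags & IBV_WC_EX_WITH_IMM) {
                /* wc_flags is a compile-time constant in each caller,
                 * so this branch is either always taken or dead code */
                /* ... copy immediate data into *wc ... */
        }
        return 0;
}

static int poll_one_ex_basic(struct mlx5_cq *cq, struct ibv_wc_ex *wc)
{
        return poll_one_ex(cq, wc, 0);
}

static int poll_one_ex_imm(struct mlx5_cq *cq, struct ibv_wc_ex *wc)
{
        return poll_one_ex(cq, wc, IBV_WC_EX_WITH_IMM);
}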

It is possible this uses fewer icache lines than what we have today;
it depends on how big the calls end up.


The current proposal offers several poll_one_ex functions - a function per common case. True, we have more functions, but the function actually used is *almost* tailored to the user's requirements. All checks and conditions are done once. We trade the function-pointer call, plus re-checking/re-formatting some fields over and over again, for copying the fields into ibv_wc_ex.

4) A basic analysis says this trades cache line dirtying of the wc
    array for unconditional branches.
    It eliminates at least 1 conditional branch per CQE iteration by
    using only 1 loop.

This does none of that stuff at all if the device directly follows the
programmed format. There will be no need to do any driver formatting at
all.

Maybe, but no such device exists? I'm not sure I want to see a
miserable API on the faint hope of hardware support..

Even if hardware appears, this basic API pattern can still support
that optimally by using the #5 technique - and still avoids the memcpy
from the DMA buffer to the user memory.

    Compared to the poll_ex proposal this eliminates a huge
    number of conditional branches as 'wc_flags' and related no longer
    exist at all.

wc_flags may be somewhat bothersome. You do not want to check it inside the
loop.

Did you look at the mlx4 driver patch? It checks wc_flags constantly
when memcpying the HW CQE to the user memory - in the per CQE loop:

+			if (wc_flags & IBV_WC_EX_WITH_IMM) {
+				*wc_buffer.b32++ = cqe->immed_rss_invalid;

Every single user CQE write (or even not write) is guarded by a
conditional branch on the flags, and use of post-increment like that
creates critical path data dependencies and kills ILP. All this extra
code eats icache lines.


Look at the new code for libmlx5. This code is optimized *at compile time*: we create a poll_one_ex function for each common case. This has zero overhead.

Regarding the user-space check of wc_flags, I agree on eliminating it. If you created a CQ with a given set of attributes, the corresponding fields are guaranteed to exist for the known opcode values.

Sure, the scheme saves dirtying cache lines, but at the cost of a huge
number of these conditional branches and more code. I'm deeply
skeptical that this is an overall win compared to dirtying more cache
lines - mispredicted branches are expensive.

What I've suggested avoids all the cache line dirtying and avoids all
the above conditional branching and data dependencies at the cost of
a few indirect unconditional jumps. (and if #5 is possible they aren't
even jumps)


When "if (wc_flags & IBV_WC_EX_WITH_IMM)" is eliminated in the user-space application, conditional branching is eliminated in this proposal as well.

The question becomes: what costs more - copying the data or taking an unconditional branch?

I say there is no way to tell which scheme performs better without
careful benchmarking - it is not obvious that any of these trade-offs are
winners.


Agree. It may depend on the user application as well.

All CQEs should come with the fields requested, and the
layout of the data must be fixed within the receive loop. No additional
branches in the loop.

Sure, for the caller; I'm looking at this from the whole-system
perspective. The driver is doing a whack more work to produce this
formatting, and it certainly isn't free.


A dynamic WC format (or getter functions) isn't going to be free either, at least not without statically linking the vendor's user-space driver into the application or making all vendors behave the same.

Jason


Yishai and Matan.