On Wed, Mar 02, 2016 at 01:08:30PM -0600, Christoph Lameter wrote:
> On Wed, 2 Mar 2016, Jason Gunthorpe wrote:
>
> > So all this ugly API to minimize cache line usage has no measured
> > performance gain?
>
> We have seen an increased cacheline footprint adding ~100-200ns to receive
> latencies during loops. This does not show up in synthetic loads that do
> not do much processing since their cache footprint is minimal.

I know you've seen effects, but apparently the patch authors haven't.

> > > Does the opaque pointer guarantee an aligned access? Who allocates the
> > > space for the vendor's CQE? Any concrete example?
> > > One of the problems here is that CQEs could be formatted as
> > > "if QP type is y, then copy the field z from o". Doing it that way may
> > > result in doing the same "if" multiple times. The current approach could
> > > still avoid the memcpy and write straight to the user's buffer.
> >
> > No, none of that...
> >
> > struct ibv_cq
> > {
> >     int (*read_next_cq)(struct ibv_cq *cq, struct common_wr *res);
> >     int (*read_address)(struct ibv_cq *cq, struct wr_address *res);
> >     uint64_t (*read_timestamp)(struct ibv_cq *cq);
> >     uint32_t (*read_immediate_data)(struct ibv_cq *cq);
> >     void (*read_something_else)(struct ibv_cq *cq, struct something_else *res);
> > };
>
> Argh. You are requiring multiple indirect function calls
> to retrieve the same information and therefore significantly increase
> latency.

It depends. These are unconditional branches, and by design they elide a
lot of other code and conditional branches.
The mlx4 driver does something like this on every CQE to parse the
immediate data:

+	if (is_send) {
+		switch (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) {
+		case MLX4_RECV_OPCODE_RDMA_WRITE_IMM:
+			if (wc_flags & IBV_WC_EX_WITH_IMM) {
+				*wc_buffer.b32++ = cqe->immed_rss_invalid;

With this additional overhead for all the parse paths that have no
immediate:

+	if (wc_flags & IBV_WC_EX_WITH_IMM)
+		wc_buffer.b32++; // No imm to set

Whereas my suggestion would look more like:

 mlx4_read_immediate_data(cq) { return cq->cur_cqe->immed_rss_invalid; }

[and even that can be micro-optimized further]

And the setting of WITH_IMM during the common parse is much less branchy:

 opcode = cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK;
 wc->flags |= WITH_IMM_BIT <<
	((unsigned int)(opcode == MLX4_RECV_OPCODE_RDMA_WRITE_IMM ||
			opcode == MLX4_RECV_OPCODE_SEND_IMM ...));

I bet that is a lot less work overall, even including the indirect jump
overhead.

> This is going to cause lots of problems for processing at high
> speed where we have to use the processor caches as carefully as possible
> to squeeze out all we can get.

The driver should be built so the branch targets are all in close cache
lines, with function prologues deliberately elided so the branch targets
are small. There may even be further techniques to micro-optimize the
jump, if one wanted to spend the effort.

Further, since we only run the code the caller actually needs, we are not
wasting icache lines trundling through a driver CQE parse that has to
handle all cases, even ones the caller doesn't care about. It is possible
this uses fewer icache lines than what we have today; it depends how big
the calls end up.

> > 4) A basic analysis says this trades cache line dirtying of the wc
> > array for unconditional branches.
> > It eliminates at least 1 conditional branch per CQE iteration by
> > using only 1 loop.
>
> This does none of that stuff at all if the device directly follows the
> programmed format.
> There will be no need to do any driver formatting at
> all.

Maybe, but no such device exists? I'm not sure I want to see a miserable
API on the faint hope of hardware support.

Even if such hardware appears, this basic API pattern can still support it
optimally by using the #5 technique - and it still avoids the memcpy from
the DMA buffer to user memory.

> > Compared to the poll_ex proposal this eliminates a huge
> > number of conditional branches as 'wc_flags' and related no longer
> > exist at all.
>
> wc_flags may be something bothersome. You do not want to check inside the
> loop.

Did you look at the mlx4 driver patch? It checks wc_flags constantly while
memcpying the HW CQE to user memory - in the per-CQE loop:

+			if (wc_flags & IBV_WC_EX_WITH_IMM) {
+				*wc_buffer.b32++ = cqe->immed_rss_invalid;

Every single user CQE write (or even non-write) is guarded by a
conditional branch on the flags, and post-increment used like that creates
critical-path data dependencies and kills ILP. All this extra code eats
icache lines.

Sure, the scheme saves dirtying cache lines, but at the cost of a huge
number of these conditional branches and more code. I'm deeply skeptical
that is an overall win compared to dirtying more cache lines - mispredicted
branches are expensive.

What I've suggested avoids all the cache line dirtying and all of the
above conditional branching and data dependencies, at the cost of a few
indirect unconditional jumps. (And if #5 is possible, they aren't even
jumps.)

I say there is no way to tell which scheme performs better without careful
benchmarking - it is not obvious any of these trade-offs is a winner.

> All cqe's should come with the fields requested and the
> layout of the data must be fixed when in the receive loop. No additional
> branches in the loop.

Sure, for the caller. I'm looking at this from the whole-system
perspective: the driver is doing a whack more work to produce this
formatting, and that certainly isn't free.
Jason