On Thu, Mar 09, 2017 at 05:42:19PM +0200, Yishai Hadas wrote:

> >The counter point is that the unlock macro can combine the WC flushing
> >barrier with the spinlock atomics, reducing the amount of global
> >fencing. If you remove the macro you remove that optimization.
>
> The optimization is done as part of mmio_wc_spinlock() for X86, this
> macro is still used.

I'm talking about optimizing the unlock too.

The x86 definition probably requires the unlock to flush the WC buffers
just like for the lock, so avoiding the unlocking SFENCE entirely will
increase throughput further, but with a bit more latency until the
flush.

> >Why not do this:
> >
> >-	mlx4_bf_copy(ctx->bf_page + ctx->bf_offset, (unsigned long *) ctrl,
> >-		     align(size * 16, 64));
> >-
> >+	tmp_bf_offset = ctx->bf_offset;
> > 	ctx->bf_offset ^= ctx->bf_buf_size;
>
> The above 2 commands are still delaying the writing to the NIC compared
> to the original code, where it was done in one command after
> mlx4_bf_copy().

It is so simple: look at the assembly. My version eliminates an extra
load, for instance. So we get one 'cycle' better on throughput and one
cycle worse on latency to the SFENCE.

But.. this routine was obviously never written to optimize latency to
the SFENCE, e.g. why isn't '++qp->sq.head;' pushed to after the SFENCE
if that is so important? Then again, pushing it after will improve
latency but hurt throughput.

.. and if you are so concerned about latency to the SFENCE you should be
super excited about the barrier changes I sent you, as those will
improve that by a few cycles also.

I honestly think you are trying far too hard to pointlessly preserve the
exact original code...

If you want to wreck the API like this, I would rather do it supported
by actual numbers. Add some TSC measurements in this area and see what
the different scenarios cost.

IMHO this stuff is already hard enough; having a simple to use,
symmetric API is more valuable.

> +/* Higher Level primitives */

If you are going to send this as a patch please include my updated
comment.

> +/* Do mmio_wc_start and grab a spinlock */
> +static inline void mmio_wc_spinlock(pthread_spinlock_t *lock)
> +{
> +	pthread_spin_lock(lock);
> +#if !defined(__i386__) && !defined(__x86_64__)
> +	/* For x86 the serialization within the spin lock is enough to
> +	 * strongly order WC and other memory types. */
> +	mmio_wc_start();
> +#endif

I would like to see the unlock inline still present in the header, for
clarity to the reader about what the expected pattern is, and a comment
in mlx4/mlx5 indicating they are not using the unlock macro directly in
order to reduce latency to the flush.

Jason
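
For reference, a minimal sketch of the kind of unlock inline asked for
above, assuming the existing mmio_flush_writes() barrier is the WC
flush; the name mmio_wc_spinunlock and the comment wording are
illustrative only, not settled API:

/* Flush WC writes and release a lock taken with mmio_wc_spinlock() */
static inline void mmio_wc_spinunlock(pthread_spinlock_t *lock)
{
	/* It may be possible on x86 to rely on the serialization around
	 * the lock/unlock to flush the WC buffers quickly enough that
	 * this explicit flush could be dropped, trading latency for
	 * throughput as discussed above. */
	mmio_flush_writes();
	pthread_spin_unlock(lock);
}

mlx4/mlx5 could then open-code the flush plus pthread_spin_unlock() and
carry the suggested comment explaining that they bypass this helper to
reduce latency to the flush.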
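
To put "actual numbers" behind the TSC suggestion, a stand-alone harness
along these lines could be used. It does not touch mlx4 at all; it only
compares a spin lock/unlock pair with and without an explicit SFENCE
standing in for the WC flush, so everything here (file name, iteration
count, the __rdtsc()/_mm_sfence() intrinsics) is illustrative rather
than a proposed patch:

/* Rough TSC probe: cost of a spin lock/unlock pair with and without an
 * explicit SFENCE standing in for the WC flush.  x86 only.
 * Build: gcc -O2 -pthread tsc_probe.c -o tsc_probe
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>		/* __rdtsc(), _mm_sfence() */

#define ITERS 1000000

static pthread_spinlock_t lock;

static uint64_t time_loop(int with_sfence)
{
	uint64_t start = __rdtsc();
	int i;

	for (i = 0; i < ITERS; i++) {
		pthread_spin_lock(&lock);
		if (with_sfence)
			_mm_sfence();	/* stand-in for the WC flush */
		pthread_spin_unlock(&lock);
	}
	return __rdtsc() - start;
}

int main(void)
{
	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);

	printf("lock/unlock        : %llu cycles for %d iterations\n",
	       (unsigned long long)time_loop(0), ITERS);
	printf("lock/sfence/unlock : %llu cycles for %d iterations\n",
	       (unsigned long long)time_loop(1), ITERS);
	return 0;
}

Run it pinned to an otherwise idle core; the difference between the two
totals divided by the iteration count gives a rough per-doorbell cost
for the extra flush, which is the kind of number that would settle the
API question.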