On Thu, Mar 09, 2017 at 05:42:19PM +0200, Yishai Hadas wrote:

> >The counter point is that the unlock macro can combine the WC flushing
> >barrier with the spinlock atomics, reducing the amount of global
> >fencing. If you remove the macro you remove that optimization.
>
> The optimization is done as part of mmio_wc_spinlock() for X86, this
> macro is still used.

I'm talking about optimizing the unlock too.

The x86 definition probably requires the unlock to flush the WC buffers
just like for the lock, so avoiding the unlocking SFENCE entirely will
increase throughput further, but with a bit more latency until the
flush.

> >Why not do this:
> >
> >-	mlx4_bf_copy(ctx->bf_page + ctx->bf_offset, (unsigned long *) ctrl,
> >-		     align(size * 16, 64));
> >-
> >+	tmp_bf_offset = ctx->bf_offset;
> > 	ctx->bf_offset ^= ctx->bf_buf_size;
>
> The above 2 commands are still delaying the writing to the NIC compared
> to the original code, where it was done in one command after
> mlx4_bf_copy().

It is so simple: look at the assembly. My version eliminates an extra
load, for instance. So we get one 'cycle' better on throughput and one
cycle worse on latency to the SFENCE.

But.. this routine was obviously never written to optimize latency to
the SFENCE, e.g. why isn't '++qp->sq.head;' pushed to after the SFENCE
if that is so important? Then again, pushing it after will improve
latency but hurt throughput.

.. and if you are so concerned about latency to the SFENCE you should be
super excited about the barrier changes I sent you, as those will
improve that by a few cycles also.

I honestly think you are trying far too hard to pointlessly preserve the
exact original code...

If you want to wreck the API like this, I would rather do it supported
by actual numbers. Add some TSC measurements in this area and see what
the different scenarios cost.

IMHO this stuff is already hard enough; having a simple to use,
symmetric API is more valuable.

> +/* Higher Level primitives */

If you are going to send this as a patch please include my updated
comment.

> +/* Do mmio_wc_start and grab a spinlock */
> +static inline void mmio_wc_spinlock(pthread_spinlock_t *lock)
> +{
> +	pthread_spin_lock(lock);
> +#if !defined(__i386__) && !defined(__x86_64__)
> +	/* For x86 the serialization within the spin lock is enough to
> +	 * strongly order WC and other memory types. */
> +	mmio_wc_start();
> +#endif

I would like to see the unlock inline still present in the header, for
clarity to the reader about what the expected pattern is, and a comment
in mlx4/mlx5 indicating they are not using the unlock macro directly in
order to reduce latency to the flush.

Jason
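
For reference, a minimal sketch of the kind of unlock inline asked for
above, assuming the existing mmio_flush_writes() barrier is the WC
flush; the name mmio_wc_spinunlock and the comment wording are
illustrative only, not settled API:

/* Flush WC writes and release a lock taken with mmio_wc_spinlock() */
static inline void mmio_wc_spinunlock(pthread_spinlock_t *lock)
{
	/* It may be possible on x86 to rely on the serialization around
	 * the lock/unlock to flush the WC buffers quickly enough that
	 * this explicit flush could be dropped, trading latency for
	 * throughput as discussed above. */
	mmio_flush_writes();
	pthread_spin_unlock(lock);
}

mlx4/mlx5 could then open-code the flush plus pthread_spin_unlock() and
carry the suggested comment explaining that they bypass this helper to
reduce latency to the flush.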
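
To put "actual numbers" behind the TSC suggestion, a stand-alone harness
along these lines could be used. It does not touch mlx4 at all; it only
compares a spin lock/unlock pair with and without an explicit SFENCE
standing in for the WC flush, so everything here (file name, iteration
count, the __rdtsc()/_mm_sfence() intrinsics) is illustrative rather
than a proposed patch:

/* Rough TSC probe: cost of a spin lock/unlock pair with and without an
 * explicit SFENCE standing in for the WC flush.  x86 only.
 * Build: gcc -O2 -pthread tsc_probe.c -o tsc_probe
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>		/* __rdtsc(), _mm_sfence() */

#define ITERS 1000000

static pthread_spinlock_t lock;

static uint64_t time_loop(int with_sfence)
{
	uint64_t start = __rdtsc();
	int i;

	for (i = 0; i < ITERS; i++) {
		pthread_spin_lock(&lock);
		if (with_sfence)
			_mm_sfence();	/* stand-in for the WC flush */
		pthread_spin_unlock(&lock);
	}
	return __rdtsc() - start;
}

int main(void)
{
	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);

	printf("lock/unlock        : %llu cycles for %d iterations\n",
	       (unsigned long long)time_loop(0), ITERS);
	printf("lock/sfence/unlock : %llu cycles for %d iterations\n",
	       (unsigned long long)time_loop(1), ITERS);
	return 0;
}

Run it pinned to an otherwise idle core; the difference between the two
totals divided by the iteration count gives a rough per-doorbell cost
for the extra flush, which is the kind of number that would settle the
API question.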