On Tue, Jan 12, 2016 at 09:20:06AM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 5:57 AM, Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> > #ifdef xchgrz
> > /* same as xchg but poking at gcc red zone */
> > #define barrier() do { int ret; asm volatile ("xchgl %0, -4(%%" SP ");": "=r"(ret) :: "memory", "cc"); } while (0)
> > #endif
>
> That's not safe in general. gcc might be using its redzone, so doing
> xchg into it is unsafe.
>
> But..
>
> > Is this a good way to test it?
>
> .. it's fine for some basic testing. It doesn't show any subtle
> interactions (ie some operations may have different dynamic behavior
> when the write buffers are busy etc), but as a baseline for "how fast
> can things go" the stupid raw loop is fine. And while the xchg into
> the redzone wouldn't be acceptable as a real implementation, for
> timing testing it's likely fine (ie you aren't hitting the problem it
> can cause).
>
> > So mfence is more expensive than locked instructions/xchg, but sfence/lfence
> > are slightly faster, and xchg and locked instructions are very close if
> > not the same.
>
> Note that we never actually *use* lfence/sfence. They are pointless
> instructions when looking at CPU memory ordering, because for pure CPU
> memory ordering stores and loads are already ordered.
>
> The only reason to use lfence/sfence is after you've used nontemporal
> stores for IO.

By the way, the comment in barrier.h says:

/*
 * Some non-Intel clones support out of order store. wmb() ceases to be
 * a nop for these.
 */

and while the first sentence may well be true, if you have an SMP system
with out of order stores, making wmb not a nop will not help.

Additionally, as you point out, wmb is not a nop even for regular Intel
CPUs because of these weird use-cases.

Drop this comment?

> That's very very rare in the kernel. So I wouldn't
> worry about those.

Right - I'll leave these alone, whoever wants to optimize this path
will have to do the necessary research.
> But yes, it does sound like mfence is just a bad idea too.
>
> > There isn't any extra magic behind mfence, is there?
>
> No.
>
> I think the only issue is that there has never been any real reason
> for CPU designers to try to make mfence go particularly fast. Nobody
> uses it, again with the exception of some odd loops that use
> nontemporal stores, and for those the cost tends to always be about
> the nontemporal accesses themselves (often to things like GPU memory
> over PCIe), and the mfence cost of a few extra cycles is negligible.
>
> The reason "lock ; add $0" has generally been the fastest we've found
> is simply that locked ops have been important for CPU designers.
>
> So I think the patch is fine, and we should likely drop the use of mfence..
>
>                 Linus

OK so should I repost after a bit more testing?
I don't believe this will affect the kernel build benchmark, but
I'll try :)

-- 
MST
_______________________________________________
Virtualization mailing list
Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/virtualization