Re: [PATCH 1/1] x86: In x86-64 barrier_nospec can always be lfence

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Sun, 9 Feb 2025 13:57:24 -0800

On Sun, 9 Feb 2025 at 13:40, David Laight <david.laight.linux@xxxxxxxxx> wrote:
>
> Any idea what the one used to synchronise rdtsc should be?
> 'lfence' is the right instruction (give or take), but it isn't
> a speculation issue.
> It really is 'wait for all memory accesses to finish' to give
> a sensible(ish) answer for cycle timing.

No, even that is actually very different.

What happened was that 'lfence' was designed and documented - and
named - as a memory fencing thing, but the *implementation* of it was
basically about the front-end pipeline.

IOW, ignore the name or the documentation. Think of "lfence" as a
"this stops the pipeline until all previous instructions have
retired". Because that is what it *is*.

So it's basically a synchronization instruction *regardless* of memory accesses.

Which is why it was then used for the rdtsc serialization - it
basically says "don't *actually* read the TSC until you've finished
everything you've started".
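
As a rough sketch of that rdtsc use (illustrative only: tsc_read_ordered()
is a made-up name, not the kernel's helper, and it assumes GCC/Clang inline
asm on x86, returning 0 anywhere else):

```c
#include <stdint.h>

/*
 * Hypothetical helper: read the TSC only after every earlier
 * instruction has completed. Not the kernel's implementation.
 */
static inline uint64_t tsc_read_ordered(void)
{
#if defined(__x86_64__) || defined(__i386__)
	uint32_t lo, hi;

	/*
	 * lfence: nothing later starts executing until everything
	 * earlier has finished. rdtsc: counter goes into edx:eax.
	 */
	__asm__ volatile("lfence\n\trdtsc" : "=a"(lo), "=d"(hi) : : "memory");
	return ((uint64_t)hi << 32) | lo;
#else
	return 0;	/* no TSC here; placeholder for non-x86 builds */
#endif
}
```

Note that there is no memory access anywhere in the timed region itself:
the lfence is purely holding back the rdtsc.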

And which is why it ended up being used for speculation control, even
though the instructions it serializes are *not* necessarily memory
accesses at all, but things like the address conditional that precedes
it.

So the speculation control use is literally "wait for the previous
conditional branches to retire before continuing". Yes, the
"continuing" tends to be a load, but that's almost incidental.
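
A minimal sketch of that pattern, assuming GCC/Clang inline asm.
spec_barrier() and table_read() are hypothetical names standing in for
barrier_nospec() and whatever code does the bounds check:

```c
#include <stddef.h>

/* Illustrative stand-in for barrier_nospec(); compiler-only where there is no lfence. */
static inline void spec_barrier(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__asm__ volatile("lfence" : : : "memory");
#else
	__asm__ volatile("" : : : "memory");
#endif
}

/* Hypothetical lookup: returns table[idx], or -1 if idx is out of range. */
static int table_read(const int *table, size_t len, size_t idx)
{
	if (idx >= len)
		return -1;
	/* Wait for the conditional branch above to retire... */
	spec_barrier();
	/* ...so this load cannot run with a mispredicted, out-of-range idx. */
	return table[idx];
}
```

The thing being waited for is the branch, not a memory access. The load
just happens to be what comes next.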

> And on old cpu you want nothing - not a locked memory access.

Well, back in the day, those locked instructions did the same thing.

> I couldn't work out why __smp_mb() is so much stronger than the rmb()
> and wmb() forms - I presume there is history there I wasn't looking for.

So on x86, both read and write barriers are complete no-ops, because
all reads are ordered, and all writes are ordered. So those only need
compiler barriers to guarantee that the compiler itself doesn't
re-order them.
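
In code, that amounts to something like the sketch below. The my_* names,
publish() and consume() are made up for illustration, and treating both
barriers as compiler-only is an x86 assumption that does not carry over to
weakly ordered architectures:

```c
/* An empty asm with a "memory" clobber: no instruction is emitted. */
#define compiler_barrier()	__asm__ volatile("" : : : "memory")
#define my_smp_rmb()		compiler_barrier()
#define my_smp_wmb()		compiler_barrier()

static int payload;
static volatile int ready;

/* Writer: the payload store must not be moved after the flag store. */
static void publish(int value)
{
	payload = value;
	my_smp_wmb();	/* only constrains the compiler */
	ready = 1;
}

/* Reader: the flag load must not be moved after the payload load. */
static int consume(void)
{
	if (!ready)
		return -1;
	my_smp_rmb();	/* only constrains the compiler */
	return payload;
}
```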

(Side note: earlier reads are also guaranteed to happen before later
writes, so it's really only writes that can be delayed past reads, but
we don't have a barrier for that situation anyway. Also note that all
of this is not "real" ordering, but only a guarantee that the
user-visible semantics are AS IF they were actually ordered - if
things are local in cache, ordering doesn't matter because no external
CPU can *see* what the ordering was).

So basically the only memory barriers that matter on x86 are the full
"smp_mb()" that orders reads vs writes, and the ordering for
non-ordered accesses used for IO.
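
The reads-vs-writes case is the classic store-buffer pattern. A hedged
sketch, with full_mb(), cpu0() and cpu1() as made-up names: the locked add
of zero to the stack is modeled on how a full barrier can be spelled on
x86-64, and the non-x86-64 fallback is the compiler's __sync_synchronize()
builtin:

```c
static volatile int x, y;

/* Illustrative full barrier: orders earlier stores against later loads. */
static inline void full_mb(void)
{
#if defined(__x86_64__)
	/* A locked read-modify-write that changes nothing but orders everything. */
	__asm__ volatile("lock; addl $0,-4(%%rsp)" : : : "memory", "cc");
#else
	__sync_synchronize();
#endif
}

/* Each side stores its own flag, then loads the other side's flag. */
static int cpu0(void)
{
	x = 1;
	full_mb();	/* without this, the load of y may pass the store to x */
	return y;
}

static int cpu1(void)
{
	y = 1;
	full_mb();
	return x;
}
```

If the two functions run concurrently on different CPUs, the barriers are
what rule out both of them returning 0.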

And then lfence is basically used for non-memory ordering too.

                Linus