On Sun, 9 Feb 2025 at 13:40, David Laight <david.laight.linux@xxxxxxxxx> wrote:
>
> Any idea what the one used to synchronise rdtsc should be?
> 'lfence' is the right instruction (give or take), but it isn't
> a speculation issue.
> It really is 'wait for all memory accesses to finish' to give
> a sensible(ish) answer for cycle timing.

No, even that is actually very different.

What happened was that 'lfence' was designed and documented - and
named - as a memory fencing thing, but the *implementation* of it was
basically about the front-end pipeline.

IOW, ignore the name or the documentation. Think of "lfence" as a
"this stops the pipeline until all previous instructions have
retired". Because that is what it *is*.

So it's basically a synchronization instruction *regardless* of memory
accesses.

Which is why it was then used for the rdtsc serialization - it
basically says "don't *actually* read the TSC until you've finished
everything you've started".

And which is why it ended up being used for speculation control, even
though the instructions it serializes are *not* necessarily memory
accesses at all, but things like the address conditional that precedes
it.

So the speculation control use is literally "wait for the previous
conditional branches to retire before continuing". Yes, the
"continuing" tends to be a load, but that's almost incidental.

> And on old cpu you want nothing - not a locked memory access.

Well, back in the day, those locked instructions did the same thing.

> I couldn't work out why __smp_mb() is so much stronger than the rmb()
> and wmb() forms - I presume the is history there I wasn't looking for.

So on x86, both read and write barriers are complete no-ops, because
all reads are ordered, and all writes are ordered. So those only need
compiler barriers to guarantee that the compiler itself doesn't
re-order them.

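To make the rmb()/wmb() vs. smp_mb() difference concrete, here is a
rough user-space sketch. The sketch_*() names and the publish() /
try_consume() / store_then_load() helpers are made up for illustration
and are not the kernel's actual definitions. It assumes GCC or Clang,
and the compiler-barrier-only read/write barriers are only valid under
the x86 ordering rules described above, so other architectures fall
back to a full fence.

```c
#include <assert.h>
#include <stdbool.h>

#if defined(__x86_64__)
/* Compiler barrier only: emits no instruction, just stops the compiler
 * from moving memory accesses across this point. */
#define sketch_barrier() __asm__ __volatile__("" ::: "memory")
#define sketch_rmb()     sketch_barrier()
#define sketch_wmb()     sketch_barrier()
/* Full barrier: a locked read-modify-write that adds zero to a stack
 * slot, so memory contents are unchanged. */
#define sketch_mb() \
    __asm__ __volatile__("lock; addl $0,-4(%%rsp)" ::: "memory", "cc")
#else
/* Assumption: off x86, use a full fence for all three so the sketch
 * stays correct, at the cost of hiding the distinction. */
#define sketch_rmb() __atomic_thread_fence(__ATOMIC_SEQ_CST)
#define sketch_wmb() __atomic_thread_fence(__ATOMIC_SEQ_CST)
#define sketch_mb()  __atomic_thread_fence(__ATOMIC_SEQ_CST)
#endif

static int payload;
static volatile int ready;

/* Write/write ordering: the payload store must not pass the flag store. */
void publish(int value)
{
    payload = value;
    sketch_wmb();
    ready = 1;
}

/* Read/read ordering: the flag load must not pass the payload load. */
bool try_consume(int *out)
{
    if (!ready)
        return false;
    sketch_rmb();
    *out = payload;
    return true;
}

/* Store followed by a load of a *different* location: the one case
 * where the hardware may reorder, so it needs the full barrier. */
int store_then_load(volatile int *mine, volatile int *theirs)
{
    *mine = 1;
    sketch_mb();
    return *theirs;
}
```

On x86-64 the first two helpers compile to plain loads and stores with
no fence instruction between them, while store_then_load() gets the
locked add.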
(Side note: earlier reads are also guaranteed to happen before later
writes, so it's really only writes that can be delayed past reads, but
we don't have a barrier for that situation anyway. Also note that all
of this is not "real" ordering, but only a guarantee that the
user-visible semantics are AS IF they were actually ordered - if
things are local in cache, ordering doesn't matter because no external
CPU can *see* what the ordering was).

So basically the only memory barriers that matter on x86 are the full
"smp_mb()" that orders reads vs writes, and the ordering for
non-ordered accesses used for IO.

And then lfence is basically used for non-memory ordering too.

               Linus