On Wed, Sep 8, 2021 at 9:11 AM Jesper Dangaard Brouer
<jbrouer@xxxxxxxxxx> wrote:
>
> The non-save variant simply translated onto CLI and STI, which seems to
> be very fast.

It will depend on the microarchitecture. Happily:

> The cost of save+restore when the irqs are already disabled is the same
> (did a quick test).

The really expensive part used to be P4. 'popf' was hundreds of cycles
if any of the non-arithmetic bits changed, iirc.

P4 used to be a big headache just because of things like that -
straightforward code ran very well, but anything a bit more special
took forever because it flushed the pipeline.

So some of our optimizations may be historic because of things like
that. We don't really need to worry about the P4 glass jaws any more,
but it *used* to be much quicker to do 'preempt_disable()', which just
does an add to a memory location, than it was to disable interrupts.

> Cannot remember who told me, but (apparently) the expensive part is
> reading the CPU FLAGS.

Again, it ends up being very dependent on the uarch.

Reading and writing the flags register is somewhat expensive because
it's not really "one" register in hardware any more (even if that was
obviously the historical implementation).

These days, the arithmetic flags are generally multiple renamed
registers, and then the other flags are a separate system register
(possibly multiple bits spread out).

The cost of those flag reads and writes is hard to really specify,
because in an OoO architecture a lot of it ends up being "how much of
that can be done in parallel, and what's the pipeline serialization
cost".

Doing a loop with rdtsc is not necessarily AT ALL indicative of the
cost when there is other real code around it.

The cost _could_ be much smaller, if there is little serialization
with the normal code around it. Or it could be much bigger than what a
rdtsc loop shows, because if it's a hard pipeline flush, then a tight
loop with those things won't have any real work to flush, while in
"real code" there may be hundreds of instructions in flight and doing
the flush is very expensive.

The good news is that afaik all the modern x86 CPU microarchitectures
do reasonably well.

And while a "pushf/cli/popf" sequence is probably more cycles than an
add/subtract pair in a benchmark, if the preempt counter is not
otherwise needed, and is cold in the cache, then the pushf/cli/popf
may be *much* cheaper than a cache miss.

So the only way to really tell would be to run real benchmarks of real
loads on multiple different microarchitectures.

I'm pretty sure the actual result is: "you can't measure the 10-cycle
difference on any modern core, because it can actually go either way".

But "I'm pretty sure" and "reality" are not the same thing.

These days, pipeline flushes and cache misses (and, as a particularly
bad case, cache line ping-pong issues) are almost the only thing that
matters.

And the most common reason by far for pipeline flushes is branch
mispredicts, but see above: the system bits in the flags register
_have_ been a cause of them in the past, so it's not entirely
impossible.

              Linus
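
PS. For concreteness, a minimal sketch of the two patterns being
compared above, written in kernel terms. The wrapper names
flags_save_variant()/preempt_count_variant() are made up purely for
illustration, and this assumes kernel context rather than a standalone
program:

#include <linux/irqflags.h>     /* local_irq_save()/local_irq_restore() */
#include <linux/preempt.h>      /* preempt_disable()/preempt_enable() */

/* (a) Save flags and disable interrupts: pushf + cli, then popf, on x86. */
static void flags_save_variant(void)
{
        unsigned long flags;

        local_irq_save(flags);          /* reads FLAGS, then cli */
        /* ... critical section ... */
        local_irq_restore(flags);       /* popf */
}

/* (b) Just bump the per-CPU preempt count: a plain add to memory. */
static void preempt_count_variant(void)
{
        preempt_disable();              /* increment the preempt count */
        /* ... critical section ... */
        preempt_enable();               /* decrement, check need-resched */
}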
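
PPS. And this is the kind of tight rdtsc loop I mean: a user-space toy
(again, made up just for illustration) that times a flags save+restore
back to back with nothing else in flight. The cli/sti part is left out
because it would need ring 0 or iopl(3), and the number it prints says
little about the cost inside real code:

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>          /* __rdtsc() */

int main(void)
{
        enum { ITERS = 10 * 1000 * 1000 };
        unsigned long flags;
        uint64_t start, end;

        start = __rdtsc();
        for (long i = 0; i < ITERS; i++) {
                /* read the flags register into 'flags', then write it back */
                asm volatile("pushf ; pop %0" : "=rm" (flags) : : "memory");
                asm volatile("push %0 ; popf" : : "g" (flags) : "memory", "cc");
        }
        end = __rdtsc();

        printf("~%.1f cycles per save+restore, measured in isolation\n",
               (double)(end - start) / ITERS);
        return 0;
}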