On Tue, 3 May 2022, Paul Zimmermann via Gcc-help wrote:

> Does anyone have a clue?

I can reproduce a difference, but in my case it's simply because in -std=gnuXX mode (as opposed to -std=cXX) GCC enables FMA contraction, which lets the last few steps in the benchmarked function use fma instead of separate mul/add instructions.

(Regarding __builtin_expect, it also makes a small difference in my case: it seems GCC generates some redundant code without it, but the difference is 10x smaller than what the presence/absence of FMA gives.)

I think you might be able to figure it out on your end if you run both variants under 'perf stat', note how the cycle and instruction counts change, and then look at the disassembly to see what changed. You can use 'perf record' and 'perf report' to easily see the hot code path; if you do that, I'd recommend running it with the same sampling period in both cases, e.g. like this:

    perf record -e instructions:P -c 500000 ./perf ...

Alexander