Thank you very much, Alexander.

> Date: Tue, 3 May 2022 12:09:32 +0300 (MSK)
> From: Alexander Monakov <amonakov@xxxxxxxxx>
> cc: gcc-help@xxxxxxxxxxx, stephane.glondu@xxxxxxxx, sibid@xxxxxxx
>
> On Tue, 3 May 2022, Paul Zimmermann via Gcc-help wrote:
>
> > Does anyone have a clue?
>
> I can reproduce a difference, but in my case it's simply because in -std=gnuXX
> mode (as opposed to -std=cXX) GCC enables FMA contraction, enabling the last few
> steps in the benchmarked function to use fma instead of separate mul/add
> instructions.

But then you should get better (i.e. smaller) timings with -std=gnuXX than
with -std=cXX, instead of worse timings as we get?

> (regarding __builtin_expect, it also makes a small difference in my case,
> it seems GCC generates some redundant code without it, but the difference is
> 10x smaller than what presence/absence of FMA gives)
>
> I think you might be able to figure it out on your end if you run both variants
> under 'perf stat', note how cycle count and instruction counts change, and then
> look at disassembly to see what changed. You can use 'perf record' and 'perf
> report' to easily see the hot code path; if you do that, I'd recommend to run
> it with the same sampling period in both cases, e.g. like this:
>
>     perf record -e instructions:P -c 500000 ./perf ...

Thank you, we'll investigate that.

Best regards,
Paul