On Wed, 8 Feb 2023 at 13:31, Gaelan Steele via Gcc-help <gcc-help@xxxxxxxxxxx> wrote:
>
> Hi all,
>
> In a computer architecture class, we happened across a strange compilation choice by GCC that neither I nor my professor can make much sense of. The source is as follows:
>
> void foo(int *a, const int *__restrict b, const int *__restrict c)
> {
>     for (int i = 0; i < 16; i++) {
>         a[i] = b[i] + c[i];
>     }
> }
>
> I won't reproduce the full compiled output here, as it's rather long, but when compiled with -O3 -mno-avx -mno-sse, GCC 12.2 for x86-64 (via Compiler Explorer: https://godbolt.org/z/o9e4o7cj4) produces an unrolled loop that appears to write each sum into an array on the stack before copying it into the provided pointer a. This seems hugely inefficient - it's doing quite a few memory accesses - and I can't see why it would be necessary.

I don't think it's *necessary*. If you use -Os or -O1 or -O2 you get a loop. So it's just an optimization choice at -O3, presumably based on cost estimates that say that fully unrolling the loop will make the code faster than looping.

>
> Am I missing some reason why this is more efficient than the naive approach (computing each sum into an intermediate register, then writing it directly into a)?

Benchmarking the function at different optimization levels I get:

Run on (8 X 4500 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 0.14, 0.22, 0.39
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
O3               1.60 ns         1.60 ns    432901632
O2               3.56 ns         3.56 ns    197086506
O1               6.87 ns         6.86 ns    101839250
Os               8.23 ns         8.22 ns     85273333

Using quickbench: https://quick-bench.com/q/sSwVvtrkOCp9q-XyKAevthiaNAw
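
If you want to repeat the comparison locally rather than on quick-bench, a rough harness along the following lines might do. This is only a sketch and not the code behind the numbers above: the helper names time_foo_total_ns and foo_matches_reference are made up for illustration, the noinline attribute and the empty asm barrier are GCC extensions added here so the call is not inlined or optimized away, and clock() is much coarser than a real benchmarking library. It sticks to integer arithmetic so that it should also build with -mno-sse.

```c
#include <time.h>

/* The function from the original post. The noinline attribute is an
   addition for this sketch: it keeps GCC from inlining the call into
   the timing loop below. */
__attribute__((noinline))
void foo(int *a, const int *__restrict b, const int *__restrict c)
{
    for (int i = 0; i < 16; i++) {
        a[i] = b[i] + c[i];
    }
}

/* Hypothetical helper: returns 1 if foo produces the expected sums
   for a simple known input, 0 otherwise. */
int foo_matches_reference(void)
{
    int a[16], b[16], c[16];
    for (int i = 0; i < 16; i++) {
        b[i] = i;
        c[i] = 2 * i;
    }
    foo(a, b, c);
    for (int i = 0; i < 16; i++) {
        if (a[i] != 3 * i) {
            return 0;
        }
    }
    return 1;
}

/* Hypothetical helper: returns the total CPU time, in nanoseconds,
   spent on `iterations` calls to foo. Divide by `iterations` for a
   per-call figure. */
long long time_foo_total_ns(long iterations)
{
    int a[16], b[16], c[16];
    for (int i = 0; i < 16; i++) {
        b[i] = i;
        c[i] = 2 * i;
    }

    clock_t start = clock();
    for (long n = 0; n < iterations; n++) {
        foo(a, b, c);
        /* Compiler barrier (GCC extension): pretends memory reachable
           through `a` may be read, so the stores cannot be dropped. */
        __asm__ __volatile__("" : : "r"(a) : "memory");
    }
    clock_t end = clock();

    return (long long)(end - start) * 1000000000LL / CLOCKS_PER_SEC;
}
```

Calling time_foo_total_ns from a small main with a large iteration count (say, a hundred million) and rebuilding at -Os, -O1, -O2 and -O3 should give a rough picture, but expect the absolute numbers to differ from the table above.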