On Wed, 8 Feb 2023 at 13:49, Jonathan Wakely <jwakely.gcc@xxxxxxxxx> wrote:
>
> On Wed, 8 Feb 2023 at 13:31, Gaelan Steele via Gcc-help
> <gcc-help@xxxxxxxxxxx> wrote:
> >
> > Hi all,
> >
> > In a computer architecture class, we happened across a strange compilation choice by GCC that neither I nor my professor can make much sense of. The source is as follows:
> >
> > void foo(int *a, const int *__restrict b, const int *__restrict c)
> > {
> >     for (int i = 0; i < 16; i++) {
> >         a[i] = b[i] + c[i];
> >     }
> > }
> >
> > I won't reproduce the full compiled output here, as it's rather long, but when compiled with -O3 -mno-avx -mno-sse, GCC 12.2 for x86-64 (via Compiler Explorer: https://godbolt.org/z/o9e4o7cj4) produces an unrolled loop that appears to write each sum into an array on the stack before copying it into the provided pointer a. This seems hugely inefficient - it's doing quite a few memory accesses - and I can't see why it would be necessary.
>
> I don't think it's *necessary*. If you use -Os or -O1 or -O2 you get a
> loop. So it's just an optimization choice at -O3, presumably based on
> cost estimates that say that fully unrolling the loop will make the
> code faster than looping.
>
> > Am I missing some reason why this is more efficient than the naive approach (computing each sum into an intermediate register, then writing it directly into a)?
>
> Benchmarking the function at different optimization levels I get:
>
> Run on (8 X 4500 MHz CPU s)
> CPU Caches:
>   L1 Data 32 KiB (x4)
>   L1 Instruction 32 KiB (x4)
>   L2 Unified 256 KiB (x4)
>   L3 Unified 8192 KiB (x1)
> Load Average: 0.14, 0.22, 0.39
> ***WARNING*** CPU scaling is enabled, the benchmark real time
> measurements may be noisy and will incur extra overhead.
> -----------------------------------------------------
> Benchmark           Time             CPU   Iterations
> -----------------------------------------------------
> O3               1.60 ns         1.60 ns    432901632
> O2               3.56 ns         3.56 ns    197086506
> O1               6.87 ns         6.86 ns    101839250
> Os               8.23 ns         8.22 ns     85273333
>
> Using quickbench:
> https://quick-bench.com/q/sSwVvtrkOCp9q-XyKAevthiaNAw

Oops, sorry, those were my original results *without* the -mno-avx
-mno-sse options! But that just shows that vectorization makes the
function fast. Turning that off I get:

O3               58.3 ns         58.2 ns     11725604
O2               61.7 ns         61.6 ns     10930434
O1               7.37 ns         7.35 ns     95752192
Os               8.57 ns         8.56 ns     79448548

So it does look like GCC is making poor choices here.
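
For anyone who wants to reproduce this locally rather than on quick-bench,
a minimal Google Benchmark harness along the following lines should do; the
BM_foo name, the initial values, and the DoNotOptimize/ClobberMemory
placement here are just illustrative, not necessarily what the linked
quick-bench snippet does:

#include <benchmark/benchmark.h>

// The function under test, as posted.
void foo(int *a, const int *__restrict b, const int *__restrict c)
{
    for (int i = 0; i < 16; i++) {
        a[i] = b[i] + c[i];
    }
}

static void BM_foo(benchmark::State& state)
{
    int a[16], b[16], c[16];
    for (int i = 0; i < 16; i++) {
        b[i] = i;
        c[i] = 2 * i;
    }
    int *pa = a;
    const int *pb = b;
    const int *pc = c;
    for (auto _ : state) {
        // Launder the pointers so the optimizer can't fold the call away...
        benchmark::DoNotOptimize(pa);
        benchmark::DoNotOptimize(pb);
        benchmark::DoNotOptimize(pc);
        foo(pa, pb, pc);
        // ...and force the stores through pa to actually happen.
        benchmark::ClobberMemory();
    }
}
BENCHMARK(BM_foo);

BENCHMARK_MAIN();

Build with something like
  g++ -O3 -mno-avx -mno-sse bench.cpp -lbenchmark -lpthread
and rebuild/rerun at each optimization level to get rows like the ones
above, or link separately compiled copies of foo if you want all the
levels in one binary.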