Re: Why does this unrolled function write to the stack?

On Wed, 8 Feb 2023 at 13:49, Jonathan Wakely <jwakely.gcc@xxxxxxxxx> wrote:
>
> On Wed, 8 Feb 2023 at 13:31, Gaelan Steele via Gcc-help
> <gcc-help@xxxxxxxxxxx> wrote:
> >
> > Hi all,
> >
> > In a computer architecture class, we happened across a strange compilation choice by GCC that neither I nor my professor can make much sense of. The source is as follows:
> >
> > void foo(int *a, const int *__restrict b, const int *__restrict c)
> > {
> >   for (int i = 0; i < 16; i++) {
> >     a[i] = b[i] + c[i];
> >   }
> > }
> >
> > I won't reproduce the full compiled output here, as it's rather long, but when compiled with -O3 -mno-avx -mno-sse, GCC 12.2 for x86-64 (via Compiler Explorer: https://godbolt.org/z/o9e4o7cj4) produces an unrolled loop that appears to write each sum into an array on the stack before copying it into the provided pointer a. This seems hugely inefficient - it's doing quite a few memory accesses - and I can't see why it would be necessary.
>
> I don't think it's *necessary*. If you use -Os or -O1 or -O2 you get a
> loop. So it's just an optimization choice at -O3 presumably based on
> cost estimates that say that fully unrolling the loop will make the
> code faster than looping.
>
> >
> > Am I missing some reason why this is more efficient than the naive approach (computing each sum into an intermediate register, then writing it directly into a)?
>
> Benchmarking the function at different optimization levels I get:
>
> Run on (8 X 4500 MHz CPU s)
> CPU Caches:
>  L1 Data 32 KiB (x4)
>  L1 Instruction 32 KiB (x4)
>  L2 Unified 256 KiB (x4)
>  L3 Unified 8192 KiB (x1)
> Load Average: 0.14, 0.22, 0.39
> ***WARNING*** CPU scaling is enabled, the benchmark real time
> measurements may be noisy and will incur extra overhead.
> -----------------------------------------------------
> Benchmark           Time             CPU   Iterations
> -----------------------------------------------------
> O3               1.60 ns         1.60 ns    432901632
> O2               3.56 ns         3.56 ns    197086506
> O1               6.87 ns         6.86 ns    101839250
> Os               8.23 ns         8.22 ns     85273333
>
>
> Using quickbench:
> https://quick-bench.com/q/sSwVvtrkOCp9q-XyKAevthiaNAw

Oops, sorry, those were my original results *without* the -mno-avx
-mno-sse options! But that just shows that vectorization makes the
function fast.
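
For anyone who wants to reproduce this locally, a minimal harness in the style quick-bench generates might look like the following. The function body is from the original post; the rest is an illustrative sketch, not the exact code behind the numbers above:

#include <benchmark/benchmark.h>

// Function under test, as in the original post.
void foo(int *a, const int *__restrict b, const int *__restrict c)
{
  for (int i = 0; i < 16; i++) {
    a[i] = b[i] + c[i];
  }
}

static void bench_foo(benchmark::State &state)
{
  int a[16] = {0}, b[16] = {0}, c[16] = {0};
  for (auto _ : state) {
    foo(a, b, c);
    // Keep the compiler from discarding the result or deleting
    // the stores in foo as dead.
    benchmark::DoNotOptimize(a[0]);
    benchmark::ClobberMemory();
  }
}
BENCHMARK(bench_foo);
BENCHMARK_MAIN();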

Turning vectorization off with -mno-avx -mno-sse, I get:

O3               58.3 ns         58.2 ns     11725604
O2               61.7 ns         61.6 ns     10930434
O1               7.37 ns         7.35 ns     95752192
Os               8.57 ns         8.56 ns     79448548

So it does look like GCC is making poor choices here.
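
For comparison, the "naive" code the original question expects - each sum computed into a register and stored straight into a, with no staging buffer on the stack - corresponds to a hand-unrolled source form like this (an illustrative sketch only, not GCC's actual output):

/* Hand-unrolled form of the loop: each b[i] + c[i] fits in one
   register and can be stored directly to a[i], so no temporary
   array on the stack is needed. */
void foo_unrolled(int *a, const int *__restrict b, const int *__restrict c)
{
  a[0]  = b[0]  + c[0];
  a[1]  = b[1]  + c[1];
  a[2]  = b[2]  + c[2];
  a[3]  = b[3]  + c[3];
  a[4]  = b[4]  + c[4];
  a[5]  = b[5]  + c[5];
  a[6]  = b[6]  + c[6];
  a[7]  = b[7]  + c[7];
  a[8]  = b[8]  + c[8];
  a[9]  = b[9]  + c[9];
  a[10] = b[10] + c[10];
  a[11] = b[11] + c[11];
  a[12] = b[12] + c[12];
  a[13] = b[13] + c[13];
  a[14] = b[14] + c[14];
  a[15] = b[15] + c[15];
}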



