Re: Why does this unrolled function write to the stack?

Jonathan Wakely via Gcc-help <gcc-help@xxxxxxxxxxx> · Wed, 8 Feb 2023 13:49:50 +0000

On Wed, 8 Feb 2023 at 13:31, Gaelan Steele via Gcc-help
<gcc-help@xxxxxxxxxxx> wrote:
>
> Hi all,
>
> In a computer architecture class, we happened across a strange compilation choice by GCC that neither I nor my professor can make much sense of. The source is as follows:
>
> void foo(int *a, const int *__restrict b, const int *__restrict c)
> {
>   for (int i = 0; i < 16; i++) {
>     a[i] = b[i] + c[i];
>   }
> }
>
> I won't reproduce the full compiled output here, as it's rather long, but when compiled with -O3 -mno-avx -mno-sse, GCC 12.2 for x86-64 (via Compiler Explorer: https://godbolt.org/z/o9e4o7cj4) produces an unrolled loop that appears to write each sum into an array on the stack before copying it into the provided pointer a. This seems hugely inefficient - it's doing quite a few memory accesses - and I can't see why it would be necessary.

I don't think it's *necessary*. If you use -Os, -O1 or -O2 you get a
loop, so it's just an optimization choice at -O3, presumably based on
cost estimates that say fully unrolling the loop will make the code
faster than looping.
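
For comparison, here is a rough source-level sketch of the two shapes
being discussed: the loop as written, and a hand-unrolled version that
computes each sum into a local and stores it straight into a. The names
foo_loop, foo_unrolled, N and STEP are made up for illustration, and this
is not meant to show what GCC actually emits at any optimization level.

```c
/* Hypothetical names throughout; only the loop body comes from the question. */
#define N 16

/* The loop as written in the original question. */
void foo_loop(int *a, const int *__restrict b, const int *__restrict c)
{
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];
}

/* One unrolled step: compute the sum into a local, then store it into a. */
#define STEP(i) do { int t = b[(i)] + c[(i)]; a[(i)] = t; } while (0)

/* Hand-unrolled equivalent: 16 straight-line steps, no loop counter. */
void foo_unrolled(int *a, const int *__restrict b, const int *__restrict c)
{
    STEP(0);  STEP(1);  STEP(2);  STEP(3);
    STEP(4);  STEP(5);  STEP(6);  STEP(7);
    STEP(8);  STEP(9);  STEP(10); STEP(11);
    STEP(12); STEP(13); STEP(14); STEP(15);
}
```

Pasting both functions into Compiler Explorer with the same flags is one
way to see how the output for each differs between -O2 and -O3.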

>
> Am I missing some reason why this is more efficient than the naive approach (computing each sum into an intermediate register, then writing it directly into a)?

Benchmarking the function at different optimization levels, I get the
results below, where the -O3 code is the fastest of the four:

Run on (8 X 4500 MHz CPU s)
CPU Caches:
 L1 Data 32 KiB (x4)
 L1 Instruction 32 KiB (x4)
 L2 Unified 256 KiB (x4)
 L3 Unified 8192 KiB (x1)
Load Average: 0.14, 0.22, 0.39
***WARNING*** CPU scaling is enabled, the benchmark real time
measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
O3               1.60 ns         1.60 ns    432901632
O2               3.56 ns         3.56 ns    197086506
O1               6.87 ns         6.86 ns    101839250
Os               8.23 ns         8.22 ns     85273333

Using quickbench:
https://quick-bench.com/q/sSwVvtrkOCp9q-XyKAevthiaNAw
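
If you want to try something similar locally without Google Benchmark,
a minimal sketch of a timing harness might look like the one below. It is
not the code behind the quick-bench link. The helper name
time_foo_total_ns is made up, the noinline attribute and the empty asm
statement are GCC-specific additions to keep the call from being
optimized away, and clock() is much coarser than a real benchmark
library, so expect the numbers to differ from the ones above.

```c
#include <time.h>

/* Function from the original question. The noinline attribute is an
   addition, so the call is not inlined into the timing loop. */
__attribute__((noinline))
void foo(int *a, const int *__restrict b, const int *__restrict c)
{
    for (int i = 0; i < 16; i++)
        a[i] = b[i] + c[i];
}

/* Hypothetical helper: total CPU time, in nanoseconds, spent on
   `iterations` calls of foo. Integer arithmetic only, so the file still
   builds on x86-64 with -mno-sse. */
long long time_foo_total_ns(long iterations)
{
    int a[16], b[16], c[16];
    for (int i = 0; i < 16; i++) {
        b[i] = i;
        c[i] = 2 * i;
    }

    clock_t start = clock();
    for (long n = 0; n < iterations; n++) {
        foo(a, b, c);
        /* Empty asm with a "memory" clobber: GCC must assume a[] is read
           here, so the stores done by foo cannot be discarded. */
        __asm__ volatile("" : : "r"(a) : "memory");
    }
    clock_t end = clock();

    /* Convert clock ticks to nanoseconds; this assumes CLOCKS_PER_SEC
       divides 1e9 evenly, as it does with the POSIX value 1000000. */
    return (long long)(end - start) * (1000000000LL / CLOCKS_PER_SEC);
}
```

Calling time_foo_total_ns with a large count from main, dividing the
result by that count, and rebuilding with -O1, -O2, -O3 and -Os in turn
gives a rough per-call figure for each level.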