On 08/02/2023 14:53, Jonathan Wakely via Gcc-help wrote:
On Wed, 8 Feb 2023 at 13:49, Jonathan Wakely <jwakely.gcc@xxxxxxxxx> wrote:
On Wed, 8 Feb 2023 at 13:31, Gaelan Steele via Gcc-help
<gcc-help@xxxxxxxxxxx> wrote:
Hi all,
In a computer architecture class, we happened across a strange compilation choice by GCC that neither I nor my professor can make much sense of. The source is as follows:
void foo(int *a, const int *__restrict b, const int *__restrict c)
{
    for (int i = 0; i < 16; i++) {
        a[i] = b[i] + c[i];
    }
}
I won't reproduce the full compiled output here, as it's rather long, but when compiled with -O3 -mno-avx -mno-sse, GCC 12.2 for x86-64 (via Compiler Explorer: https://godbolt.org/z/o9e4o7cj4) produces an unrolled loop that appears to write each sum into an array on the stack before copying it into the provided pointer a. This seems hugely inefficient - it's doing quite a few memory accesses - and I can't see why it would be necessary.
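For anyone who wants to look at this locally rather than on Compiler Explorer, a self-contained file along the following lines might be a reasonable starting point. The function foo is copied from the post; check_foo is a hypothetical helper added here only to confirm that the sums come out right, and its input values are arbitrary. Compiling with something like "gcc -O3 -mno-avx -mno-sse -S" should give assembly to compare against the Compiler Explorer output.

```c
/* foo() as given in the original post. */
void foo(int *a, const int *__restrict b, const int *__restrict c)
{
    for (int i = 0; i < 16; i++) {
        a[i] = b[i] + c[i];
    }
}

/* Hypothetical helper (not from the post): returns 1 if foo() produced
 * the expected element-wise sums, 0 otherwise. The three arrays are
 * distinct, so the __restrict qualifiers are respected. */
int check_foo(void)
{
    int a[16] = {0};
    int b[16];
    int c[16];

    for (int i = 0; i < 16; i++) {
        b[i] = i;        /* arbitrary test inputs */
        c[i] = 100 * i;
    }

    foo(a, b, c);

    for (int i = 0; i < 16; i++) {
        if (a[i] != 101 * i) {
            return 0;
        }
    }
    return 1;
}
```

Calling check_foo() from a small main() is enough to exercise the function; the concern in this thread is the extra stack traffic in the generated code, not the computed values.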
I don't think it's *necessary*. If you use -Os or -O1 or -O2 you get a
loop. So it's just an optimization choice at -O3, presumably based on
cost estimates that say that fully unrolling the loop will make the
code faster than looping.
There's nothing wrong with the loop unrolling. It's the use of space on
the stack that's the problem.
So it does look like GCC is making poor choices here.
It seems to be a regression between gcc 10 and gcc 11 (discovered by
changing the compiler on godbolt.org). With gcc 11 onwards, the
compiler seems to be using the stack to combine two 4-byte elements at a
time into a single 8-byte element. It's easy to see the effect by
changing the loop size to 2.
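As a sketch of that reduction, the only change from the original function is the trip count. The names foo2 and foo2_direct are made up for this illustration; foo2_direct is a hand-written guess at the straightforward form one might expect (load, add, store straight to a), not output from any compiler.

```c
/* Reduced test case: the original loop with the trip count cut to 2.
 * foo2 is a hypothetical name used only for this illustration. */
void foo2(int *a, const int *__restrict b, const int *__restrict c)
{
    for (int i = 0; i < 2; i++) {
        a[i] = b[i] + c[i];
    }
}

/* Hand-unrolled equivalent (also hypothetical): each sum is stored
 * directly through a, with no intermediate buffer in the source. */
void foo2_direct(int *a, const int *__restrict b, const int *__restrict c)
{
    a[0] = b[0] + c[0];
    a[1] = b[1] + c[1];
}
```

Comparing the assembly for foo2 across compiler versions on godbolt.org, with the same -O3 -mno-avx -mno-sse flags, should show the difference in only a handful of instructions.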
(I've no idea what causes the effect, or how to fix it - but knowing it
is a regression might make it easier for you to find.)