Hi Povilas,
I can confirm that mainline ARM GCC generates code similar to what
you've observed. Could you please file a Bugzilla report for this issue at
http://gcc.gnu.org/bugzilla/
Thanks,
Yufeng
On 11/11/13 00:57, Povilas Kanapickas wrote:
Hello,
[ I don't have a way to test the described testcases against a newer
compiler: could someone verify whether this bug applies to the SVN
version of GCC? ]
GCC-4.8.1 misses several optimizations when using NEON intrinsics.
Consider the following snippet:
#include <arm_neon.h>

uint64_t* foo(uint64_t* x, uint32_t y)
{
    uint64x2_t d = vreinterpretq_u64_u32(vdupq_n_u32(y));
    vst1q_u64(x, d);
    x += 2;
    vst1q_u64(x, d);
    x += 2;
    vst1q_u64(x, d);
    x += 2;
    vst1q_u64(x, d);
    x += 2;
    vst1q_u64(x, d);
    x += 2;
    vst1q_u64(x, d);
    x += 2;
    vst1q_u64(x, d);
    x += 2;
    vst1q_u64(x, d);
    x += 2;
    return x;
}
'g++ test.cc -O3 -mfpu=neon --save-temps -c' produces the following
assembly:
_Z3fooPyj:
        push    {r4, r5, r6, r7}
        vdup.32 q8, r1
        add     r7, r0, #32
        add     r6, r0, #48
        add     r5, r0, #64
        add     r4, r0, #80
        add     r1, r0, #96
        add     r2, r0, #112
        mov     r3, r0
        adds    r0, r0, #128
        vst1.64 {d16-d17}, [r3:64]!
        vst1.64 {d16-d17}, [r3:64]
        vst1.64 {d16-d17}, [r7:64]
        vst1.64 {d16-d17}, [r6:64]
        vst1.64 {d16-d17}, [r5:64]
        vst1.64 {d16-d17}, [r4:64]
        vst1.64 {d16-d17}, [r1:64]
        vst1.64 {d16-d17}, [r2:64]
        pop     {r4, r5, r6, r7}
        bx      lr
GCC's approach is clearly not optimal. The main problem is that the
pointer auto-increment feature of the vst1.64 instruction is not fully
utilized. GCC apparently figures it out for the first store, but then
falls back to computing a separate address for each remaining store,
which occupies six more registers and forces r4-r7 to be saved and
restored. I would expect GCC to produce the following output:
_Z3fooPyj:
        vdup.32 q8, r1
        vst1.64 {d16-d17}, [r0:64]!
        vst1.64 {d16-d17}, [r0:64]!
        vst1.64 {d16-d17}, [r0:64]!
        vst1.64 {d16-d17}, [r0:64]!
        vst1.64 {d16-d17}, [r0:64]!
        vst1.64 {d16-d17}, [r0:64]!
        vst1.64 {d16-d17}, [r0:64]!
        vst1.64 {d16-d17}, [r0:64]!
        bx      lr
In unrolled loops GCC spills almost all registers to memory, which
makes the code two to three times slower than the optimal version.
Unfortunately, I could not get GCC to generate the optimal sequence by
any means and had to resort to hand-written assembly.
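For illustration, a workaround of that kind might look roughly like the
sketch below. It is an assumption about what such code could look like,
not the code actually used: the name fill8_q and the FILL8_USE_NEON_ASM
macro are hypothetical, the inline asm relies on GCC's ARM operand
modifiers %e and %f (low and high D register of a Q operand), and a
plain C++ fallback is included so that the snippet also builds on
targets without NEON.

```cpp
#include <cstdint>

// Use the hand-written NEON path only on 32-bit ARM with GCC-style inline asm.
#if defined(__GNUC__) && defined(__arm__) && \
    (defined(__ARM_NEON) || defined(__ARM_NEON__))
#include <arm_neon.h>
#define FILL8_USE_NEON_ASM 1
#endif

// Hypothetical helper: stores eight 16-byte blocks starting at x, each filled
// with the 32-bit value y, and returns the pointer just past the last block.
uint64_t* fill8_q(uint64_t* x, uint32_t y)
{
#ifdef FILL8_USE_NEON_ASM
    uint32x4_t d = vdupq_n_u32(y);
    // Eight post-incrementing stores through a single pointer register.
    // %e1 / %f1 name the low / high D register of the Q register holding d.
    __asm__ volatile(
        "vst1.64 {%e1-%f1}, [%0:64]!\n\t"
        "vst1.64 {%e1-%f1}, [%0:64]!\n\t"
        "vst1.64 {%e1-%f1}, [%0:64]!\n\t"
        "vst1.64 {%e1-%f1}, [%0:64]!\n\t"
        "vst1.64 {%e1-%f1}, [%0:64]!\n\t"
        "vst1.64 {%e1-%f1}, [%0:64]!\n\t"
        "vst1.64 {%e1-%f1}, [%0:64]!\n\t"
        "vst1.64 {%e1-%f1}, [%0:64]!\n\t"
        : "+r"(x)    // pointer is read and advanced by 16 bytes per store
        : "w"(d)     // value to store, kept in a NEON register
        : "memory"); // the asm writes through x
    return x;
#else
    // Portable fallback: both 32-bit halves of every 64-bit element are y.
    const uint64_t pattern = (static_cast<uint64_t>(y) << 32) | y;
    for (int i = 0; i < 16; ++i)
        x[i] = pattern;
    return x + 16;
#endif
}
```

Called as fill8_q(buf, y) on a buffer of at least 16 uint64_t elements,
this is meant to behave like foo() above. The "memory" clobber is there
because the compiler cannot otherwise see that the asm writes through x.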
Could someone verify whether the above bug is present in the SVN version?
Thanks,
Povilas