Missing optimization on ARM NEON

Povilas Kanapickas <tir5c3@xxxxxxxxxxx> · Mon, 11 Nov 2013 02:57:17 +0200

Hello,

[ I don't have a way to test the described testcases against a newer
compiler: could someone verify whether this bug applies to the SVN
version of GCC? ]

GCC-4.8.1 misses several optimizations when using NEON intrinsics.
Consider the following snippet:

#include <arm_neon.h>

uint64_t* foo(uint64_t* x, uint32_t y)
{
    uint64x2_t d = vreinterpretq_u64_u32(vdupq_n_u32(y));
    vst1q_u64(x, d);
    x+=2;
    vst1q_u64(x, d);
    x+=2;
    vst1q_u64(x, d);
    x+=2;
    vst1q_u64(x, d);
    x+=2;
    vst1q_u64(x, d);
    x+=2;
    vst1q_u64(x, d);
    x+=2;
    vst1q_u64(x, d);
    x+=2;
    vst1q_u64(x, d);
    x+=2;
    return x;
}

'g++ test.cc -O3 -mfpu=neon --save-temps -c' produces the following
assembly:

_Z3fooPyj:
	push	{r4, r5, r6, r7}
	vdup.32	q8, r1
	add	r7, r0, #32
	add	r6, r0, #48
	add	r5, r0, #64
	add	r4, r0, #80
	add	r1, r0, #96
	add	r2, r0, #112
	mov	r3, r0
	adds	r0, r0, #128
	vst1.64	{d16-d17}, [r3:64]!
	vst1.64	{d16-d17}, [r3:64]
	vst1.64	{d16-d17}, [r7:64]
	vst1.64	{d16-d17}, [r6:64]
	vst1.64	{d16-d17}, [r5:64]
	vst1.64	{d16-d17}, [r4:64]
	vst1.64	{d16-d17}, [r1:64]
	vst1.64	{d16-d17}, [r2:64]
	pop	{r4, r5, r6, r7}
	bx	lr

It's obvious that the GCC aproach is not optimal. The main problem is
that pointer autoincrement feature of the vst1.64 instruction is not
fully utilized. GCC apparently figures it out for the first store, but
it becomes confused later. I would expect GCC to produce the following
output:

_Z3fooPyj:
	vdup.32	q8, r1
        vst1.64	{d16-d17}, [r0:64]!
	vst1.64	{d16-d17}, [r0:64]!
	vst1.64	{d16-d17}, [r0:64]!
	vst1.64	{d16-d17}, [r0:64]!
	vst1.64	{d16-d17}, [r0:64]!
	vst1.64	{d16-d17}, [r0:64]!
	vst1.64	{d16-d17}, [r0:64]!
	vst1.64	{d16-d17}, [r0:64]!
	bx	lr

On unrolled loops GCC spills almost all registers to memory, which
causes two to three times worse performance compared to the optimal
version. Unfortunately I couldn't force GCC to generate it by any means
and had to use assembly.

Could someone verify whether the above bug ispresent in the SVN version?

Thanks,
Povilas