Re: Missing optimization on ARM NEON

Hi Povilas,

I can confirm that mainline GCC for ARM generates code similar to what you observed. Could you please file a Bugzilla report for this issue at http://gcc.gnu.org/bugzilla/?

Thanks,
Yufeng

On 11/11/13 00:57, Povilas Kanapickas wrote:
Hello,

[ I don't have a way to test the described testcases against a newer
compiler: could someone verify whether this bug applies to the SVN
version of GCC? ]

GCC-4.8.1 misses several optimizations when using NEON intrinsics.
Consider the following snippet:

#include <arm_neon.h>

uint64_t* foo(uint64_t* x, uint32_t y)
{
     uint64x2_t d = vreinterpretq_u64_u32(vdupq_n_u32(y));
     vst1q_u64(x, d);
     x+=2;
     vst1q_u64(x, d);
     x+=2;
     vst1q_u64(x, d);
     x+=2;
     vst1q_u64(x, d);
     x+=2;
     vst1q_u64(x, d);
     x+=2;
     vst1q_u64(x, d);
     x+=2;
     vst1q_u64(x, d);
     x+=2;
     vst1q_u64(x, d);
     x+=2;
     return x;
}

'g++ test.cc -O3 -mfpu=neon --save-temps -c' produces the following
assembly:

_Z3fooPyj:
	push	{r4, r5, r6, r7}
	vdup.32	q8, r1
	add	r7, r0, #32
	add	r6, r0, #48
	add	r5, r0, #64
	add	r4, r0, #80
	add	r1, r0, #96
	add	r2, r0, #112
	mov	r3, r0
	adds	r0, r0, #128
	vst1.64	{d16-d17}, [r3:64]!
	vst1.64	{d16-d17}, [r3:64]
	vst1.64	{d16-d17}, [r7:64]
	vst1.64	{d16-d17}, [r6:64]
	vst1.64	{d16-d17}, [r5:64]
	vst1.64	{d16-d17}, [r4:64]
	vst1.64	{d16-d17}, [r1:64]
	vst1.64	{d16-d17}, [r2:64]
	pop	{r4, r5, r6, r7}
	bx	lr

The GCC output is clearly not optimal. The main problem is that the
pointer post-increment (writeback) feature of the vst1.64 instruction
is not fully utilized. GCC uses it for the first store, but for the
later stores it falls back to addresses precomputed in separate
registers, which in turn forces it to save and restore r4-r7. I would
expect GCC to produce the following output:

_Z3fooPyj:
	vdup.32	q8, r1
	vst1.64	{d16-d17}, [r0:64]!
	vst1.64	{d16-d17}, [r0:64]!
	vst1.64	{d16-d17}, [r0:64]!
	vst1.64	{d16-d17}, [r0:64]!
	vst1.64	{d16-d17}, [r0:64]!
	vst1.64	{d16-d17}, [r0:64]!
	vst1.64	{d16-d17}, [r0:64]!
	vst1.64	{d16-d17}, [r0:64]!
	bx	lr

In unrolled loops GCC spills almost all registers to memory, which
makes the code two to three times slower than the optimal version.
Unfortunately, I couldn't get GCC to generate the optimal code by any
means and had to use assembly.
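
For anyone checking a hand-written assembly replacement against the
intrinsics version, a scalar reference may be handy. This is only a
minimal sketch: the name foo_reference is hypothetical and not part of
the testcase. It relies on the fact that every 64-bit lane of the
duplicated vector holds y in both halves, so the result does not depend
on endianness.

```cpp
#include <cstdint>

// Hypothetical scalar reference for foo(): same stores and return
// value as the NEON testcase, but without any NEON dependency.
uint64_t* foo_reference(uint64_t* x, uint32_t y)
{
    // vdupq_n_u32(y) replicates y into four 32-bit lanes; viewed as
    // two 64-bit lanes, each lane contains y in both of its halves.
    const uint64_t lane = (static_cast<uint64_t>(y) << 32) | y;

    // Eight 128-bit stores cover 16 consecutive 64-bit elements.
    for (int i = 0; i < 16; ++i)
        x[i] = lane;

    // The testcase advances x by 2 after each of the eight stores.
    return x + 16;
}
```

Comparing the buffer contents and the returned pointer of the assembly
routine against this function should catch an off-by-one store or a
wrong final address.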

Could someone verify whether the above bug is present in the SVN version?

Thanks,
Povilas







