Hello,

[ I don't have a way to test the described testcases against a newer compiler: could someone verify whether this bug applies to the SVN version of GCC? ]

GCC-4.8.1 misses several optimizations when using NEON intrinsics. Consider the following snippet:

#include <arm_neon.h>

uint64_t* foo(uint64_t* x, uint32_t y)
{
    uint64x2_t d = vreinterpretq_u64_u32(vdupq_n_u32(y));
    vst1q_u64(x, d); x += 2;
    vst1q_u64(x, d); x += 2;
    vst1q_u64(x, d); x += 2;
    vst1q_u64(x, d); x += 2;
    vst1q_u64(x, d); x += 2;
    vst1q_u64(x, d); x += 2;
    vst1q_u64(x, d); x += 2;
    vst1q_u64(x, d); x += 2;
    return x;
}

'g++ test.cc -O3 -mfpu=neon --save-temps -c' produces the following assembly:

_Z3fooPyj:
        push    {r4, r5, r6, r7}
        vdup.32 q8, r1
        add     r7, r0, #32
        add     r6, r0, #48
        add     r5, r0, #64
        add     r4, r0, #80
        add     r1, r0, #96
        add     r2, r0, #112
        mov     r3, r0
        adds    r0, r0, #128
        vst1.64 {d16-d17}, [r3:64]!
        vst1.64 {d16-d17}, [r3:64]
        vst1.64 {d16-d17}, [r7:64]
        vst1.64 {d16-d17}, [r6:64]
        vst1.64 {d16-d17}, [r5:64]
        vst1.64 {d16-d17}, [r4:64]
        vst1.64 {d16-d17}, [r1:64]
        vst1.64 {d16-d17}, [r2:64]
        pop     {r4, r5, r6, r7}
        bx      lr

It's obvious that the GCC approach is not optimal. The main problem is that the pointer post-increment (writeback) feature of the vst1.64 instruction is not fully utilized: GCC apparently figures it out for the first store, but gets confused afterwards. I would expect GCC to produce the following output:

_Z3fooPyj:
        vdup.32 q8, r1
        vst1.64 {d16-d17}, [r0:64]!
        vst1.64 {d16-d17}, [r0:64]!
        vst1.64 {d16-d17}, [r0:64]!
        vst1.64 {d16-d17}, [r0:64]!
        vst1.64 {d16-d17}, [r0:64]!
        vst1.64 {d16-d17}, [r0:64]!
        vst1.64 {d16-d17}, [r0:64]!
        vst1.64 {d16-d17}, [r0:64]!
        bx      lr

In unrolled loops GCC spills almost all registers to memory, which causes two to three times worse performance compared to the optimal version. Unfortunately I couldn't get GCC to generate the optimal sequence by any means and had to resort to hand-written assembly.

Could someone verify whether the above bug is present in the SVN version?

Thanks,
Povilas
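
P.S. For anyone who wants to reproduce the hand-written workaround: a minimal sketch of a standalone .S file is below. It is just the expected sequence above wrapped in the usual assembler directives; the symbol name neon_fill_128 is a placeholder (declare it extern "C" on the C++ side to avoid name mangling), not the name used in my actual code.

        .syntax unified
        .fpu    neon
        .text
        .align  2
        .global neon_fill_128                  @ placeholder name, extern "C" prototype on the C++ side
        .type   neon_fill_128, %function
@ uint64_t* neon_fill_128(uint64_t* x /* r0 */, uint32_t y /* r1 */);
neon_fill_128:
        vdup.32 q8, r1                          @ broadcast y into all four 32-bit lanes of q8
        vst1.64 {d16-d17}, [r0:64]!             @ eight 16-byte stores, post-incrementing r0
        vst1.64 {d16-d17}, [r0:64]!
        vst1.64 {d16-d17}, [r0:64]!
        vst1.64 {d16-d17}, [r0:64]!
        vst1.64 {d16-d17}, [r0:64]!
        vst1.64 {d16-d17}, [r0:64]!
        vst1.64 {d16-d17}, [r0:64]!
        vst1.64 {d16-d17}, [r0:64]!
        bx      lr                              @ r0 now points past the last store, like the return value of foo()
        .size   neon_fill_128, .-neon_fill_128

q8 (d16-d17) is not callee-saved under the AAPCS, so no prologue/epilogue is needed.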