On 17/07/10 07:45, Ajmal Ahammed wrote:
Thanks for your reply. Here is an example:

#include <arm_neon.h>

void neon_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
  int i;
  uint8x8_t rfac = vdup_n_u8 (77);
  uint8x8_t gfac = vdup_n_u8 (151);
  uint8x8_t bfac = vdup_n_u8 (28);

  n /= 8;
  for (i = 0; i < n; i++)
    {
      uint16x8_t temp;
      uint8x8x3_t rgb = vld3_u8 (src);
      uint8x8x3_t rgb1 = vld3_u8 (src + 2);
      uint8x8_t result;

      temp = vmull_u8 (rgb1.val[0], rfac);
      temp = vmlal_u8 (temp, rgb.val[1], gfac);
      temp = vmlal_u8 (temp, rgb.val[2], bfac);
      result = vshrn_n_u16 (temp, 8);
      vst1_u8 (dest, result);
      src += 8 * 3;
      dest += 8;
    }
}

Ah, that problem! Yes, I'm afraid this is a known weakness in GCC's Neon support at this time, and we're looking into it. I think the problem stems from the way we expand the vld3_u8 intrinsic internally: the expansion creates a copy of the returned object on the stack, which the compiler is then unable to optimize away.
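
To make the shape of the problem a little more concrete, here is a rough, portable C model of what vld3_u8 computes: a de-interleaving load whose result is a single aggregate holding three 8-lane vectors, returned by value. The names u8x8x3_model and model_vld3_u8 are made up for illustration, and this is only a sketch of the intrinsic's semantics, not how GCC expands it internally. The point is that the result is a multi-vector structure, which is the kind of object that ends up being copied through memory.

```c
#include <stdint.h>

/* Hypothetical portable stand-in for uint8x8x3_t: three 8-lane
   vectors of bytes packed into one aggregate. */
typedef struct
{
  uint8_t val[3][8];
} u8x8x3_model;

/* Scalar model of a 3-way de-interleaving load (the semantics of
   vld3_u8): lane i of val[k] is read from src[3 * i + k], so 24
   consecutive bytes are split into three vectors of 8 bytes each. */
static u8x8x3_model
model_vld3_u8 (const uint8_t *src)
{
  u8x8x3_model out;
  int i, k;

  for (i = 0; i < 8; i++)
    for (k = 0; k < 3; k++)
      out.val[k][i] = src[3 * i + k];

  /* The whole aggregate is returned by value. */
  return out;
}
```

With packed RGB input, val[0] would hold the eight red bytes, val[1] the green bytes and val[2] the blue bytes, which is the layout the loop in the example relies on.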
Note that the compiler tends to prefer d16 and upwards for Neon code because these are all scratch registers (which don't have to be saved before they can be reused). That doesn't mean it can't use other registers, just that it felt it didn't need to. The first problem I mentioned is probably masking the need for more registers in this particular case.
R.