Re: quality of code produced for arm cortex-A8

Richard Earnshaw <Richard.Earnshaw@xxxxxxxxxxxxxxxxxxxxxxx> · Sat, 17 Jul 2010 09:53:28 +0100

On 17/07/10 07:45, Ajmal Ahammed wrote:
Thanks for your reply.

Here is an example:

#include <arm_neon.h>
void neon_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
   int i;
   uint8x8_t rfac = vdup_n_u8 (77);
   uint8x8_t gfac = vdup_n_u8 (151);
   uint8x8_t bfac = vdup_n_u8 (28);
   n/=8;

   for (i=0; i<n; i++)
   {
     uint16x8_t  temp;
     uint8x8x3_t rgb  = vld3_u8 (src);
     uint8x8x3_t rgb1  = vld3_u8 (src+2);
     uint8x8_t result;

     temp = vmull_u8 (rgb1.val[0],      rfac);
     temp = vmlal_u8 (temp,rgb.val[1], gfac);
     temp = vmlal_u8 (temp,rgb.val[2], bfac);

     result = vshrn_n_u16 (temp, 8);
     vst1_u8 (dest, result);
     src  += 8*3;
     dest += 8;
   }
}

Ah! That problem!  Yes, I'm afraid this is a known weakness in GCC's 
support of Neon at this time.  We're looking into it.  I think the 
problem stems from the way we expand the vld3_u8 intrinsic internally: 
this creates a copy of the object on the stack which the compiler is 
then unable to optimize away.
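
To make the shape of the problem concrete, here is a rough portable C 
model of the loop body. It is only a sketch, not GCC's actual internal 
expansion: the names model_u8x8x3, model_vld3_u8 and model_convert8 are 
made up for illustration, and it assumes all three channels come from a 
single load (the example above takes the first channel from a second 
load at src+2). The point is that the load result is one aggregate 
holding three vectors, returned by value; if the compiler cannot break 
that aggregate back up into its parts, it ends up living in memory.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for uint8x8x3_t: three 8-lane vectors in one aggregate. */
typedef struct {
    uint8_t val[3][8];
} model_u8x8x3;

/* Hypothetical scalar model of vld3_u8: de-interleave 24 bytes
   (r0 g0 b0 r1 g1 b1 ...) into three 8-lane vectors. */
static model_u8x8x3 model_vld3_u8(const uint8_t *src)
{
    model_u8x8x3 out;
    for (int lane = 0; lane < 8; lane++)
        for (int ch = 0; ch < 3; ch++)
            out.val[ch][lane] = src[3 * lane + ch];
    return out; /* the whole aggregate is returned by value */
}

/* Scalar model of one loop iteration: weighted sum of the three
   channels, then a narrowing shift right by 8 (as vshrn_n_u16 does). */
static void model_convert8(uint8_t *dest, const uint8_t *src)
{
    model_u8x8x3 rgb = model_vld3_u8(src);
    for (int lane = 0; lane < 8; lane++) {
        /* 77 + 151 + 28 == 256, so the sum always fits in 16 bits. */
        uint16_t temp = (uint16_t)(rgb.val[0][lane] * 77
                                   + rgb.val[1][lane] * 151
                                   + rgb.val[2][lane] * 28);
        dest[lane] = (uint8_t)(temp >> 8);
    }
}
```

In this model an ordinary optimizer would normally split rgb into its 
three parts and keep them in registers; the weakness described above is 
that the equivalent step does not currently happen for the Neon 
structure types.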

Note the compiler will tend to prefer using d16 upwards for Neon code 
because these are all scratch registers (which don't have to be saved 
before they can be re-used): it doesn't mean it can't use other 
registers, just that it felt it didn't need to.  The first problem I 
mention is probably masking the need to use more registers in this 
particular case.

R.