Hello myself,

> comparing ARM vs. NEON code, the svolume s16 NEON code uses two MULs,
> while ARM can do with one -- the ARM instructions (smulwb, ssat) look
> ideal for the svolume_s16 code

for the record, NEON can also do it with one MUL:

static inline void vol_s16_neon(const uint32x4_t *vol4, int16_t *samples, unsigned length) {
    asm volatile (
        "mov        %[length], %[length], lsr #2\n\t"
        "vld1.s32   {q1}, [%[vol]]\n\t"
        "1:\n\t"
        "vld1.16    {d0}, [%[samples]]\n\t"
        "vshll.s16  q0, d0, #15\n\t"
        "vqdmulhq.s32 q0, q0, q1\n\t"
        "vmovn.s32  d0, q0\n\t"
        "subs       %[length], %[length], #1\n\t"
        "vst1.16    {d0}, [%[samples]]!\n\t"
        "bgt        1b\n\t"
        /* output operands (or input operands that get modified) */
        : [samples] "+r" (samples), [length] "+r" (length)
        : [vol] "r" (vol4) /* input operands */
        : "memory", "cc", "q0", "q1" /* clobber list */
    );
}

Checking ARM NEON svolume
func: 1291289 usec (min = 12817, max = 13184, stddev = 65.9113).
orig: 2438875 usec (min = 24322, max = 25605, stddev = 130.359).
Orc not supported. Skipping
100%: Checks: 3, Failures: 0, Errors: 0

this is a bit better than the previous NEON code (~1300000 vs. ~1510000 usec),
but still slower than ARM (~920000 usec)

regards, p.

-- 
Peter Meerwald
+43-664-2444418 (mobile)
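
To check the arithmetic of the single-MUL sequence above, here is a rough scalar model of one lane of the loop body (vshll.s16 #15, vqdmulh.s32, vmovn.s32). It is only a sketch: the helper names model_vqdmulh_s32 and model_vol_s16_lane are made up for illustration, and the volume is assumed to be 16.16 fixed point (0x10000 == unity gain), which is what the shift amounts imply but is not stated in the mail.

```c
#include <stdint.h>

/* Scalar model of vqdmulh.s32: saturating doubling multiply, high half.
   Hypothetical helper, not part of any library. */
static int32_t model_vqdmulh_s32(int32_t a, int32_t b)
{
    /* The only case where 2*a*b overflows 64 bits; the instruction saturates. */
    if (a == INT32_MIN && b == INT32_MIN)
        return INT32_MAX;
    /* Assumes >> on a negative int64_t is an arithmetic shift,
       as it is with the usual ARM and x86 compilers. */
    return (int32_t)((2 * (int64_t)a * (int64_t)b) >> 32);
}

/* One lane of the loop body. vol is ASSUMED to be 16.16 fixed point. */
static int16_t model_vol_s16_lane(int16_t sample, int32_t vol)
{
    /* vshll.s16 q0, d0, #15: widen to 32 bits and shift left by 15
       (written as a multiply to avoid shifting a negative value in C). */
    int32_t widened = (int32_t)sample * 32768;

    /* vqdmulh.s32: (2 * (sample << 15) * vol) >> 32 == (sample * vol) >> 16 */
    int32_t scaled = model_vqdmulh_s32(widened, vol);

    /* vmovn.s32: keep the low 16 bits, no saturation.
       The final cast assumes two's-complement wraparound. */
    return (int16_t)(uint16_t)scaled;
}
```

With these assumptions, model_vol_s16_lane(1000, 0x8000) gives 500. Since vmovn narrows by truncation, a gain above unity can wrap around (32767 at gain 0x20000 comes out as -2 in this model); vqmovn.s32 would be the saturating narrow, closer to what ssat does on the ARM side.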