On Tue, 2012-10-23 at 15:48 +0200, Peter Meerwald wrote:
> Hello myself,
>
> > comparing ARM vs. NEON code, the svolume s16 NEON code uses two MULs,
> > while ARM can do with one -- the ARM instructions (smulwb, ssat) look
> > ideal for the svolume_s16 code
>
> for the records, NEON can also do it with one MUL:
>
> static inline void vol_s16_neon(const uint32x4_t *vol4, int16_t *samples, unsigned length) {
>     asm volatile (
>         "mov %[length], %[length], lsr #2\n\t"
>         "vld1.s32 {q1}, [%[vol]]\n\t"
>         "1:\n\t"
>         "vld1.16 {d0}, [%[samples]]\n\t"
>         "vshll.s16 q0, d0, #15\n\t"
>         "vqdmulhq.s32 q0, q0, q1\n\t"
>         "vmovn.s32 d0, q0\n\t"
>         "subs %[length], %[length], #1\n\t"
>         "vst1.16 {d0}, [%[samples]]!\n\t"
>         "bgt 1b\n\t"
>         /* output operands (or input operands that get modified) */
>         : [samples] "+r" (samples), [length] "+r" (length)
>         : [vol] "r" (vol4) /* input operands */
>         : "memory", "cc", "q0", "q1" /* clobber list */
>     );
> }
>
> Checking ARM NEON svolume
> func: 1291289 usec (min = 12817, max = 13184, stddev = 65.9113).
> orig: 2438875 usec (min = 24322, max = 25605, stddev = 130.359).
> Orc not supported. Skipping
> 100%: Checks: 3, Failures: 0, Errors: 0
>
> this is a bit better than the previous NEON code (~1300000 vs. ~1510000),
> but still slower than ARM (~920000)

Nice catch on the alignment. I'm trying to extend our tests to catch
these cases.

A couple of notes: Rémi Denis-Courmont mentions that you will likely see
performance benefits in the NEON code by sprinkling in some preloads
(PLD). I've also factored out the sconv code and that does provide a win
on all the boards I tried.

To get this moving for 3.0, could you respin just the sconv patches on
top of master (I'll push out my testing code soon) so that we can push
that bit out first while we work on the others?

Cheers,
Arun
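
For reference, a rough sketch of the preload idea in plain C rather than NEON assembly. This is not the PulseAudio code: the function name vol_s16_prefetch, the 16.16 fixed-point volume format (0x10000 == 1.0) and the prefetch distance of 32 samples are all assumptions made for illustration. It uses the GCC/Clang builtin __builtin_prefetch, which is only a hint and is typically emitted as PLD on ARM targets that support it. Whether it helps at all would have to be measured per board.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from PulseAudio: scale s16 samples in place by a
 * volume assumed to be 16.16 fixed point (0x10000 == 1.0), using one
 * multiply per sample. */
static void vol_s16_prefetch(int32_t vol, int16_t *samples, size_t length)
{
    /* Assumed prefetch distance: 32 samples (64 bytes) ahead. Tune per board. */
    const size_t ahead = 32;

    for (size_t i = 0; i < length; i++) {
        /* Once per 64-byte stretch, hint that data further ahead will be
         * read soon. This is a hint only; the result does not depend on it. */
        if ((i % ahead) == 0 && i + ahead < length)
            __builtin_prefetch(&samples[i + ahead], 0, 3);

        /* Widen, multiply once, drop the 16 fractional bits. Right shift of
         * a negative value is arithmetic on GCC/Clang. */
        int64_t t = ((int64_t)samples[i] * vol) >> 16;

        /* Saturate to the s16 range. */
        if (t > INT16_MAX)
            t = INT16_MAX;
        if (t < INT16_MIN)
            t = INT16_MIN;

        samples[i] = (int16_t)t;
    }
}
```

In the assembly version the equivalent would be a pld on an address some distance past [samples] inside the loop; the right distance is something to benchmark rather than guess.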