[PATCH] core: Fix a little-endian bug in ARM svolume code

pmeerw@xxxxxxxxxx (Peter Meerwald) · Tue, 23 Oct 2012 15:48:23 +0200 (CEST)

Hello myself,

> comparing ARM vs. NEON code, the svolume s16 NEON code uses two MULs, 
> while ARM can do with one -- the ARM instructions (smulwb, ssat) look 
> ideal for the svolume_s16 code

for the record, NEON can also do it with one MUL:

static inline void vol_s16_neon(const uint32x4_t *vol4, int16_t *samples, unsigned length) {
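    asm volatile (
    "mov        %[length], %[length], lsr #2\n\t"  /* length /= 4: four samples per iteration */
    "vld1.s32   {q1}, [%[vol]]\n\t"                /* q1 = four 32-bit volume factors */
    "1:\n\t"
    "vld1.16    {d0}, [%[samples]]\n\t"            /* d0 = four s16 samples */
    "vshll.s16  q0, d0, #15\n\t"                   /* widen to 32 bit, shift left by 15 */
    "vqdmulhq.s32 q0, q0, q1\n\t"                  /* high half of saturating doubling multiply */
    "vmovn.s32  d0, q0\n\t"                        /* narrow to 16 bit, no saturation */
    "subs       %[length], %[length], #1\n\t"      /* one group of four done */
    "vst1.16    {d0}, [%[samples]]!\n\t"           /* store and advance the sample pointer */
    "bgt        1b\n\t"                            /* loop while groups remain */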
      /* output operands (or input operands that get modified) */
    : [samples] "+r" (samples), [length] "+r" (length)
    : [vol] "r" (vol4) /* input operands */
    : "memory", "cc", "q0", "q1" /* clobber list */
    );
}
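
To make the arithmetic of the single multiply easier to follow, here is a rough scalar sketch of what one lane of the loop above computes. The function name vol_s16_lane_model is hypothetical and not part of the patch, and the signed int32_t volume parameter is an assumption (vqdmulh.s32 treats the lanes as signed). Under this model the result is floor(sample * vol / 65536), so 0x10000 acts as unity gain.

```c
#include <stdint.h>

/* Hypothetical scalar model of one lane of the NEON loop; not part of the patch. */
static inline int16_t vol_s16_lane_model(int16_t sample, int32_t vol) {
    const int64_t two32 = (int64_t) 1 << 32;

    /* vshll.s16 #15: widen the sample to 32 bit and shift left by 15 */
    int64_t widened = (int64_t) sample * 32768;

    /* vqdmulh.s32: doubling multiply, keep the high 32 bits.
     * Saturation only triggers when both inputs are INT32_MIN, which the
     * widened sample can never be, so it is left out of this model. */
    int64_t doubled = 2 * widened * (int64_t) vol;
    int64_t high = doubled / two32;
    if (doubled % two32 < 0)
        high--;                       /* floor, like an arithmetic right shift */

    /* vmovn.s32: keep the low 16 bits, without saturation */
    uint16_t low = (uint16_t) high;
    return (low >= 0x8000) ? (int16_t) ((int32_t) low - 65536) : (int16_t) low;
}
```

For example, vol_s16_lane_model(1000, 0x8000) gives 500. Because the narrowing step does not saturate, a gain above unity can wrap in this model: vol_s16_lane_model(32767, 0x20000) gives -2.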

Checking ARM NEON svolume
func: 1291289 usec (min = 12817, max = 13184, stddev = 65.9113).
orig: 2438875 usec (min = 24322, max = 25605, stddev = 130.359).
Orc not supported. Skipping
100%: Checks: 3, Failures: 0, Errors: 0

this is a bit better than the previous NEON code (~1300000 vs. ~1510000 usec), 
but still slower than ARM (~920000 usec)

regards, p.

-- 

Peter Meerwald
+43-664-2444418 (mobile)