[PATCH] core: Fix a little-endian bug in ARM svolume code

pmeerw@xxxxxxxxxx (Peter Meerwald) · Tue, 23 Oct 2012 14:28:45 +0200 (CEST)

Hello Arun,

> > > > checking NEON volume_float32ne
> > > > NEON: 10223 usec.
> > > > ref: 46480 usec.
> > > > checking NEON volume_s16ne
> > > > NEON: 8484 usec.
> > > > ARM: 339272 usec.
> > > > ref: 20203 usec.

I was testing with SAMPLES 1019, while you are likely testing with
SAMPLES 1022:

Checking ARMv6 svolume (with 1019 samples)
func: 33923743 usec (min = 338868, max = 341919, stddev = 365.753).
orig: 2430664 usec (min = 24261, max = 24445, stddev = 42.2141).

Checking ARMv6 svolume (with 1022 samples)
func: 915036 usec (min = 9094, max = 9338, stddev = 50.2385).
orig: 2437988 usec (min = 24322, max = 24536, stddev = 48.1282).

> > > That's odd indeed. I have this on a Freescale i.mx53 (also Cortex A8)
> > > Checking ARM svolume
> > > func: 905150 usec (min = 9006, max = 9562, stddev = 76.1938).
> > > orig: 2278824 usec (min = 22760, max = 23252, stddev = 65.5575).

I get similar numbers with SAMPLES 1022 on a beagle-xm; I think you'll 
see catastrophic runtime with SAMPLES 1019

comparing ARM vs. NEON code, the svolume s16 NEON code uses two MULs, 
while ARM can do with one -- the ARM instructions (smulwb, ssat) look 
ideal for the svolume_s16 code
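
As a rough illustration of why smulwb and ssat fit so well, here is a C
sketch of what the pair computes for one sample. It assumes the volume is
a 16.16 fixed-point value (0x10000 = unity gain); the name scale_s16 is
made up for this example and is not taken from the PulseAudio source.

```c
#include <stdint.h>

/* Hypothetical helper, not PulseAudio code: models smulwb followed by
 * ssat #16. Assumes volume is 16.16 fixed point (0x10000 = unity). */
static int16_t scale_s16(int16_t sample, int32_t volume) {
    /* smulwb: signed 32-bit x 16-bit multiply that keeps the top 32 bits
     * of the 48-bit product, i.e. the product shifted right by 16.
     * Right-shifting a negative value is implementation-defined in C,
     * but arithmetic on the usual compilers. */
    int64_t t = ((int64_t) volume * sample) >> 16;

    /* ssat #16: saturate to the signed 16-bit range. */
    if (t > INT16_MAX) t = INT16_MAX;
    if (t < INT16_MIN) t = INT16_MIN;
    return (int16_t) t;
}
```

With the sample 0x43e9 and the volume 0x0000d716 from the quoted cpu-test
output, this gives 0x390e, which is the reference value in that log line.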

three observations:

(1) when the number of samples is odd, the ARM code processes the first 
sample before switching to the unrolled 4-samples-at-a-time loop; this 
causes the samples pointer to become misaligned (only 2-byte aligned, 
assuming it was 4-byte aligned initially)

I am not sure what guarantees PulseAudio gives on buffer alignment 

(2) the NEON code generally fails when input data length < 4; can be 
easily fixed

(3) neither ARM nor NEON code cares about alignment; just the strategy is 
different

ARM handles cases where length % 4 != 0 first (before entering the 
unrolled loop), which is bad when the sample buffer is aligned;
NEON takes care of length % 4 != 0 for the last samples, which is good 
when the sample buffer is aligned
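
To make the difference concrete, here is a small C sketch of where the
unrolled loop starts under each strategy. It assumes s16 samples (2 bytes
each), a 4-sample unrolled loop, and that the head-first variant consumes
the length % 4 leftover samples before the loop; the function names are
invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration, not PulseAudio code. */

/* Head-first (ARM-style): the length % 4 leftover samples are consumed
 * before the unrolled loop, so the loop starts this many bytes in. */
static size_t head_first_loop_offset(size_t n_samples) {
    return (n_samples % 4) * sizeof(int16_t);
}

/* Tail-last (NEON-style): the unrolled loop starts at the buffer start
 * and the leftovers are handled afterwards. */
static size_t tail_last_loop_offset(size_t n_samples) {
    (void) n_samples;
    return 0;
}

/* Nonzero if base + offset is 4-byte aligned. */
static int loop_is_word_aligned(uintptr_t base, size_t offset) {
    return (base + offset) % 4 == 0;
}
```

For a 4-byte aligned buffer, 1019 samples give a head-first offset of 6
bytes (only 2-byte aligned), while 1022 samples give 4 bytes (still 4-byte
aligned); tail-last keeps the offset at 0 in both cases. That lines up
with the slow and fast ARMv6 runs above.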

> > # ./cpu-test 
> > Running suite(s): CPU
> > CPU flags: V6 V7 VFP EDSP NEON VFPV3 Cortex-A8 
> > Initialising ARM optimized volume functions.
> > Checking ARM svolume
> > 0: 1ac8 != 390e (43e9 * 0000d716)
> > Orc not supported. Skipping
> > 50%: Checks: 2, Failures: 1, Errors: 0
> > tests/cpu-test.c:52:F:svolume:svolume_arm_test:0: Failed

> Does this include the little-endianness fix?

my fault; took the latest source, but failed to make sure that the proper 
.so was linked -> it works with little-endian fixes actually deployed

> My current testing shows NEON svolume code with int16 samples
> consistently slower than the ARM code (tried on the Pandaboard, i.mx51,
> i.mx53, i.mx6) by ~10% in most cases.

I agree; I think the ARM code is pretty good for s16 handling plus I think 
enabling NEON is expensive power-wise -- so svolume_s16_neon should be 
dropped and svolume_s16_arm should be improved to handle odd buffers 
nicely
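
One possible shape for that, sketched in plain C rather than ARM assembly:
run the 4-sample loop from the start of the buffer and handle the 0..3
leftover samples at the end, which also covers buffers shorter than 4
samples. This is only a sketch under simplifying assumptions (a single
16.16 fixed-point volume for all channels); scale_one and
svolume_s16_tail_last are hypothetical names, not existing functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: (volume * sample) >> 16, saturated to 16 bits.
 * Relies on arithmetic right shift for negative values. */
static int16_t scale_one(int16_t sample, int32_t volume) {
    int64_t t = ((int64_t) volume * sample) >> 16;
    if (t > INT16_MAX) t = INT16_MAX;
    if (t < INT16_MIN) t = INT16_MIN;
    return (int16_t) t;
}

/* Hypothetical tail-last volume loop; assumes one volume for all channels. */
static void svolume_s16_tail_last(int16_t *samples, size_t n, int32_t volume) {
    size_t i = 0;

    /* Main loop: four samples per iteration, starting at the buffer
     * start, so an aligned buffer stays aligned. */
    for (; i + 4 <= n; i += 4) {
        samples[i]     = scale_one(samples[i],     volume);
        samples[i + 1] = scale_one(samples[i + 1], volume);
        samples[i + 2] = scale_one(samples[i + 2], volume);
        samples[i + 3] = scale_one(samples[i + 3], volume);
    }

    /* Tail: the 0..3 leftover samples; also handles n < 4. */
    for (; i < n; i++)
        samples[i] = scale_one(samples[i], volume);
}
```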

I just ignored the ARM code before since I was always testing with SAMPLES 
1019, thus hitting the worst case runtime-wise

regards, p.

-- 

Peter Meerwald
+43-664-2444418 (mobile)