Hello Arun,

> > > > checking NEON volume_float32ne
> > > > NEON: 10223 usec.
> > > > ref: 46480 usec.
> > > > checking NEON volume_s16ne
> > > > NEON: 8484 usec.
> > > > ARM: 339272 usec.
> > > > ref: 20203 usec.

I was testing with SAMPLES 1019, while you are likely testing with SAMPLES 1022.

Checking ARMv6 svolume (with 1019 samples)
func: 33923743 usec (min = 338868, max = 341919, stddev = 365.753).
orig: 2430664 usec (min = 24261, max = 24445, stddev = 42.2141).

Checking ARMv6 svolume (with 1022 samples)
func: 915036 usec (min = 9094, max = 9338, stddev = 50.2385).
orig: 2437988 usec (min = 24322, max = 24536, stddev = 48.1282).

> > > That's odd indeed. I have this on a Freescale i.mx53 (also Cortex A8)
> > > Checking ARM svolume
> > > func: 905150 usec (min = 9006, max = 9562, stddev = 76.1938).
> > > orig: 2278824 usec (min = 22760, max = 23252, stddev = 65.5575).

I get similar numbers with SAMPLES 1022 on a beagle-xm; I think you'll see catastrophic runtime with SAMPLES 1019.

Comparing ARM vs. NEON code: the svolume s16 NEON code uses two MULs, while ARM can do with one -- the ARM instructions (smulwb, ssat) look ideal for the svolume_s16 code.

Three observations:

(1) when the number of samples is odd, the ARM code processes the first sample before switching to the unrolled 4-samples-at-a-time loop; this causes the samples pointer to become misaligned (2-byte aligned, assuming it was 4-byte aligned initially). I am not sure what guarantees PulseAudio gives on buffer alignment.

(2) the NEON code generally fails when input data length < 4; this can be easily fixed.

(3) neither ARM nor NEON code cares about alignment; just the strategy is different:
    ARM handles cases where length % 4 != 0 first (before entering the unrolled loop), which is bad when the sample buffer is aligned
    NEON takes care of length % 4 != 0 for the last samples, which is good when the sample buffer is aligned

> > # ./cpu-test
> > Running suite(s): CPU
> > CPU flags: V6 V7 VFP EDSP NEON VFPV3 Cortex-A8
> > Initialising ARM optimized volume functions.
> > Checking ARM svolume
> > 0: 1ac8 != 390e (43e9 * 0000d716)
> > Orc not supported. Skipping
> > 50%: Checks: 2, Failures: 1, Errors: 0
> > tests/cpu-test.c:52:F:svolume:svolume_arm_test:0: Failed

> Does this include the little-endianness fix?

My fault; I took the latest source, but failed to make sure that the proper .so was linked -> it works with the little-endian fixes actually deployed.

> My current testing shows NEON svolume code with int16 samples
> consistently slower than the ARM code (tried on the Pandaboard, i.mx51,
> i,mx53, imx.6) by ~10% in most cases.

I agree; I think the ARM code is pretty good for s16 handling, plus I think enabling NEON is expensive power-wise -- so svolume_s16_neon should be dropped and svolume_s16_arm should be improved to handle odd buffers nicely.

I just ignored the ARM code before since I was always testing with SAMPLES 1019, thus hitting the worst case runtime-wise.

regards, p.

-- 
Peter Meerwald
+43-664-2444418 (mobile)
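
The idea of making svolume_s16_arm handle odd buffers nicely could look roughly like the following plain C sketch (not ARM assembly, and not the PulseAudio code). It consumes single samples only until the pointer is 4-byte aligned, runs the unrolled loop on aligned data, and leaves the length % 4 leftovers for the end. The names `svolume_s16_sketch`, `scale` and `sat16`, the single 16.16 fixed-point volume (0x10000 is unity) and the mono layout are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp to the int16 range; stands in for what ssat #16 does. */
static int16_t sat16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t) v;
}

/* Scale one sample by a 16.16 fixed-point volume.
 * Relies on >> of a negative value being an arithmetic shift,
 * which holds on the usual ARM compilers. */
static int16_t scale(int16_t s, int32_t vol) {
    return sat16((int32_t) (((int64_t) s * vol) >> 16));
}

/* Hypothetical alignment-first variant of an s16 volume loop. */
void svolume_s16_sketch(int16_t *samples, int32_t vol, size_t length) {
    assert(samples != NULL || length == 0);

    /* Head: single samples until the pointer is 4-byte aligned
     * (at most one sample for a 2-byte aligned buffer). */
    while (length > 0 && ((uintptr_t) samples & 3) != 0) {
        *samples = scale(*samples, vol);
        samples++;
        length--;
    }

    /* Body: unrolled, 4 samples per iteration, pointer stays aligned. */
    while (length >= 4) {
        samples[0] = scale(samples[0], vol);
        samples[1] = scale(samples[1], vol);
        samples[2] = scale(samples[2], vol);
        samples[3] = scale(samples[3], vol);
        samples += 4;
        length -= 4;
    }

    /* Tail: the remaining length % 4 samples. */
    while (length > 0) {
        *samples = scale(*samples, vol);
        samples++;
        length--;
    }
}
```

With this ordering, a 4-byte aligned buffer of 1019 samples would go straight into the unrolled loop and finish with three tail samples, instead of being pushed off alignment by the first sample. Whether this actually removes the slowdown measured above would need to be checked on the real assembly.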