[PATCH 1/6 v3] core: Initialize ARM NEON code if available

Hello,

> Surprise! I'm reviewing this now. :p

indeed :)

> 1. v3 drops intrinsics in favour of inline asm -- is that for
> performance reasons?

I noticed performance issues with intrinsics on certain compiler versions; 
inline asm offers more control and well-defined output; further, alignment 
annotations are not available with intrinsics -- currently they are not used 
because I'm not sure about the alignment guarantees of certain PA buffers; 
intrinsics could probably be added later if there is enough interest

> 2. In the mono->stereo float case, the Cortex A9 code is actually
> slower. I recall that in a previous thread, we had this sort of
> situation on one of Panda/Beagleboard. Do we need some way to pick and
> choose implementations?

I only have beagleboard-xm and pandaboard available as test platforms 
(Cortex-A8 and A9, resp.)
PATCH 2/6 now tests for A8 vs A9/A15/Axxx and chooses code accordingly
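
To illustrate the idea of choosing code per core, here is a minimal sketch; it is not the code from PATCH 2/6. It assumes the core is identified via the "CPU part" field of /proc/cpuinfo (0xc08 is Cortex-A8, 0xc09 is Cortex-A9). All function names are hypothetical, and the two "tuned" routines are plain C stand-ins for differently scheduled NEON code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical signature for a mono-to-stereo float remap routine. */
typedef void (*mono_to_stereo_fn)(float *dst, const float *src, unsigned n);

/* Plain C reference implementation: duplicate each sample to L and R. */
static void mono_to_stereo_ref(float *dst, const float *src, unsigned n) {
    for (unsigned i = 0; i < n; i++) {
        dst[2 * i] = src[i];
        dst[2 * i + 1] = src[i];
    }
}

/* Stand-ins: real code would contain NEON asm tuned for each core. */
void mono_to_stereo_a8(float *dst, const float *src, unsigned n) {
    mono_to_stereo_ref(dst, src, n);
}

void mono_to_stereo_generic(float *dst, const float *src, unsigned n) {
    mono_to_stereo_ref(dst, src, n);
}

/* Extract the "CPU part" value from cpuinfo text; -1 if not present. */
long parse_cpu_part(const char *cpuinfo) {
    const char *p = strstr(cpuinfo, "CPU part");
    if (!p)
        return -1;
    p = strchr(p, ':');
    if (!p)
        return -1;
    /* Base 0 lets strtol accept the "0x" prefix and skip leading blanks. */
    return strtol(p + 1, NULL, 0);
}

/* Pick the A8 variant for Cortex-A8, the generic one for everything else. */
mono_to_stereo_fn select_mono_to_stereo(long cpu_part) {
    return cpu_part == 0xc08 ? mono_to_stereo_a8 : mono_to_stereo_generic;
}
```

A caller would read /proc/cpuinfo once at startup, pass the text to parse_cpu_part() and store the returned function pointer.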

another issue is benchmarking: relative performance differs depending 
on the length of the buffers processed and on whether they are cached
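
A rough sketch of how the buffer-length dependence could be measured; the function names are made up and the routine under test is a plain C stand-in. Since the same buffers are reused on every iteration, small lengths mostly measure the cached case, while lengths well beyond the cache size approximate the uncached one.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdlib.h>
#include <time.h>

/* Routine under test: plain C mono-to-stereo, standing in for NEON code. */
void mono_to_stereo_c(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[2 * i] = src[i];
        dst[2 * i + 1] = src[i];
    }
}

/* Time `iterations` runs over n samples; returns ns per sample,
 * or -1.0 on invalid arguments or allocation failure. */
double bench_ns_per_sample(size_t n, unsigned iterations) {
    struct timespec t0, t1;
    volatile float sink = 0.0f; /* keeps the compiler from dropping the loop */
    float *src, *dst;

    if (n == 0 || iterations == 0)
        return -1.0;

    src = malloc(n * sizeof *src);
    dst = malloc(2 * n * sizeof *dst);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }

    for (size_t i = 0; i < n; i++)
        src[i] = (float) i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned it = 0; it < iterations; it++) {
        mono_to_stereo_c(dst, src, n);
        sink += dst[2 * n - 1];
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free(src);
    free(dst);

    double ns = (double) (t1.tv_sec - t0.tv_sec) * 1e9
              + (double) (t1.tv_nsec - t0.tv_nsec);
    return ns / ((double) n * (double) iterations);
}
```

Calling bench_ns_per_sample() for a range of lengths (say 64, 4096 and 1048576 samples) and comparing the per-sample figures shows where one implementation overtakes another.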

my target task involves stereo recording, resampling, int/float 
conversion, stereo-to-mono and mono-to-stereo mapping and I am seeing good 
speedups on both beagle- and pandaboard
 
I need to check the downmix to mono behaviour after 
ff4af902cf4ac07c5f1da3b6dacbb3195c7c222d
    resampler: Fix volume on downmix to mono

> 3. How shall we go about enabling this code? Have a configure time check
> for some instructions that are needed, build it in if available, and
> then run-time detection should pick the right code path?

I'd suggest modelling this after bluetooth/sbc: always compile the *_neon.c 
files but only activate the NEON code if defined(__ARM_NEON__)

the disadvantage is that we cannot have a common executable for NEON and 
non-NEON ARM CPUs -- I don't think this is a big constraint
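
The compile-time gating described above could look roughly like the following sketch. The function and type names are hypothetical and the NEON routine is a plain C stand-in; only the defined(__ARM_NEON__) check is taken from the suggestion itself. The file always compiles, but the init function only installs anything when the compiler targets NEON.

```c
#include <stddef.h>

/* Hypothetical signature for a mono-to-stereo float remap routine. */
typedef void (*mono_to_stereo_fn)(float *dst, const float *src, size_t n);

/* Generic C fallback, always available. */
void mono_to_stereo_c(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[2 * i] = src[i];
        dst[2 * i + 1] = src[i];
    }
}

#if defined(__ARM_NEON__)
/* Stand-in: the real routine would be written in NEON inline asm. */
static void mono_to_stereo_neon(float *dst, const float *src, size_t n) {
    mono_to_stereo_c(dst, src, n);
}
#endif

/* Install the NEON routine into *slot if this build supports NEON.
 * Returns 1 if installed, 0 if *slot was left untouched. */
int remap_init_neon(mono_to_stereo_fn *slot) {
#if defined(__ARM_NEON__)
    *slot = mono_to_stereo_neon;
    return 1;
#else
    (void) slot; /* built without NEON: nothing to install */
    return 0;
#endif
}
```

The caller initializes the slot with mono_to_stereo_c and then calls remap_init_neon(); on a non-NEON build the call is a no-op, which is why one binary cannot serve both kinds of CPU.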

Remi Denis-Courmont suggests using .s assembler files to overcome this 
issue; this would necessitate some configure options as well

interestingly, on x86/AMD64 gcc can emit MMX/SSE code in inline asm even 
when the compiler itself is not enabled to generate such instructions -- 
hence no .s files in PA so far

at runtime there is already an env. var PULSE_NO_SIMD to disable the 
optimized code paths; further, the output of /proc/cpuinfo is parsed to see 
if NEON is available (kind of pointless since it is a compile-time decision)
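
For reference, the two runtime checks mentioned above can be sketched as follows. This is not PA's actual code; the function names are made up, and it assumes that merely setting PULSE_NO_SIMD (to any value) disables the optimized paths and that NEON shows up as the word "neon" on the "Features" line of /proc/cpuinfo.

```c
#include <stdlib.h>
#include <string.h>

/* Return 1 if the "Features" line of the cpuinfo text lists "neon"
 * as a whole word, 0 otherwise. */
int cpuinfo_has_neon(const char *cpuinfo) {
    const char *line = strstr(cpuinfo, "Features");
    const char *end, *p;

    if (!line)
        return 0;
    end = strchr(line, '\n');
    if (!end)
        end = line + strlen(line);
    p = strchr(line, ':');
    if (!p || p > end)
        return 0;
    p++;

    /* Walk the blank-separated flags on this line only. */
    while (p < end) {
        const char *tok;
        while (p < end && (*p == ' ' || *p == '\t'))
            p++;
        tok = p;
        while (p < end && *p != ' ' && *p != '\t')
            p++;
        if (p - tok == 4 && memcmp(tok, "neon", 4) == 0)
            return 1;
    }
    return 0;
}

/* Assumption: any value of PULSE_NO_SIMD disables SIMD code. */
int simd_disabled_by_env(void) {
    return getenv("PULSE_NO_SIMD") != NULL;
}

/* Combine both checks: use NEON only if present and not disabled. */
int use_neon(const char *cpuinfo) {
    return !simd_disabled_by_env() && cpuinfo_has_neon(cpuinfo);
}
```

In a build that is gated on __ARM_NEON__ the cpuinfo check can only confirm what the compiler flags already decided, which is the redundancy noted above.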

> I'll take a closer look at things, run some tests, and start pushing
> this work. I'll also be moving all the test code to src/tests/cpu-test.c
> where the x86 tests have been consolidated, so running tests on
> different boards should become a lot less painful.

thank you for the effort; let me know if there are questions!
tests are not straightforward in some cases as the actual implementation 
is not exported

orc is broken on NEON, the loadpq is not supported

thanks, p.

-- 

Peter Meerwald
+43-664-2444418 (mobile)

