> > - performance degradation on Cortex-A9 / pandaboard for remap: NEON is
> > fast on Cortex-A8 but slow on A9; need to distinguish

> Does it really degrade? Compared to C code? That seems surprising.

the problem is just one particular, very simple workload: mono_to_stereo
remapping of floats; basically, you get wxyz and output wwxxyyzz
(w..z are audio samples stored as float)

on A8 I suggest the following (for 4 samples):

    vld1.32 {q0}, [%[src]]!
    vmov q1, q0
    vst2.32 {q0,q1}, [%[dst]]!

on A9 I suggest the following (for 2 samples):

    ldm %[src]!, {r4,r6}
    mov r5, r4
    mov r7, r6
    stm %[dst]!, {r4-r7}

the compiler generates something like the following (for 1 sample), which
is pretty close to the A9 code above performance-wise (but sucks on A8):

    ldr r3, [%[src]]!
    str r3, [%[dst], #0]
    str r3, [%[dst], #4]

all other NEON optimizations are better than plain C code (compiled with
gcc 4.6.3), even on A9

I will provide microbenchmarks on A8/A9 when submitting the patches

> I read (on android-ndk) that the speedup through NEON is a lot smaller on A9
> (60% vs 10% in one scenario), but it's still a speedup.
> This is a part of that conversation:

thank you for the pointer; those are general statements I tend to agree with

regards, p.

-- 
Peter Meerwald
+43-664-2444418 (mobile)