On Thu, 2010-10-28 at 01:47 +0100, Arun Raghavan wrote:
> On Wed, 2010-10-27 at 15:14 -0500, pl bossart wrote:
> > > I've been doing some work optimising the software volume scaling code,
> > > and along with my previous changes to decrease the maximum volume to
> > > 2^31-1, there seems to be a pretty good performance increase (almost 2x
> > > on my Core2 processor).
> >
> > Are you saying you have a 2x performance gain over sse assembly? That
> > would most likely mean we need to fix the assembly for x86 and have an
> > even better performance than with orc and its intermediate step of
> > SIMD code generation...
>
> That is what I got even when I replaced the 32x16-bit volume
> multiplication code with the same logic that I'm using in Orc. I don't

I forgot to mention that even the Orc MMX backend provides the same kind
of perf gain over the current hand-rolled code (I didn't try to rewrite
that like I did the SSE).

Also, we don't have any NEON optimisations for the s/w volume stuff in
PA, so the Orc NEON backend might be interesting to try there. I don't
know if there's a SIMD version of the ARM 32x16-bit mul, but if there
is, it's possible to get Orc to use that as well.

Cheers,
Arun
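
For anyone following along who hasn't looked at the 32x16-bit multiply being discussed, here is a rough scalar sketch in C of the general idea. It is not the PulseAudio or Orc code. It assumes the volume is a 32-bit value in 16.16 fixed point (0x10000 meaning unity gain) and that samples are signed 16-bit. The function names `scale_s16` and `scale_buffer_s16` are made up for illustration. The point is that splitting the volume into two 16-bit halves lets the whole operation be built from 16x16-bit products, which is the shape SIMD instruction sets tend to offer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from PulseAudio or Orc.
 * Scales one signed 16-bit sample by a 32-bit volume, assumed here to be
 * 16.16 fixed point (0x10000 == unity gain, at most 2^31-1). */
static int16_t scale_s16(int16_t sample, uint32_t volume)
{
    int32_t vol_hi = (int32_t)(volume >> 16);     /* integer part */
    int32_t vol_lo = (int32_t)(volume & 0xffff);  /* fractional part, 0..65535 */

    /* (sample * volume) >> 16 rewritten as two 16x16-bit products:
     *   sample * vol_hi + ((sample * vol_lo) >> 16)
     * Both products fit in 32 bits for the ranges above.
     * Assumption: >> on a negative int32_t is an arithmetic shift, which is
     * implementation-defined in C but true on common compilers. */
    int32_t scaled = sample * vol_hi + ((sample * vol_lo) >> 16);

    /* Saturate to the 16-bit sample range. */
    if (scaled > INT16_MAX)
        scaled = INT16_MAX;
    if (scaled < INT16_MIN)
        scaled = INT16_MIN;

    return (int16_t)scaled;
}

/* Hypothetical helper: apply one volume to a buffer of samples in place. */
static void scale_buffer_s16(int16_t *samples, size_t n, uint32_t volume)
{
    for (size_t i = 0; i < n; i++)
        samples[i] = scale_s16(samples[i], volume);
}
```

A vectorised version would do the same two multiplies per lane (a high-half multiply for the fractional part and a low-half multiply for the integer part) followed by a saturating pack, but the exact instructions depend on the backend.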