Re: [PATCH v2] Add iwmmxt optimization for sbc for pxa series cpu

Siarhei Siamashka <siarhei.siamashka@xxxxxxxxx> · Fri, 12 Nov 2010 15:22:44 +0200

On Friday 12 November 2010 09:35:04 Keith Mok wrote:
> > Did you run some benchmarks with these optimizations to measure how much
> > they are helping?
> 
> Tested on Marvell PXA platform.
> == Before ==
> $ time ./sbcenc   -b53 -s8 -j  c.au  > /dev/null
> real    0m 0.41s
> user    0m 0.40s
> sys     0m 0.00s
> 
> == After ==
> $ time ./sbcenc   -b53 -s8 -j  c.au  > /dev/null
> real    0m 0.19s
> user    0m 0.17s
> sys     0m 0.02s

Thanks, this looks consistent with the results of optimizations on other
platforms, where performance roughly doubles after adding SIMD
optimizations to the sbc analysis filter.

But maybe it's better to use a somewhat bigger test file, so that the total
time increases to at least several seconds. With such short run times, it's
hard to say whether a difference is an actual improvement or random noise.
That may be ok for such a huge performance improvement, but with less
significant optimizations the precision of the measurements may become a
problem.
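
The point about short run times can be illustrated with a minimal timing
sketch. It is not part of the patch or of sbcenc: `workload`,
`mean_seconds_per_call` and the `min_seconds` parameter are hypothetical names
introduced here, and the loop body is just a stand-in for the code being
measured. The idea is to repeat the work until the total elapsed time is long
enough, then report the mean time per call.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime() in strict ISO modes */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the code under test (e.g. one encoder pass). */
static volatile unsigned long sink;

static void workload(void)
{
    unsigned long acc = 0;
    unsigned long i;

    for (i = 0; i < 100000UL; i++)
        acc += i * i;
    sink = acc;  /* keep the loop from being optimized away */
}

/* Monotonic wall-clock time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/*
 * Call fn repeatedly until at least min_seconds have elapsed, then return
 * the mean time per call. The number of calls is stored in *calls_out if
 * calls_out is not NULL.
 */
static double mean_seconds_per_call(void (*fn)(void), double min_seconds,
                                    unsigned long *calls_out)
{
    unsigned long calls = 0;
    double start, elapsed;

    assert(fn != NULL && min_seconds > 0.0);

    start = now_seconds();
    do {
        fn();
        calls++;
        elapsed = now_seconds() - start;
    } while (elapsed < min_seconds);

    if (calls_out != NULL)
        *calls_out = calls;
    return elapsed / (double)calls;
}
```

With `min_seconds` set to several seconds, the timer resolution and scheduling
jitter are spread over many calls, which is the same effect as feeding sbcenc
a bigger input file.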

Also, do you have oprofile available on the PXA platform? It may provide nice
statistics about which functions are used and where the performance hot spots
are.

> > Using back-to-back WLDRD instructions has some performance penalty
> 
> I rearrange the instructions and keep the original one as for reference in
> the block that comment out. Since the code is really difficult to read
> after interleaved.

Thanks, this looks like it really should run quite a bit faster than the
previous variant (based on my understanding of the Intel PDF files).

I sometimes use different indentation levels in such cases to keep the code
readable after instruction reordering: each logically independent block of
code gets its own indentation level, so it stays easy to follow after the
instructions are interleaved. For example, with the original code:

A1
A2
A3
A4
B1
B2
B3
B4

If the instructions need to be reordered in order to improve scheduling for the
cpu pipeline, then for example

A1
A2
  B1
A3
  B2
A4
  B3
  B4

looks much more readable to me than

A1
A2
B1
A3
B2
A4
B3
B4

With different indentation levels, one can still see the flow of instructions
as independent streams. If different levels of indentation in inline assembly
pass the coding style check of the checkpatch.pl script, then it should be
fine.
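
To make the indentation convention concrete in compilable form, here is a
small sketch in plain C. It is not iwmmxt assembly and is not taken from the
patch: the function `scale_two`, its parameters and the constants are all
hypothetical. Stream A stays at the base indentation level and stream B is
indented by two extra spaces, mirroring the A1..A4 / B1..B4 layout.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Two logically independent streams, interleaved by hand.
 * Stream A: base indentation. Stream B: two extra spaces.
 */
static void scale_two(const int16_t *in_a, const int16_t *in_b,
                      int32_t *out_a, int32_t *out_b)
{
    int32_t a, b;

    assert(in_a && in_b && out_a && out_b);

    a = in_a[0];        /* A1: load   */
    a *= 3;             /* A2: scale  */
      b = in_b[0];      /* B1: load   */
    a += 1;             /* A3: bias   */
      b *= 5;           /* B2: scale  */
    *out_a = a;         /* A4: store  */
      b += 2;           /* B3: bias   */
      *out_b = b;       /* B4: store  */
}
```

In C the compiler is free to reorder these statements anyway, so the layout
only matters for the reader here. The convention pays off in hand-scheduled
inline assembly, where the written order is the execution order.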

Also I'm quite curious whether better instruction scheduling provides any
clear improvement, so some numbers comparing the older and newer
implementations would be appreciated. I did not suggest that just for
entertainment purposes ;) It really should provide some practical benefit.

If you have time and want to make such a test, iwmmxt intrinsics could also be
tried, so that instruction scheduling and register allocation become the
responsibility of the compiler. My previous experiments with arm neon
intrinsics showed that the compiler does a very poor job and can't be trusted
to generate fast code, but maybe iwmmxt is different or gcc has improved
since then.

> > The MMX code was using PCMPGTD and the other instructions just because
> > MMX instruction set is very limited and did not have the needed
> > instructions. But you can use WABS and WMAX instructions to do this job
> > better. You can refer to the original C code and also to ARM NEON
> > optimizations to get some ideas about how to do this operation faster.
> 
> Changed as suggested.
> But got a question that the __IWMMXT__ builtin gcc definition is not a
> reliable way to
> determine whether mcpu=iwmmxt2 is turned on or not. It will break when
> compile under pxa270
> which does not support wabs with just mcpu=iwmmx on.

Well, as I said before, I'm not familiar with iwmmxt and the pxa platform. And
I did not notice that there are actually several revisions of the iwmmxt isa,
my bad. So it looks like iwmmxt1 is just as restrictive as the original mmx,
and the direct conversion from mmx like you did before may be the right thing.
For arm neon optimizations, using vector ABS/MAX instructions gave only about
a 1% overall performance improvement. Not much, but every little bit helps.
And if for iwmmxt it causes such backwards compatibility issues, then it might
not be worth it. It's up to you to decide.
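
For reference, the kind of operation that vector ABS/MAX instructions map onto
can be sketched in portable C as a per-lane maximum of absolute values. This
is only an illustration under my own assumptions, not the actual
sbc_calc_scalefactors code: the function name `max_abs_per_lane` and the
flat `blocks x lanes` sample layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only: for each lane, compute the largest magnitude seen
 * across all blocks. samples is laid out as samples[block * lanes + lane].
 */
static void max_abs_per_lane(const int32_t *samples, size_t blocks,
                             size_t lanes, uint32_t *out)
{
    size_t b, l;

    assert(samples != NULL && out != NULL && lanes > 0);

    for (l = 0; l < lanes; l++)
        out[l] = 0;

    for (b = 0; b < blocks; b++) {
        for (l = 0; l < lanes; l++) {
            int32_t v = samples[b * lanes + l];
            /* "ABS" step; unsigned negation is safe even for INT32_MIN */
            uint32_t mag = v < 0 ? 0u - (uint32_t)v : (uint32_t)v;

            /* "MAX" step */
            if (mag > out[l])
                out[l] = mag;
        }
    }
}
```

A SIMD version would process several lanes per instruction, with one vector
ABS and one vector MAX replacing the compare-and-select sequence that a more
limited instruction set needs.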

I would still suggest initially having just the optimizations for
sbc_analyze_four_iwmmxt/sbc_analyze_eight_iwmmxt in the first patch (or maybe
in two patches), and then adding the optimization for sbc_calc_scalefactors in
a separate patch later.

Regarding the benchmarks and function usage:
1. sbc_analyze_four_iwmmxt is important for the 4 subbands case ('-s4' option
   for sbcenc)
2. sbc_analyze_eight_iwmmxt is important for the 8 subbands case ('-s8' option
   for sbcenc)
3. sbc_calc_scalefactors is important either for mono audio, or when joint
   stereo is *not* used (sbcenc is run without the '-j' option).

Each of these is better benchmarked/tested separately.

-- 
Best regards,
Siarhei Siamashka