On Friday 12 November 2010 09:35:04 Keith Mok wrote: > > Did you run some benchmarks with these optimizations to measure how much > > they are helping? > > Tested on Marvell PXA platform. > == Before == > $ time ./sbcenc -b53 -s8 -j c.au > /dev/null > real 0m 0.41s > user 0m 0.40s > sys 0m 0.00s > > == After == > $ time ./sbcenc -b53 -s8 -j c.au > /dev/null > real 0m 0.19s > user 0m 0.17s > sys 0m 0.02s Thanks, this looks consistent with the results of optimizations on the other platforms where the performance increases roughly twice after adding SIMD optimizations to the sbc analysis filter. But maybe it's better to use a bit bigger test file, so that the total time increases to at least several seconds. With very small times, it's hard to say whether it is an actual improvement or random noise. It may be ok for such a huge performance improvement, but with less significant optimizations the precision of measurements may become a problem. Also do you have oprofile available on PXA platform? It may provide a nice statistics about what functions are used and are the performance hot spots. > > Using back-to-back WLDRD instructions has some performance penalty > > I rearrange the instructions and keep the original one as for reference in > the block that comment out. Since the code is really difficult to read > after interleaved. Thanks, this looks like it really should run quite a bit faster than the previous variant (based on my understanding of intel pdf files). I sometimes use different indentation levels in such cases in order to improve readability after instructions reordering, so that each logically independent block of code has its own indentation level and it is still easily visible after instructions reordering. For example, with the original code: A1 A2 A3 A4 B1 B2 B3 B4 If the instructions need to be reordered in order to improve scheduling for the cpu pipeline, then for example A1 A2 B1 A3 B2 A4 B3 B4 looks much more readable to me than A1 A2 B1 A3 B2 A4 B3 B4 With different indentation levels, one can still see the flow of instructions as independent streams. If different levels of indentation in inline assembly pass coding style test by checkpatch.pl script, then it should be fine. Also I'm quite curious whether better instructions scheduling provide any clear improvement, so some numbers comparing older and newer implementation would be appreciated. I did not suggest that just for entertainment purposes ;) It really should provide some practical benefit. If you have time and want to make such a test, iwmmxt intrinsics could be also tried, so that instructions scheduling and registers allocation becomes a responsibility of the compiler. But my previous experiments with arm neon intrinsics showed that the compiler does a very poor job and can't be trusted to generate fast code. But maybe iwmmxt could be different or gcc could have improved since than. > > The MMX code was using PCMPGTD and the other instructions just because > > MMX instruction set is very limited and did not have the needed > > instructions. But you can use WABS and WMAX instructions to do this job > > better. You can refer to the original C code and also to ARM NEON > > optimizations to get some ideas about how to do this operation faster. > > Changed as suggested. > But got a question that the __IWMMXT__ builtin gcc definition is not a > reliable way to > determine whether mcpu=iwmmxt2 is turned on or not. It will break when > compile under pxa270 > which does not support wabs with just mcpu=iwmmx on. Well, as I said before, I'm not familiar with iwmmxt and pxa platform. And I did not notice that there are actually several revisions of iwmmxt isa, my bad. So looks like iwmmxt1 is just as restrictive as the original mmx and the direct conversion from mmx like you did before may be the right thing. For arm neon optimizations, the effect of using vector ABS/MAX instructions was just about 1% of overall performance improvement. Not so much, but every little bit helps. And if for iwmmxt it causes such backwards compatibility issues, then it might be not worth it. It's up to you to decide. I would still suggest to initially have just optimizations for sbc_analyze_four_iwmmxt/sbc_analyze_eight_iwmmxt in the first patch (or maybe in two patches). And then add optimization for sbc_calc_scalefactors in a separate patch later. Regarding the benchmarks and functions usage: 1. sbc_analyze_four_iwmmxt is important for 4 subbands case ('-s4' option for sbcenc) 2. sbc_analyze_eight_iwmmxt is important for 8 subbands case ('-s8' option for sbcenc) 3. sbc_calc_scalefactors is important for either mono audio, or when joint stereo is *not* used (sbcenc is run without '-j' option). All of this is better to be benchmarked/tested separately. -- Best regards, Siarhei Siamashka
Attachment:
signature.asc
Description: This is a digitally signed message part.