Re : Effects of Clock Resolution on Pulseaudio

rextanka@xxxxxxxxxxx (Nick Thompson) · Wed, 30 Jul 2008 15:00:30 -0700

On Jul 30, 2008, at 12:08 PM, Lennart Poettering wrote:

> On Tue, 22.07.08 17:13, Nick Thompson (rextanka at comcast.net) wrote:
>
>>>> So I am not sure what part of Pulseaudio is causing high CPU
>>>> utilization ...
>>>
>>> Hmm, could you do some profiling then? Just the most basic, i.e.
>>> which functions take up most CPU.
>>
>> I'll add more detail in a bit when we get to 0.9.10, h  
>> pa_volume_memchunk
>> seems to be a big hitter on arm systems:
>
> Hmm, that function is not optimized in any way, but looking at its
> source it doesn't appear that slow to me either. For each sample we do
> one multiplication and one shift, we apply saturation, and then we
> increment/decrement pointers with wraparound. That shouldn't be that
> bad. Also, this code goes through all samples once, linearly, which
> should minimize the influence of the cache.

Yeah, the problem seems to be that ARM has a limited number of
registers and gcc does not deal well with monolithic code on it, whereas
x86 has no trouble with a large case statement containing a number of
loops. A look at the gcc output showed a number of load instructions
inside the loop, which are very expensive on ARM (3 cycles each).
A co-worker has been working on some ARM assembly and on factoring the
loops out into separate functions, and the net result is that we now
see 4-6% total, which is much better. We need to look at the mix and
rate-convert cases too. The mix is more complicated, but we should see
something similar there.
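
To illustrate the "factor the loops out" idea, here is a minimal sketch in
plain C. It is not the actual patch and not PulseAudio's real code: the
names (sample_fmt, scale_s16ne, scale_u8, scale_volume) and the 16.16
fixed-point volume with 0x10000 as unity gain are assumptions made for the
example. The point is only that each format gets its own small function
with a tight loop, and the switch does nothing but dispatch, which should
give the compiler a better chance of keeping the loop state in registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical format tags for this sketch; not PulseAudio's enum. */
typedef enum { FMT_S16NE, FMT_U8 } sample_fmt;

/* Saturate a 32-bit intermediate to the range [lo, hi]. */
static inline int32_t clamp_i32(int32_t v, int32_t lo, int32_t hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Native-endian signed 16-bit samples, interleaved by channel.
 * vol[] holds one 16.16 fixed-point gain per channel (assumption). */
void scale_s16ne(int16_t *s, size_t n, const int32_t *vol, unsigned channels) {
    unsigned ch = 0;
    assert(channels > 0);
    for (size_t i = 0; i < n; i++) {
        /* One multiply, one scale-down, one saturation per sample. */
        int64_t t = (int64_t) s[i] * vol[ch];
        s[i] = (int16_t) clamp_i32((int32_t) (t / 65536), INT16_MIN, INT16_MAX);
        if (++ch == channels)   /* wrap the channel index around */
            ch = 0;
    }
}

/* Unsigned 8-bit samples, centred on 0x80. */
void scale_u8(uint8_t *s, size_t n, const int32_t *vol, unsigned channels) {
    unsigned ch = 0;
    assert(channels > 0);
    for (size_t i = 0; i < n; i++) {
        int64_t t = ((int64_t) s[i] - 0x80) * vol[ch];
        s[i] = (uint8_t) clamp_i32((int32_t) (t / 65536) + 0x80, 0, 255);
        if (++ch == channels)
            ch = 0;
    }
}

/* The dispatcher only selects a loop; no loop body lives in the switch.
 * n is the number of samples. Returns 0 on success, -1 if the format
 * is not handled. */
int scale_volume(void *data, size_t n, sample_fmt fmt,
                 const int32_t *vol, unsigned channels) {
    switch (fmt) {
    case FMT_S16NE:
        scale_s16ne((int16_t *) data, n, vol, channels);
        return 0;
    case FMT_U8:
        scale_u8((uint8_t *) data, n, vol, channels);
        return 0;
    }
    return -1;
}
```

The 64-bit intermediate is there for correctness with gains above unity;
a hand-tuned ARM version would presumably restrict the gain range or use
a widening multiply instead.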

Also, I want to look at Kevin's emails and see if we can build on that.
It would be good to get that working on a couple of systems.

Patches will be forthcoming for this. However, we are still on 0.9.8.
I am hoping we'll have made the move to 0.9.10 this week, so we should
be able to send you stuff in a couple of weeks, once we are happy that
it is well tested. The patches will be against 0.9.10, so help in
getting them merged with the latest would be handy, though I'd suspect
these routines are not much in flux at the moment?

> I assume the data processed is S16NE and the CPU is LE?

Yup.  That's the path we started with the optimization.

> Hmm, can you figure out in which context this is called that often? (I
> mean, pa's audio memory management should be mostly zero-copy, so
> having such a big hit on memcpy here is surprising to me.)
>
>>> 277       3.5951  libm-2.5.so              __adddf3
>
> This is interesting, could you figure out the context?

Yup. I need to patch our kernel again to get call traces with oprofile
on my device, so hopefully I can find some time to get context for this.

With regard to the vectorization stuff, that can be used, although it
would make the ARM code very specific to a certain subset of ARM
implementations. It raises a philosophical question: I'd suspect a
generic ARM implementation is the better open source solution. Having
optimized cases for Cortex-A8/NEON processors would be useful, but it
would add to the build complexity and could be frustrating for someone
with a different ARM processor. I'm not sure I understand open source
well enough to decide this. That being said, we'll probably optimize
for our case, and any patches will likely be somewhat system dependent.
Worst case, it gives someone else something to build on, I guess.

Regards

Nick