On Jul 30, 2008, at 12:08 PM, Lennart Poettering wrote:

> On Tue, 22.07.08 17:13, Nick Thompson (rextanka at comcast.net) wrote:
>
>>>> So i am not sure what part of Pulseaudio is causing high CPU
>>>> Utilization
>>>> ..
>>>
>>> Hmm, could you do some profiling then? Just the most basic. I.e.
>>> what functions take up most CPU.
>>
>> I'll add more detail in a bit when we get to 0.9.10, h
>> pa_volume_memchunk
>> seems to be a big hitter on arm systems:
>
> Hmm, that function is not optimized in any way, but if I look on its
> sources doesn't appear that slow to me either. For each sample we do
> one multiplication, one shifting, we apply saturation and then we
> increase/decrease pointers with wrap around. That shouldn't be that
> bad. Also, this code goes once linearly through all samples, which
> should minimize influence of the cache.

Yeah, the problem seems to be that ARM has a limited number of
registers and gcc does not deal with monolithic code that well,
whereas x86 has no issues dealing with a large case statement with a
number of loops in it. A look at the gcc output showed a number of
load instructions inside the loop, which are very expensive on ARM
(3 cycles).

A co-worker has been working on some ARM assembly and on factoring
the loops out into separate functions, and the net result is that we
now see 4-6% total, which is much better. We need to look at the mix
and rate-convert cases too; the mix is more complicated, but we should
see something similar there. I also want to look at Kevin's emails and
see if we can build on that. It would be good to get that working on a
couple of systems.

Patches will be forthcoming for this. However, we are still on 0.9.8.
I am hoping we'll have made the move to 0.9.10 this week, so I hope we
can send you stuff in a couple of weeks, once we are happy that it is
well tested.
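To illustrate the per-sample work discussed above (one multiply, one
shift, saturation, channel index with wrap-around), here is a rough
sketch of the S16NE case factored out into its own small function, so
the compiler only has one tight loop to allocate registers for. It is
not the PulseAudio source and not the patch mentioned here: the name
scale_s16ne, the 16.16 fixed-point gain format (0x10000 == unity) and
the argument layout are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a PulseAudio API): scale interleaved signed
 * 16-bit native-endian samples in place.
 *
 * volumes[] holds one gain per channel in 16.16 fixed point, where
 * 0x10000 means unity. This format is an assumption for the sketch.
 */
static void scale_s16ne(int16_t *samples, size_t n_samples,
                        const int32_t *volumes, unsigned n_channels)
{
    unsigned channel = 0;
    size_t i;

    if (n_channels == 0)
        return;

    for (i = 0; i < n_samples; i++) {
        /* One multiplication, done in 64 bits so gains above unity
         * cannot overflow the intermediate value. */
        int64_t t = (int64_t) samples[i] * volumes[channel];

        /* Drop the 16 fractional bits. A division is used instead of
         * a right shift because shifting a negative value is
         * implementation-defined in C. */
        t /= 65536;

        /* Saturate to the int16_t range. */
        if (t > INT16_MAX)
            t = INT16_MAX;
        else if (t < INT16_MIN)
            t = INT16_MIN;

        samples[i] = (int16_t) t;

        /* Advance the channel index with wrap-around. */
        if (++channel >= n_channels)
            channel = 0;
    }
}
```

The idea would be to have one such function per sample format and let
the large switch only pick which one to call, instead of holding every
loop in a single function body. Whether that alone removes the extra
loads on a given ARM core would need to be checked in the compiler
output.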
The patches will be for 0.9.10, so help in getting that merged with
the latest would be handy, though I'd suspect these routines are not
much in flux at the moment?

> I assume the data processes is S16NE and the CPU is LE?

Yup. That's the path we started the optimization with.

> Hmm, can you figure out in which context this is called that often? (i
> mean, pa's audio memory management should be mostly zero-copy, so
> having such a big hit on memcpy here is surprising to me.
>
>>> 277 3.5951 libm-2.5.so __adddf3
>
> This is interesting, could you figure out the context?

Yup. I need to patch our kernel again to get call traces with oprofile
on my device, so hopefully I can find some time to get the context for
this.

With regard to the vectorization stuff, that can be used, although it
would make the ARM code very specific to a certain subset of ARM
implementations. It raises a philosophical question: I'd suspect a
generic ARM implementation is the better open source solution. Having
optimized cases for Cortex-A8/NEON processors would be useful, but it
would add to the build complexity and could be frustrating for someone
with a different ARM processor. I'm not sure I understand open source
well enough to decide this.

That being said, we'll probably optimize for our case, so any patches
will likely be somewhat system dependent. Worst case, it gives someone
else something to build on, I guess.

Regards
Nick