On 4/9/07, Sampo Savolainen <v2@xxxxxx> wrote:
On Mon, 2007-04-09 at 16:26 +0200, Dragan Noveski wrote:
> Mike Taht wrote:
> > does jack say it's running SSE on startup?
> now, since i recompiled with --enable-dynsimd it says:
>
> ...
> JACK tmpdir identified as [/dev/shm/]
> SSE2 detected
> load = 0.2297 max usecs: 40.000, spare = 10626.000
> ...
>
> looks like a nice hint??
>
> what is about processor type and architecture?
> are there more hints about optimizing?
I'm kind of writing this as a general note to other audio authors.
One big win on large data sets (lots of tracks) that overrun your available cache is to cache-align the floating point data rather than just SIMD-align it. In several of my oprofile runs and test cases with ardour + jack where I was regularly overrunning the L2 cache, cache alignment sped up the SIMD calculations by 9-40%.
On a Linux system, to find the optimal alignment for your data, do a cat /proc/cpuinfo | grep cache_alignment
and then go looking through whatever code you are running for calls to "posix_memalign". For example, jack uses a default alignment of 16, when 64 or 128 would be more appropriate for modern processors.
Cache alignment is an easy test. I hope more people try it.
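To make the alignment point concrete, here is a minimal sketch of allocating an audio buffer on a cache line boundary instead of a 16-byte SIMD boundary. The helper names cache_line_size and alloc_audio_buffer are made up for illustration and are not taken from jack or ardour. The sysconf name _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension, so the code falls back to an assumed 64 bytes when it is missing or reports nothing useful.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* for posix_memalign */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: best guess at the cache line size in bytes. */
size_t cache_line_size(void)
{
#ifdef _SC_LEVEL1_DCACHE_LINESIZE   /* glibc extension, may be absent */
    long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    /* posix_memalign wants a power of two that is a multiple of sizeof(void *) */
    if (n > 0 && (n & (n - 1)) == 0 && (size_t)n >= sizeof(void *))
        return (size_t)n;
#endif
    return 64;   /* assumed fallback; compare with cache_alignment in /proc/cpuinfo */
}

/* Hypothetical helper: allocate nframes floats starting on a cache line. */
float *alloc_audio_buffer(size_t nframes)
{
    void *p = NULL;
    if (posix_memalign(&p, cache_line_size(), nframes * sizeof(float)) != 0)
        return NULL;
    return p;   /* release with free() */
}
```

A cache-line-aligned buffer is automatically 16-byte aligned as well, so aligned SSE loads keep working on it.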
Enabling SSE throughout the project via compiler flags makes the
resulting code /depend/ on SSE. In other words, running that on a
platform with no SSE will result in "Illegal instruction (core dumped)".
At present the major bottlenecks in ardour and jackd have all been reduced considerably by extensively oprofiling the hot spots and replacing them, where possible, with SSE. In typical situations the SSE routines are still near the top of the runtime for those two programs. Without SSE, the "normal" equivalents are in general responsible for 3-8x more of the total runtime. Recent example: SSE optimizations to ardour cut total CPU usage for an extreme test case (116 tracks, 40+ busses) from 77% down to 35% in the 64 samples/period case, and from 35% to 12% in the 1024 samples/period case.
SSE routines could be used to speed up graphics as well, particularly in RGBA situations.
Linux's OProfile subsystem is wonderful, as it has low overhead and can run on SMP realtime kernels.
At present the major bottlenecks left for ardour and jack are very much down in the noise floor. A typical user now spends more CPU time in plugins than in those two core programs. (Thus I've been oprofiling a few plugins heavily. Among other things, I now have an SSE'd comb_run routine, and hopefully will announce some sped-up plugins soon.)
In projects like jackd and Ardour there are places which can be improved
vastly via SSE code. Creating a framework which can enable SSE etc.
according to the platform the binary runs on makes it possible for
distributions to include optimized versions of the software which will
work on any x86 platform.
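As a rough illustration of such a framework (this is not the actual jackd or Ardour code), the sketch below picks an implementation once at startup and stores it in a function pointer, so one binary runs on CPUs with and without SSE. It assumes a reasonably recent GCC or Clang: __builtin_cpu_supports and the target attribute are compiler extensions that did not exist in gcc 4.1. All function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <xmmintrin.h>
#define HAVE_X86_DISPATCH 1
#endif

typedef void (*apply_gain_fn)(float *buf, size_t n, float gain);

/* Portable fallback that runs on any CPU. */
void apply_gain_scalar(float *buf, size_t n, float gain)
{
    for (size_t i = 0; i < n; i++)
        buf[i] *= gain;
}

#ifdef HAVE_X86_DISPATCH
/* Only this one function is compiled with SSE enabled. */
__attribute__((target("sse")))
void apply_gain_sse(float *buf, size_t n, float gain)
{
    const __m128 g = _mm_set1_ps(gain);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(buf + i, _mm_mul_ps(_mm_loadu_ps(buf + i), g));
    for (; i < n; i++)   /* leftover samples */
        buf[i] *= gain;
}
#endif

/* Starts out pointing at the safe version. */
apply_gain_fn apply_gain = apply_gain_scalar;

/* Call once at startup, before the audio thread runs. */
void init_dsp_dispatch(void)
{
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse"))
        apply_gain = apply_gain_sse;
#endif
}
```

The CPU test happens once in init_dsp_dispatch, so the per-period cost is an indirect call rather than a feature check on every buffer.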
I note that x86_64 comes with SSE and SSE2 by default and that taking branches (e.g., determining at run time whether SSE1 or SSE2 is available), particularly at low period sizes (64), is expensive, so it would be nice if more audio code, when compiled for x86_64, used SSE by default, without the run-time test.
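One way to drop the run-time test, sketched under the assumption that the compiler predefines __SSE__ whenever SSE is enabled (GCC and Clang do, and SSE is enabled by default when targeting x86_64), is to select the code path with the preprocessor. The function name mix_buffers is only illustrative.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)   /* predefined when SSE is enabled, e.g. on x86_64 */
#include <xmmintrin.h>
#endif

/* dst[i] += src[i], with the code path fixed at compile time. */
void mix_buffers(float *dst, const float *src, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 d = _mm_loadu_ps(dst + i);
        __m128 s = _mm_loadu_ps(src + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(d, s));
    }
#endif
    for (; i < n; i++)   /* scalar tail, or the whole loop without SSE */
        dst[i] += src[i];
}
```

On a 32-bit build without SSE flags the same source quietly compiles to the scalar loop, so nothing can hit an illegal instruction.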
I have heard good things about the current development branch of gcc,
but gcc 4.1 still has a _long_ way to go when it comes to vectorizing
(=writing code using parallel SIMD instructions, in other words SSE).
gcc 4.3, which is a long way from working, has a vectorizer which understands the type conversions so critical to SSE usage. (Sampo's famed assembler SSE peak code uses a clever type conversion.)
To enable automatic vectorization in gcc 4.1.X, turn it on with -ftree-vectorize
and see what it is doing with -ftree-vectorizer-verbose=5
And weep. For example, in the zillions of lines of code in Ardour 2, only 2 loops get automatically vectorized with gcc 4.1.X.
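For anyone who wants to try this, below is a sketch of the kind of loop an auto-vectorizer has the best chance with: a plain counted loop over float arrays, with restrict promising that the buffers do not overlap. Whether a particular gcc version actually vectorizes it is something to check in the compiler's report, not something this example guarantees. The function name and the file name mix.c are made up.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Build with, for example:
 *   gcc -std=c99 -O2 -ftree-vectorize -c mix.c
 * Newer GCC releases report vectorized loops with -fopt-info-vec.
 */
void mix_buffers_with_gain(float *restrict dst, const float *restrict src,
                           size_t n, float gain)
{
    /* restrict: the caller promises dst and src never overlap */
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i] * gain;
}
```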
Hand-written assembler is still many times faster than what gcc is
capable of doing. In Ardour peak computation (for both metering and
waveform displaying) is written in SSE (the first part in pure assembly,
the second in a C-level abstraction which is almost 1:1 assembly). Both
functions are more than 20x faster in raw performance than what gcc 4.1
can do.
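To give a flavour of what a peak routine in the "C-level abstraction" style looks like, here is a small sketch using SSE intrinsics. It is not the Ardour code and makes no claim about matching its speed. It simply clears the sign bit of four samples at a time and keeps a running maximum. The name compute_peak and its signature are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Return max(current, max |buf[i]|), as a meter or waveform display needs. */
float compute_peak(const float *buf, size_t n, float current)
{
    size_t i = 0;
#if defined(__SSE__)
    const __m128 sign = _mm_set1_ps(-0.0f);   /* only the sign bit is set */
    __m128 peak = _mm_set1_ps(current);
    for (; i + 4 <= n; i += 4) {
        /* andnot computes (~sign) & x, which is |x| for floats */
        __m128 v = _mm_andnot_ps(sign, _mm_loadu_ps(buf + i));
        peak = _mm_max_ps(peak, v);
    }
    float lanes[4];
    _mm_storeu_ps(lanes, peak);               /* reduce the four lanes */
    for (int k = 0; k < 4; k++)
        if (lanes[k] > current)
            current = lanes[k];
#endif
    for (; i < n; i++) {                      /* leftover samples */
        float a = buf[i] < 0.0f ? -buf[i] : buf[i];
        if (a > current)
            current = a;
    }
    return current;
}
```

The unaligned loads keep the sketch safe for any buffer. With buffers known to be 16-byte aligned, _mm_load_ps could be used instead.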
Not only that, but writing SSE code is FUN! It's one of the few cases left in this world where the human can still be smarter than the compiler!
Sampo
_______________________________________________
Linux-audio-user mailing list
Linux-audio-user@xxxxxxxxxxxxxxxxxxxx
http://lists.linuxaudio.org/mailman/listinfo.cgi/linux-audio-user
--
Mike Taht
PostCards From the Bleeding Edge
http://the-edge.blogspot.com