On Thursday, April 04, 2013 04:08:43 PM Justin Chudgar wrote:
> I had experimentally thrown an optimization into my module's only
> significantly warm functions. Since I am a novice, this was a
> just-for-kicks experiment, but I would like to know whether to optimize
> at all beyond the general "-O2", and what platforms are critical to
> consider, since I only use pulse on systems that are sufficient to run
> at "-O0" without noticeable problems beyond unnecessary power
> consumption.
>
> From another thread:
> > I'm not sure what to think about the __attribute__((optimize(3))) usage.
> > Have you done some benchmarking that shows that the speedup is
> > significant compared to the normal -O2? If yes, I guess we can keep
> > them. <tanuk>
>
> I don't know what to think of them either. I did a really simplistic
> benchmark with the algorithm on my Core i3 laptop, initially to determine
> whether it was worth keeping everything double or switching to float.
> There was no benefit to reducing precision on this one system, but that
> attribute made a dramatic difference. I did not try -O2, though, just -O3
> and -O0. I thought about messing with vectorization, but I only have
> x86-64 PCs, and that seems most valuable for embedded devices, which I
> cannot test at the moment.
>
> 11: Determine optimization strategy for filter code.
> http://github.com/justinzane/pulseaudio/issues/issue/11

Here are some very simplistic benchmark results for
"__attribute__((optimize(#))) function()" in code similar to a biquad
filter:

optimize(0), 1867570825, 27.828974
optimize(1), 1017762024, 15.165836
optimize(2), 951896198, 14.184359
optimize(3), 952574300, 14.194463

This is for "memchunk" analogs of a single channel of 2^16 doubles being
filtered, averaged over 2^10 runs, with forced CPU affinity. The benchmark
itself was compiled with -O0.

With the supporting code compiled with -O2, the numbers are:

optimize(0), 1436955156, 21.412300
optimize(1), 1020384309, 15.204911
optimize(2), 952980992, 14.200523
optimize(3), 952473365, 14.192959

Not much difference there. With the benchmark compiled with -O3, there is
a DRASTIC change:

optimize(0), 1442046736, 21.488171
optimize(1), 1017924249, 15.168253
optimize(2), 954029138, 14.216142
optimize(3), 374432, 0.005579

That was such a freakish improvement that I ran it several times, but the
results are quite reliable on my dev system.

Replacing optimize(#) with hot and compiling the whole thing with -O3
gives:

hot, 310780, 0.004631

And removing the __attribute__ altogether, again compiling the whole thing
with -O3, gives:

<NONE>, 333013, 0.004962

Being generally a novice using a VERY simplistic wrapper around a rather
simple function, I'm loath to draw too many conclusions. However, this
suggests that it might be worth using __attribute__((hot)) on any serious
number-crunching functions within pulse and adopting -O3 as the standard
compiler flag. If I can figure out oprofile or something similar, I'll try
to test this properly. I'd also like to hear general feedback about this,
since I'm just learning.

Thanks, all.
Justin
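
P.S. For anyone who wants to poke at this, here is a minimal sketch of the
kind of wrapper I am describing. It is illustrative only, not the actual
module code: the coefficients, names, and timing approach are placeholders,
and the CPU-affinity pinning mentioned above is left out.

/* bench_biquad.c -- illustrative sketch only, not the real module code.
 * Build with:  gcc -O0 -o bench_biquad bench_biquad.c
 * (add -lrt on old glibc for clock_gettime), then repeat with -O2/-O3
 * and with the per-function attribute changed or removed.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NFRAMES (1 << 16)   /* one "memchunk" analog: 2^16 doubles */
#define NRUNS   (1 << 10)   /* average the cost over 2^10 repetitions */

/* Direct-form-I biquad with made-up coefficients; the per-function
 * attribute is the thing being measured, so swap it between
 * optimize(0..3), hot, or nothing between builds. */
__attribute__((optimize(3)))
static void biquad(double *dst, const double *src, size_t n) {
    /* arbitrary low-pass-ish coefficients, just to keep the FPU busy */
    const double b0 = 0.2, b1 = 0.4, b2 = 0.2, a1 = -0.3, a2 = 0.1;
    double xm1 = 0.0, xm2 = 0.0, ym1 = 0.0, ym2 = 0.0;
    for (size_t i = 0; i < n; i++) {
        double x0 = src[i];
        double y0 = b0*x0 + b1*xm1 + b2*xm2 - a1*ym1 - a2*ym2;
        dst[i] = y0;
        xm2 = xm1; xm1 = x0;
        ym2 = ym1; ym1 = y0;
    }
}

static long long ns_now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void) {
    double *src = malloc(NFRAMES * sizeof(double));
    double *dst = malloc(NFRAMES * sizeof(double));
    if (!src || !dst)
        return 1;
    for (size_t i = 0; i < NFRAMES; i++)
        src[i] = (double)rand() / RAND_MAX - 0.5;

    long long total = 0;
    for (int run = 0; run < NRUNS; run++) {
        long long t0 = ns_now();
        biquad(dst, src, NFRAMES);
        total += ns_now() - t0;
    }
    /* print the average nanoseconds per run; printing a sample of dst
     * keeps the result live so the filter call cannot be discarded */
    printf("avg %lld ns/run (last sample %f)\n",
           total / NRUNS, dst[NFRAMES - 1]);
    free(src);
    free(dst);
    return 0;
}

Building that file a few times with -O0/-O2/-O3 and toggling the attribute
should reproduce the general shape of the results above, even if the
absolute numbers differ from machine to machine.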