"Hot" function optimization recommendations

kugel@xxxxxxxxxxx (Thomas Martitz) · Tue, 09 Apr 2013 09:49:18 +0200

On 08.04.2013 21:02, Justin Chudgar wrote:
> On Thursday, April 04, 2013 04:08:43 PM Justin Chudgar wrote:
>> I had experimentally thrown an optimization into my module's only
>> significantly warm functions. Since I am a novice, this was a
>> just-for-kicks experiment, but I would like to know whether to optimize at
>> all beyond the general "-O2", and what platforms are critical to consider
>> since I only use pulse on systems that are sufficient to run at "-O0"
>> without noticeable problems beyond unnecessary power consumption.
>>
>> From another thread:
>>> I'm not sure what to think about the __attribute__((optimize(3))) usage.
>>> Have you done some benchmarking that shows that the speedup is
>>> significant compared to the normal -O2? If yes, I guess we can keep
>>> them. <tanuk>
>> I don't know what to think of them either. I did a really simplistic benchmark
>> with the algorithm on my core i3 laptop initially to determine if it was
>> useful to keep everything double or float. There was no benefit to reducing
>> precision on this one system, but the effect of that attribute was dramatic.
>> Did not try O2, though, just O3 and O0. I thought about messing with
>> vectorization, but I only have x86-64 PCs and that seems most valuable for
>> embedded devices which I cannot test at the moment.
>>
>> 11: Determine optimization strategy for filter code.
>> http://github.com/justinzane/pulseaudio/issues/issue/11
>>
>>
> Just some very simplistic benchmark results of
> 	"__attribute__((optimize(#))) function()"
> in code similar to a biquad filter:
> 	optimize(0), 1867570825, 27.828974
> 	optimize(1), 1017762024, 15.165836
> 	optimize(2), 951896198, 14.184359
> 	optimize(3), 952574300, 14.194463
> This is for "memchunk" analogs of single channel 2^16 doubles being filtered
> and averaged over 2^10 runs with forced cpu affinity. The benchmark itself was
> compiled with -O0.
>
> With the supporting code compiled -O2, the numbers are:
> 	optimize(0), 1436955156, 21.412300
> 	optimize(1), 1020384309, 15.204911
> 	optimize(2), 952980992, 14.200523
> 	optimize(3), 952473365, 14.192959
> Not much difference there.
>
> With the benchmark compiled -O3, there is a DRASTIC change:
> 	optimize(0), 1442046736, 21.488171
> 	optimize(1), 1017924249, 15.168253
> 	optimize(2), 954029138, 14.216142
> 	optimize(3), 374432, 0.005579
> That was such a freakish improvement that I ran it several times, but the
> results are quite reliable on my dev system.
>

This seems wrong. Does the code still execute *correctly*? Does it even
run the benchmark at all at -O3? I suspect -O3 optimized large sections
of code away, which may (or may not) produce incorrect code, perhaps
because the benchmark code relies on undefined behavior or because of a
bug in gcc.
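
To illustrate the concern, here is a minimal sketch of a benchmark whose
result the compiler cannot discard. It is not the code from the original
benchmark: the biquad struct, the HOT macro, and the function names
biquad_process, checksum and bench_biquad are all hypothetical, chosen
only for this example. The idea is to sum the filter output, store the
sum in a volatile variable and print it, so that the filtered data stays
observable at every optimization level and the checksums can be compared
between builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical biquad state; not PulseAudio's actual data structure. */
typedef struct {
    double b0, b1, b2, a1, a2; /* coefficients, a0 normalized to 1 */
    double z1, z2;             /* transposed direct form II state */
} biquad;

/* GCC-specific per-function optimization level; a no-op elsewhere. */
#if defined(__GNUC__) && !defined(__clang__)
#define HOT __attribute__((optimize(3)))
#else
#define HOT
#endif

/* Filter n samples from in to out, updating the filter state. */
HOT void biquad_process(biquad *f, const double *in, double *out, size_t n)
{
    assert(f != NULL && in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        double x = in[i];
        double y = f->b0 * x + f->z1;
        f->z1 = f->b1 * x - f->a1 * y + f->z2;
        f->z2 = f->b2 * x - f->a2 * y;
        out[i] = y;
    }
}

/* Sum the buffer so that every output sample contributes to the result. */
double checksum(const double *buf, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}

/* A volatile store must be performed, so the value feeding it must be computed. */
static volatile double sink;

/* Run the filter `runs` times, print CPU time and checksum, return the checksum. */
double bench_biquad(biquad *f, const double *in, double *out,
                    size_t n, unsigned runs)
{
    double sum = 0.0;
    clock_t t0 = clock();
    for (unsigned r = 0; r < runs; r++) {
        biquad_process(f, in, out, n);
        sum += checksum(out, n);
    }
    clock_t t1 = clock();
    sink = sum;
    printf("%u runs: %.6f s, checksum %.17g\n",
           runs, (double)(t1 - t0) / CLOCKS_PER_SEC, sum);
    return sum;
}
```

Calling bench_biquad from a main() with the same input at each
optimization level and comparing the printed checksums would show
whether the fast build still does the same work. A run that is faster
by orders of magnitude and prints a different checksum (or none at all)
points to eliminated or miscompiled code rather than a real speedup.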

Best regards.