26.04.2014 18:19, Alexander E. Patrakov wrote: > 24.04.2014 22:09, Peter Meerwald wrote: >> From: Peter Meerwald <p.meerwald at bct-electronic.com> >> >> The generic matrix remapping is rather inefficient; special-case code >> improves performance by 3x easily. > > I have looked at this and the 10th patch. For 10/11, I have no > objections. 11/11 definitely works and improves things, but... > >> +static void remap_stereo_to_mono_s16ne_c(pa_remap_t *m, int16_t *dst, >> const int16_t *src, unsigned n) { >> + unsigned i; >> + >> + for (i = n >> 2; i > 0; i--) { >> + dst[0] = (src[0] + src[1])/2; >> + dst[1] = (src[2] + src[3])/2; >> + dst[2] = (src[4] + src[5])/2; >> + dst[3] = (src[6] + src[7])/2; >> + src += 8; >> + dst += 4; >> + } >> + for (i = n & 3; i; i--) { >> + dst[0] = (src[0] + src[1])/2; >> + src += 2; >> + dst += 1; >> + } >> +} > > Why are we doing the compiler's job here? Yes, I understand that there > are precedents of manually unrolling the loop here, but this actually > slows things down with -O3 on gcc-4.8.2! Here are my results regarding > stereo to mono s16ne conversions with different CFLAGS on an amd64 > machine (Intel(R) Core(TM) i7-4770S forced to 3.9 GHz by Intel Turbo > Boost). > > The tests below are with the cpu-test rework patches applied (but not > reviewed). > > With -O2 -pipe, and your code, I get: > > Checking special remap (s16, stereo->mono) > Forced to use generic matrix remapping > Using stereo to mono remapping > Testing remap performance with 3 sample alignment > func: 62098 usec (avg: 620.98, min = 612, max = 764, stddev = 20.9442). > orig: 125770 usec (avg: 1257.7, min = 1247, max = 1392, stddev = 24.9169). > > With -O3 -pipe, and your code, I get: > > Checking special remap (s16, stereo->mono) > Forced to use generic matrix remapping > Using stereo to mono remapping > Testing remap performance with 3 sample alignment > func: 120105 usec (avg: 1201.05, min = 1157, max = 1472, stddev = 50.5987). > orig: 127543 usec (avg: 1275.43, min = 1234, max = 1682, stddev = 56.4764). > > Now let's test this: > > static void remap_stereo_to_mono_s16ne_c(pa_remap_t *m, int16_t *dst, > const int16_t *src, unsigned n) { > while (n--) { > dst[0] = (src[0] + src[1])/2; > src += 2; > dst += 1; > } > } > > With -O2 -pipe: > > Checking special remap (s16, stereo->mono) > Forced to use generic matrix remapping > Using stereo to mono remapping > Testing remap performance with 3 sample alignment > func: 82468 usec (avg: 824.68, min = 814, max = 984, stddev = 23.8113). > orig: 126014 usec (avg: 1260.14, min = 1248, max = 1429, stddev = 27.8855). > > With -O3 -pipe: > > Checking special remap (s16, stereo->mono) > Forced to use generic matrix remapping > Using stereo to mono remapping > Testing remap performance with 3 sample alignment > func: 57797 usec (avg: 577.97, min = 567, max = 687, stddev = 18.9386). > orig: 123601 usec (avg: 1236.01, min = 1219, max = 1377, stddev = 30.3412). > > I.e. -O3 with the simplest possible implementation slightly beats your > hand-optimized loop here. probably because the compiler was smart enough > to insert some SSE2 stuff automatically. > > The above should not be counted as an objection to your patch. We can > always clean up this and the existing hand-rolled code later. > > Now waiting while clang-3.4 compiles... > Simple code, clang, -O2 -pipe: Checking special remap (s16, stereo->mono) Forced to use generic matrix remapping Using stereo to mono remapping Testing remap performance with 3 sample alignment func: 82375 usec (avg: 823.75, min = 794, max = 1387, stddev = 90.7334). orig: 134835 usec (avg: 1348.35, min = 1263, max = 2151, stddev = 110.471). Simple code, clang, -O3 -pipe: Checking special remap (s16, stereo->mono) Forced to use generic matrix remapping Using stereo to mono remapping Testing remap performance with 3 sample alignment func: 80987 usec (avg: 809.87, min = 794, max = 1016, stddev = 30.6149). orig: 130819 usec (avg: 1308.19, min = 1287, max = 1507, stddev = 38.0144). Your code, -O2 -pipe: Checking special remap (s16, stereo->mono) Forced to use generic matrix remapping Using stereo to mono remapping Testing remap performance with 3 sample alignment func: 63764 usec (avg: 637.64, min = 615, max = 946, stddev = 39.2402). orig: 132069 usec (avg: 1320.69, min = 1302, max = 1658, stddev = 45.6867). Your code, -O3 -pipe: Checking special remap (s16, stereo->mono) Forced to use generic matrix remapping Using stereo to mono remapping Testing remap performance with 3 sample alignment func: 61143 usec (avg: 611.43, min = 598, max = 801, stddev = 32.9057). orig: 130071 usec (avg: 1300.71, min = 1286, max = 1641, stddev = 43.0877). OK, so on clang your code has its benefits. Keep it. -- Alexander E. Patrakov