24.04.2014 22:09, Peter Meerwald wrote: > From: Peter Meerwald <p.meerwald at bct-electronic.com> > > The generic matrix remapping is rather inefficient; special-case code > improves performance by 3x easily. I have looked at this and the 10th patch. For 10/11, I have no objections. 11/11 definitely works and improves things, but... > +static void remap_stereo_to_mono_s16ne_c(pa_remap_t *m, int16_t *dst, const int16_t *src, unsigned n) { > + unsigned i; > + > + for (i = n >> 2; i > 0; i--) { > + dst[0] = (src[0] + src[1])/2; > + dst[1] = (src[2] + src[3])/2; > + dst[2] = (src[4] + src[5])/2; > + dst[3] = (src[6] + src[7])/2; > + src += 8; > + dst += 4; > + } > + for (i = n & 3; i; i--) { > + dst[0] = (src[0] + src[1])/2; > + src += 2; > + dst += 1; > + } > +} Why are we doing the compiler's job here? Yes, I understand that there are precedents of manually unrolling the loop here, but this actually slows things down with -O3 on gcc-4.8.2! Here are my results regarding stereo to mono s16ne conversions with different CFLAGS on an amd64 machine (Intel(R) Core(TM) i7-4770S forced to 3.9 GHz by Intel Turbo Boost). The tests below are with the cpu-test rework patches applied (but not reviewed). With -O2 -pipe, and your code, I get: Checking special remap (s16, stereo->mono) Forced to use generic matrix remapping Using stereo to mono remapping Testing remap performance with 3 sample alignment func: 62098 usec (avg: 620.98, min = 612, max = 764, stddev = 20.9442). orig: 125770 usec (avg: 1257.7, min = 1247, max = 1392, stddev = 24.9169). With -O3 -pipe, and your code, I get: Checking special remap (s16, stereo->mono) Forced to use generic matrix remapping Using stereo to mono remapping Testing remap performance with 3 sample alignment func: 120105 usec (avg: 1201.05, min = 1157, max = 1472, stddev = 50.5987). orig: 127543 usec (avg: 1275.43, min = 1234, max = 1682, stddev = 56.4764). Now let's test this: static void remap_stereo_to_mono_s16ne_c(pa_remap_t *m, int16_t *dst, const int16_t *src, unsigned n) { while (n--) { dst[0] = (src[0] + src[1])/2; src += 2; dst += 1; } } With -O2 -pipe: Checking special remap (s16, stereo->mono) Forced to use generic matrix remapping Using stereo to mono remapping Testing remap performance with 3 sample alignment func: 82468 usec (avg: 824.68, min = 814, max = 984, stddev = 23.8113). orig: 126014 usec (avg: 1260.14, min = 1248, max = 1429, stddev = 27.8855). With -O3 -pipe: Checking special remap (s16, stereo->mono) Forced to use generic matrix remapping Using stereo to mono remapping Testing remap performance with 3 sample alignment func: 57797 usec (avg: 577.97, min = 567, max = 687, stddev = 18.9386). orig: 123601 usec (avg: 1236.01, min = 1219, max = 1377, stddev = 30.3412). I.e. -O3 with the simplest possible implementation slightly beats your hand-optimized loop here. probably because the compiler was smart enough to insert some SSE2 stuff automatically. The above should not be counted as an objection to your patch. We can always clean up this and the existing hand-rolled code later. Now waiting while clang-3.4 compiles... -- Alexander E. Patrakov