[PATCH 11/11] remap: Add stereo to mono and 4-channel special case remapping

patrakov@xxxxxxxxx (Alexander E. Patrakov) · Sat, 26 Apr 2014 18:19:57 +0600

24.04.2014 22:09, Peter Meerwald wrote:
> From: Peter Meerwald <p.meerwald at bct-electronic.com>
>
> The generic matrix remapping is rather inefficient; special-case code
> improves performance by 3x easily.

I have looked at this and the 10th patch. For 10/11, I have no 
objections. 11/11 definitely works and improves things, but...

> +static void remap_stereo_to_mono_s16ne_c(pa_remap_t *m, int16_t *dst, const int16_t *src, unsigned n) {
> +    unsigned i;
> +
> +    for (i = n >> 2; i > 0; i--) {
> +        dst[0] = (src[0] + src[1])/2;
> +        dst[1] = (src[2] + src[3])/2;
> +        dst[2] = (src[4] + src[5])/2;
> +        dst[3] = (src[6] + src[7])/2;
> +        src += 8;
> +        dst += 4;
> +    }
> +    for (i = n & 3; i; i--) {
> +        dst[0] = (src[0] + src[1])/2;
> +        src += 2;
> +        dst += 1;
> +    }
> +}

Why are we doing the compiler's job here? Yes, I understand that there 
are precedents of manually unrolling the loop here, but this actually 
slows things down with -O3 on gcc-4.8.2! Here are my results regarding 
stereo to mono s16ne conversions with different CFLAGS on an amd64 
machine (Intel(R) Core(TM) i7-4770S forced to 3.9 GHz by Intel Turbo Boost).

The tests below are with the cpu-test rework patches applied (but not 
reviewed).

With -O2 -pipe, and your code, I get:

Checking special remap (s16, stereo->mono)
Forced to use generic matrix remapping
Using stereo to mono remapping
Testing remap performance with 3 sample alignment
func: 62098 usec (avg: 620.98, min = 612, max = 764, stddev = 20.9442).
orig: 125770 usec (avg: 1257.7, min = 1247, max = 1392, stddev = 24.9169).

With -O3 -pipe, and your code, I get:

Checking special remap (s16, stereo->mono)
Forced to use generic matrix remapping
Using stereo to mono remapping
Testing remap performance with 3 sample alignment
func: 120105 usec (avg: 1201.05, min = 1157, max = 1472, stddev = 50.5987).
orig: 127543 usec (avg: 1275.43, min = 1234, max = 1682, stddev = 56.4764).

Now let's test this:

static void remap_stereo_to_mono_s16ne_c(pa_remap_t *m, int16_t *dst, 
const int16_t *src, unsigned n) {
     while (n--) {
         dst[0] = (src[0] + src[1])/2;
         src += 2;
         dst += 1;
     }
}

With -O2 -pipe:

Checking special remap (s16, stereo->mono)
Forced to use generic matrix remapping
Using stereo to mono remapping
Testing remap performance with 3 sample alignment
func: 82468 usec (avg: 824.68, min = 814, max = 984, stddev = 23.8113).
orig: 126014 usec (avg: 1260.14, min = 1248, max = 1429, stddev = 27.8855).

With -O3 -pipe:

Checking special remap (s16, stereo->mono)
Forced to use generic matrix remapping
Using stereo to mono remapping
Testing remap performance with 3 sample alignment
func: 57797 usec (avg: 577.97, min = 567, max = 687, stddev = 18.9386).
orig: 123601 usec (avg: 1236.01, min = 1219, max = 1377, stddev = 30.3412).

I.e. -O3 with the simplest possible implementation slightly beats your 
hand-optimized loop here. probably because the compiler was smart enough 
to insert some SSE2 stuff automatically.

The above should not be counted as an objection to your patch. We can 
always clean up this and the existing hand-rolled code later.

Now waiting while clang-3.4 compiles...

-- 
Alexander E. Patrakov