>> There is also an array lookup of the channel volume (every for loop
>> cycle), and two increment variables. With an ARM processor this is
>> probably enough extra variables to go past the number of registers and
>> cause stack manipulation. The easiest thing would be to process
>> one channel at a time, incrementing your pointer properly and using
>> the end of the array pointer as a stop point instead of keeping two
>> count variables. I also hope you have optimizations turned on in your
>> compiler or you will get a divide instead of a shift.
>
> Hmm, going through the memory block more than once, wouldn't that be
> bad for the cache?

Cache only helps if you are reading the same address multiple times; if
you read each address once, cache does nothing for you. It is possible
that, because these are 16-bit samples, you would read each 32 bits from
memory twice if you went through the channels separately. Even so, you
still aren't destroying your cache unless your buffer is bigger than the
CPU cache (or half its size). Typically pulse deals with 4k-8k buffers,
and I don't know of any CPUs with a cache that small.

> The volume array is just two entries in most cases
> (i.e. stereo). Shouldn't be that bad.

It still adds local variables, and every array lookup is the equivalent of
a multiply (array[i] is &array + sizeof(type) * i), which is still much
slower than keeping the value in a register. The increment and the if
branch at the bottom can also kill loop performance, especially because in
the 2-channel case the branch changes direction every cycle.

> I'd be very happy to take patches like this if they are clean and not
> too invasive. Having special code for certain channel setups is
> absolutely fine for me.
>
> Please make sure to submit patches like this one upstream!

I've been trying to figure out a way to make the patch less invasive, but
I do have performance as a primary goal. I hope that I have enough time to
clean the patches up and get them upstream.

> What surprises me however is that your mixing is still fast even if
> you iterate through your buffers all at the same time. I have not
> spent much time doing optimization work, however, I'd assume that
> improving locality of the data (i.e. by not looking at all mix buffers
> at the same time, but just two) would bring the best speed benefits.

Locality really only helps if you are doing lots of operations on one
piece of memory. Since we are in a streaming processing mode, we typically
see the data only once as it goes through, so memory access is generally
not something we can improve in this situation. That means we are limited
by the arithmetic and branching operations we are doing. Although locality
and cache can help speed things up in a stream processing case, you are
usually better served by shrinking your inner loops as much as possible
and avoiding branches. Once you have done that, the compiler is generally
better at optimizing the loops than most other tricks are.
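To make the "one channel at a time with an end pointer" suggestion from
above a little more concrete, here is roughly the loop shape I am
describing: one pointer, the end of the buffer as the stop condition, and
the volume applied with a shift. This is only a sketch to illustrate the
idea (the function name and the 16.16 fixed-point format are assumptions
for the example, not code from my patch):

#include <stdint.h>

/* Sketch only: scale one channel of an interleaved stereo s16 buffer in
 * place. "end" points one past the last sample, so there is no separate
 * counter, and the volume is applied with a shift instead of a divide. */
static void scale_one_channel(int16_t *p, const int16_t *end, int32_t volume)
{
    for (; p < end; p += 2) {               /* stride 2: every other sample */
        int32_t v = (*p * volume) >> 16;    /* 16.16 fixed point */
        if (v < -0x8000) v = -0x8000;       /* clamp back into s16 range */
        else if (v > 0x7FFF) v = 0x7FFF;
        *p = (int16_t) v;
    }
}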
Take for example the current pa_mix function:

for (d = 0;; d += sizeof(int16_t)) {
    int32_t sum = 0;
    unsigned i;

    if (PA_UNLIKELY(d >= length))
        goto finish;

    for (i = 0; i < nstreams; i++) {
        pa_mix_info *m = streams + i;
        int32_t v, cv = m->linear[channel].i;

        if (PA_UNLIKELY(d >= m->chunk.length))
            goto finish;

        if (PA_UNLIKELY(cv <= 0) || PA_UNLIKELY(!!mute) || PA_UNLIKELY(linear[channel] <= 0))
            v = 0;
        else {
            v = *((int16_t*) m->ptr);
            v = (v * cv) / 0x10000;
        }

        sum += v;
        m->ptr = (uint8_t*) m->ptr + sizeof(int16_t);
    }

    sum = PA_CLAMP_UNLIKELY(sum, -0x8000, 0x7FFF);
    sum = (sum * linear[channel]) / 0x10000;
    *((int16_t*) data) = (int16_t) sum;

    data = (uint8_t*) data + sizeof(int16_t);

    if (PA_UNLIKELY(++channel >= spec->channels))
        channel = 0;
}

Here are a few problem spots that I see:

if (PA_UNLIKELY(d >= m->chunk.length))
    goto finish;

This is completely unnecessary inside the loop (especially the inner one)
and could be tested beforehand.

v = (v * cv) / 0x10000;

I prefer to explicitly make this a shift and not leave it to the compiler
to help us out.

if (PA_UNLIKELY(cv <= 0) || PA_UNLIKELY(!!mute) || PA_UNLIKELY(linear[channel] <= 0))
    v = 0;
else {
    v = *((int16_t*) m->ptr);
    v = (v * cv) / 0x10000;
}

This three-way if statement is often much worse than just doing the
multiply against 0. It costs three tests in every case and only saves one
multiply (typically the processor can tack a shift onto the end of a
multiply, so that part is free) in a rare case. If you really want to
check for negative values, you could probably do it before the loop and
simply leave those channels out of the computation.

for (d = 0;; d += sizeof(int16_t)) {
...
    m->ptr = (uint8_t*) m->ptr + sizeof(int16_t);

Because we are dealing with int16_t in this case, all of our pointers
should be of that type. This gives the most information to the compiler
for optimization. Using other pointer types (I assume because it made for
cleaner-looking code) could slow us down.

if (PA_UNLIKELY(++channel >= spec->channels))
    channel = 0;

I still really don't like this, because the branch changes frequently
(which kills branch prediction). And because the loop is written for a
generic number of channels, you force the compiler to do an array lookup
on the channel volume every time instead of letting it possibly keep that
value in a register. If you look at fast mix, this is essentially what I
am eliminating (along with trading the channel inner loop for 8
functions).

If I really had to do it in a "generic" way, I would make each sink
install a set of mixing functions (to avoid the giant case statement in
pa_mix and the "generic" code); these mixing functions would be
specialized for the output of the sink. Because pulse already resamples
everything to one output type for a sink, the sink only needs to worry
about one special case and can focus on making that fast. It could also
allow a few trade-offs, such as "treat all channels as having only one
volume" or "no software volume", where they make sense. Then for pulse
overall we could optimize the most common output types and use the current
"generic" code as a fallback.

Keith Preston
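P.S. To make the per-sink specialized mixing function idea a bit more
concrete, here is a rough sketch of what one such function could look like
for the common s16ne stereo case. The name, signature and the 16.16
fixed-point volumes are made up for illustration; this is not the fast-mix
code itself, just the shape of it:

#include <stdint.h>

/* Sketch only: mix one stream into an s16ne stereo output buffer.
 * The channel count is fixed, so both volumes stay in registers, there is
 * no per-sample channel-wrap branch, the length check happens once via the
 * end pointer, and a muted/zero volume simply multiplies the sample to 0. */
static void mix_one_stream_s16_stereo(int16_t *dst, const int16_t *dst_end,
                                      const int16_t *src,
                                      int32_t vol_l, int32_t vol_r)
{
    while (dst < dst_end) {
        int32_t l = dst[0] + ((src[0] * vol_l) >> 16); /* shift, not / 0x10000 */
        int32_t r = dst[1] + ((src[1] * vol_r) >> 16);

        /* clamp back into the s16 range */
        if (l < -0x8000) l = -0x8000; else if (l > 0x7FFF) l = 0x7FFF;
        if (r < -0x8000) r = -0x8000; else if (r > 0x7FFF) r = 0x7FFF;

        dst[0] = (int16_t) l;
        dst[1] = (int16_t) r;
        dst += 2;
        src += 2;
    }
}

A sink would install one function like this per case it supports, and
pa_mix would call through that table instead of branching per sample.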