>> There is also an array lookup of the channel volume (every for loop
>> cycle), and two increment variables. With an ARM processor this is
>> probably enough extra variables to go past the number of registers and
>> cause stack manipulation. The easiest thing would be to process
>> one channel at a time, incrementing your pointer properly and using
>> the end of the array pointer as a stop point instead of keeping two
>> count variables. I also hope you have optimizations turned on in your
>> compiler or you will get a divide instead of a shift.
>
> Hmm, going through the memory block more than once, wouldn't that be
> bad for the cache?

Cache only helps if you are reading the same address multiple times; if
you read each address once, cache does nothing for you. It is possible
that, because these are 16-bit samples, you would read each 32 bits from
memory twice if you went through the channels separately. Even so, you
still aren't destroying your cache unless your buffer is bigger than the
CPU cache (or half its size). Typically pulse deals with 4k-8k buffers,
and I don't know of any CPUs with a cache that small.

> The volume array is just two entries in most cases
> (i.e. stereo). Shouldn't be that bad.

It still adds local variables, and every array lookup is the equivalent of
a multiply (array[i] is &array + sizeof(type) * i), which is still much
slower than keeping the value in a register. The increment and the if
branch at the bottom can also kill loop performance, especially because in
the 2-channel case the branch changes direction every cycle.

> I'd be very happy to take patches like this if they are clean and not
> too invasive. Having special code for certain channel setups is
> absolutely fine for me.
>
> Please make sure to submit patches like this one upstream!

I've been trying to figure out a way to make the patch less invasive, but
I do have performance as a primary goal. I hope that I have enough time to
clean the patches up and get them upstream.

> What surprises me however is that your mixing is still fast even if
> you iterate through your buffers all at the same time. I have not
> spent much time doing optimization work, however, I'd assume that
> improving locality of the data (i.e. by not looking at all mix buffers
> at the same time, but just two) would bring the best speed benefits.

Locality really only helps if you are doing lots of operations on one
piece of memory. Since we are in a streaming processing mode, we typically
see the data only once as it goes through, so memory access is generally
not something we can improve in this situation. That means we are limited
by the arithmetic and branching operations we are doing. Although locality
and cache can help speed things up in a stream processing case, you are
usually better served by shrinking your inner loops as much as possible
and avoiding branches. Once you have done that, the compiler is generally
better at optimizing the loops than most other tricks are.
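To make the "one channel at a time with an end pointer" suggestion from
above a little more concrete, here is roughly the loop shape I am
describing: one pointer, the end of the buffer as the stop condition, and
the volume applied with a shift. This is only a sketch to illustrate the
idea (the function name and the 16.16 fixed-point format are assumptions
for the example, not code from my patch):

#include <stdint.h>

/* Sketch only: scale one channel of an interleaved stereo s16 buffer in
 * place. "end" points one past the last sample, so there is no separate
 * counter, and the volume is applied with a shift instead of a divide. */
static void scale_one_channel(int16_t *p, const int16_t *end, int32_t volume)
{
    for (; p < end; p += 2) {               /* stride 2: every other sample */
        int32_t v = (*p * volume) >> 16;    /* 16.16 fixed point */
        if (v < -0x8000) v = -0x8000;       /* clamp back into s16 range */
        else if (v > 0x7FFF) v = 0x7FFF;
        *p = (int16_t) v;
    }
}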
Take for example the current pa_mix function:

for (d = 0;; d += sizeof(int16_t)) {
    int32_t sum = 0;
    unsigned i;

    if (PA_UNLIKELY(d >= length))
        goto finish;

    for (i = 0; i < nstreams; i++) {
        pa_mix_info *m = streams + i;
        int32_t v, cv = m->linear[channel].i;

        if (PA_UNLIKELY(d >= m->chunk.length))
            goto finish;

        if (PA_UNLIKELY(cv <= 0) || PA_UNLIKELY(!!mute) || PA_UNLIKELY(linear[channel] <= 0))
            v = 0;
        else {
            v = *((int16_t*) m->ptr);
            v = (v * cv) / 0x10000;
        }

        sum += v;
        m->ptr = (uint8_t*) m->ptr + sizeof(int16_t);
    }

    sum = PA_CLAMP_UNLIKELY(sum, -0x8000, 0x7FFF);
    sum = (sum * linear[channel]) / 0x10000;
    *((int16_t*) data) = (int16_t) sum;

    data = (uint8_t*) data + sizeof(int16_t);

    if (PA_UNLIKELY(++channel >= spec->channels))
        channel = 0;
}

Here are a few problem spots that I see:

if (PA_UNLIKELY(d >= m->chunk.length))
    goto finish;

This is completely unnecessary inside the loop (especially the inner one)
and could be tested beforehand.

v = (v * cv) / 0x10000;

I prefer to explicitly make this a shift and not leave it to the compiler
to help us out.

if (PA_UNLIKELY(cv <= 0) || PA_UNLIKELY(!!mute) || PA_UNLIKELY(linear[channel] <= 0))
    v = 0;
else {
    v = *((int16_t*) m->ptr);
    v = (v * cv) / 0x10000;
}

This three-way if statement is often much worse than just doing the
multiply against 0. It costs three tests in every case and only saves one
multiply (typically the processor can tack a shift onto the end of a
multiply, so that part is free) in a rare case. If you really want to
check for negative values, you could probably do it before the loop and
simply leave those channels out of the computation.

for (d = 0;; d += sizeof(int16_t)) {
...
    m->ptr = (uint8_t*) m->ptr + sizeof(int16_t);

Because we are dealing with int16_t in this case, all of our pointers
should be of that type. This gives the most information to the compiler
for optimization. Using other pointer types (I assume because it made for
cleaner-looking code) could slow us down.

if (PA_UNLIKELY(++channel >= spec->channels))
    channel = 0;

I still really don't like this, because the branch changes frequently
(which kills branch prediction). And because the loop is written for a
generic number of channels, you force the compiler to do an array lookup
on the channel volume every time instead of letting it possibly keep that
value in a register. If you look at fast mix, this is essentially what I
am eliminating (along with trading the channel inner loop for 8
functions).

If I really had to do it in a "generic" way, I would make each sink
install a set of mixing functions (to avoid the giant case statement in
pa_mix and the "generic" code); these mixing functions would be
specialized for the output of the sink. Because pulse already resamples
everything to one output type for a sink, the sink only needs to worry
about one special case and can focus on making that fast. It could also
allow a few trade-offs, such as "treat all channels as having only one
volume" or "no software volume", where they make sense. Then for pulse
overall we could optimize the most common output types and use the current
"generic" code as a fallback.

Keith Preston
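P.S. To make the per-sink specialized mixing function idea a bit more
concrete, here is a rough sketch of what one such function could look like
for the common s16ne stereo case. The name, signature and the 16.16
fixed-point volumes are made up for illustration; this is not the fast-mix
code itself, just the shape of it:

#include <stdint.h>

/* Sketch only: mix one stream into an s16ne stereo output buffer.
 * The channel count is fixed, so both volumes stay in registers, there is
 * no per-sample channel-wrap branch, the length check happens once via the
 * end pointer, and a muted/zero volume simply multiplies the sample to 0. */
static void mix_one_stream_s16_stereo(int16_t *dst, const int16_t *dst_end,
                                      const int16_t *src,
                                      int32_t vol_l, int32_t vol_r)
{
    while (dst < dst_end) {
        int32_t l = dst[0] + ((src[0] * vol_l) >> 16); /* shift, not / 0x10000 */
        int32_t r = dst[1] + ((src[1] * vol_r) >> 16);

        /* clamp back into the s16 range */
        if (l < -0x8000) l = -0x8000; else if (l > 0x7FFF) l = 0x7FFF;
        if (r < -0x8000) r = -0x8000; else if (r > 0x7FFF) r = 0x7FFF;

        dst[0] = (int16_t) l;
        dst[1] = (int16_t) r;
        dst += 2;
        src += 2;
    }
}

A sink would install one function like this per case it supports, and
pa_mix would call through that table instead of branching per sample.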