Sven Neumann <sven@xxxxxxxx> wrote:
>The code is combining the multiplications done on 2 channels of the
>same pixel into one. It is also meant as an example of what can
>be done without using CPU-specific instructions.

Here's another example (4 x 8-bit saturated addition):

/* uint32 is assumed to be a 32-bit unsigned integer type */
uint32 padd_sat_4x8(uint32 a, uint32 b)
{
    uint32 ta, tb, tm, q, u, m;

    /* save overflow-causing bits in ta, tb */
    ta = a & 0x80808080;
    tb = b & 0x80808080;
    /* add the low 7 bits of each byte; this cannot carry across bytes */
    q = a + b - (ta + tb);

    /* determine overflow conditions */
    tm = ta | tb;
    u = (ta & tb) | (q & tm);

    /* u now contains overflow bits, propagate them over fields */
    m = (u << 1) - (u >> 7);
    return (q + tm - u) | m;
}

This is completely portable, and should be a good deal faster than
conditionally adding each component separately, at least on modern
superscalar machines with expensive mispredicted branches. Benchmarks
confirm this.

Extending the above to 8 x 8-bit (using 64-bit integers) is trivial, of
course.
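
For reference, here is a rough sketch of what the 8 x 8-bit variant might
look like. It is not from the original post: the names padd_sat_8x8 and
padd_sat_8x8_ref are made up for illustration, and it assumes a C99
compiler that provides uint64_t in <stdint.h>. The second function is a
plain per-byte loop, included only to check the packed version against.

```c
#include <stdint.h>

/* Hypothetical 64-bit variant: 8 x 8-bit saturated addition. */
uint64_t padd_sat_8x8(uint64_t a, uint64_t b)
{
    const uint64_t hi = UINT64_C(0x8080808080808080);
    uint64_t ta, tb, tm, q, u, m;

    /* save the top bit of every byte in ta, tb */
    ta = a & hi;
    tb = b & hi;
    /* add the low 7 bits of each byte; this cannot carry across bytes */
    q = (a - ta) + (b - tb);

    /* a byte overflows if both top bits are set, or if one is set
       and the 7-bit sum carried into bit 7 */
    tm = ta | tb;
    u = (ta & tb) | (q & tm);

    /* expand each overflow bit into a full 0xFF byte mask;
       unsigned wraparound takes care of the topmost byte */
    m = (u << 1) - (u >> 7);
    return (q + tm - u) | m;
}

/* Straightforward per-byte reference, for checking only. */
uint64_t padd_sat_8x8_ref(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    int i;

    for (i = 0; i < 8; i++) {
        unsigned s = (unsigned)((a >> (8 * i)) & 0xff)
                   + (unsigned)((b >> (8 * i)) & 0xff);
        if (s > 0xff)
            s = 0xff;               /* clamp to the 8-bit maximum */
        r |= (uint64_t)s << (8 * i);
    }
    return r;
}
```

The only changes from the 32-bit version are the wider type and the
wider mask constant; the shift counts stay the same because the field
width is still 8 bits.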