On Thu, 6 Aug 2009, Artur Skawina wrote:
>
> Does this make any difference for you? For me it's the best one so far
> (the linusas2 number clearly shows that for me the register renaming does
> nothing; other than that the functions should be very similar)

Nope. If anything, it's a bit slower, but it might be in the noise. I
generally got 330MB/s with my "cpp renaming" on Nehalem (32-bit - the
64-bit numbers are ~400MB/s), but with this I got 325MB/s twice in a row,
which matches the linusas2 numbers pretty exactly.

But it seems to make a big difference for you.

Btw, _what_ P4 do you have (Northwood or Prescott)?

The Intel optimization manuals very much talk about avoiding rotates. And
they mention "with a CPUID signature corresponding to family 15 and model
encoding of 0, 1, or 2" specifically as being longer latency. That's
basically pre-Prescott P4, I think.

Anyway, on P4 I think you have two double-speed integer issue ports (ie
max four ops per cycle), but only one of them takes a rotate, and only in
the first half of the cycle (ie just one shift per cycle).

And afaik, that is actually the _improved_ state in Prescott. The older
P4's didn't have a full shifter unit at all, iirc: shifts were "complex
instructions" in Northwood and weren't even single-clock.

In Core 2, I think there's still just one shifter unit, but at least it's
as fast as all the other units. So P4 really does stand out as sucking as
far as shifts are concerned, and if you have an older P4, it will be even
worse.

		Linus