On Thu, 6 Aug 2009, Linus Torvalds wrote:
>
> It was prescott that changed a lot (mostly for the worse - the shifter was
> one of the few upsides of prescott, although increased frequency often
> made up for the downsides).

Anyway, since you have a Northwood, I bet that the #1 issue for you is to
spread out the shift instructions in a way that simply doesn't matter
anywhere else.

In NetBurst, if I remember the details correctly, a "complex instruction"
will basically get its trace cache entries from the microcode ROMs. I'm not
sure how it interacts with the TC entries around it, but it's entirely
possible that it basically disables any instruction scheduling (the
microcode traces are presumably "pre-scheduled"), so you'd basically see
stalls where there's little out-of-order execution.

That then explains why you see huge differences from what are basically
trivial scheduling decisions, and why some random placement of a shift
makes a big difference.

Just out of curiosity, does anything change if you change the

	B = SHA_ROR(B,2)

into a

	B = SHA_ROR(SHA_ROR(B,1),1)

instead? It's very possible that it becomes _much_ worse, but I guess it's
also possible in theory that a single-bit rotate ends up being a simple
instruction and that doing two single-bit ROR's is actually faster than
one 2-bit ROR (assuming the second one is microcoded and the first one
isn't).

In particular, I'm thinking about the warning in the Intel optimization
manual:

	The rotate by immediate and rotate by register instructions are
	more expensive than a shift. The rotate by 1 instruction has the
	same latency as a shift.

so it's very possible that "rotate by 1" is much better than other
rotates.

		Linus
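For reference, the two forms being compared can be sketched in plain C as below. This is only an illustration: sha_ror(), ror2_single() and ror2_double() are hypothetical names, and the portable shift-and-or rotate is an assumed stand-in for SHA_ROR, which may well be implemented differently (for example with inline asm). The sketch shows only that the two forms compute the same value; whether either one is faster on a given CPU has to be measured.

```c
#include <stdint.h>

/*
 * Hypothetical portable stand-in for SHA_ROR (an assumption, not the
 * real macro). n must be in the range 1..31 so that neither shift
 * count is 0 or 32.
 */
static inline uint32_t sha_ror(uint32_t x, unsigned int n)
{
	return (x >> n) | (x << (32 - n));
}

/* One rotate right by 2: the "B = SHA_ROR(B,2)" form. */
static inline uint32_t ror2_single(uint32_t b)
{
	return sha_ror(b, 2);
}

/* Two rotates right by 1: the "B = SHA_ROR(SHA_ROR(B,1),1)" form. */
static inline uint32_t ror2_double(uint32_t b)
{
	return sha_ror(sha_ror(b, 1), 1);
}
```

Both functions return the same result for every 32-bit input, so swapping one for the other changes only which rotate instructions the compiler is likely to emit, not the hash.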