On Thu, 6 Aug 2009, Linus Torvalds wrote:
>
> In particular, I'm thinking about the warning in the Intel optimization
> manual:
>
>     The rotate by immediate and rotate by register instructions are
>     more expensive than a shift. The rotate by 1 instruction has the
>     same latency as a shift.
>
> so it's very possible that "rotate by 1" is much better than other
> rotates.

Hmm. Probably not. Googling more seems to indicate that rotates and shifts
have a fixed 4-cycle latency on Northwood. I'm not seeing anything that
indicates that a single-bit rotate/shift would be any faster.

(And remember, if 4 cycles doesn't sound so bad: that's enough of a
latency to do _16_ "simple" ALU ops, since they can be double-pumped in
the two regular ALUs).

I think long-running ALU ops that feed into a store (spill) also happen to
be the thing that makes the dreaded store-buffer replay trap nasties
happen more (load vs store scheduled badly, and then you end up spending
tens of cycles just replaying).

		Linus
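
For reference, here is a rough C sketch of the two things discussed above: the rotate itself, and the "_16_ simple ops" arithmetic. The names rol32() and simple_ops_in_shadow() are made up for illustration and are not taken from any existing code. The numbers fed into the second helper (4-cycle latency, two ALUs, two ops per cycle each) are just the ones quoted in the message, not measurements.

```c
#include <stdint.h>

/*
 * Rotate a 32-bit word left by n bits (hypothetical helper).
 * Both shift counts are masked into 0..31, so n == 0 and n == 32
 * are well defined. Compilers commonly turn this idiom into a single
 * rotate instruction, but that is not guaranteed.
 */
static inline uint32_t rol32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/*
 * Upper bound on how many simple ALU ops could issue while one
 * long-latency op is in flight, assuming every ALU slot is filled.
 * Purely arithmetic: latency * number of ALUs * ops per cycle per ALU.
 */
static inline unsigned simple_ops_in_shadow(unsigned latency_cycles,
                                            unsigned alus,
                                            unsigned ops_per_cycle_per_alu)
{
    return latency_cycles * alus * ops_per_cycle_per_alu;
}
```

With the figures from the message, simple_ops_in_shadow(4, 2, 2) gives 16. That is the cost being compared against a single rol32(x, 1) or rol32(x, 5) if the rotate really does take 4 cycles regardless of the count.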