Linus Torvalds wrote:
>
> On Thu, 6 Aug 2009, Linus Torvalds wrote:
>> In particular, I'm thinking about the warning in the intel optimization
>> manual:
>>
>>    The rotate by immediate and rotate by register instructions are
>>    more expensive than a shift. The rotate by 1 instruction has the
>>    same latency as a shift.
>>
>> so it's very possible that "rotate by 1" is much better than other
>> rotates.
>
> Hmm. Probably not. Googling more seems to indicate that rotates and shifts
> have a fixed 4-cycle latency on Northwood. I'm not seeing anything that
> indicates that a single-bit rotate/shift would be any faster.
>
> (And remember, if 4 cycles doesn't sound so bad: that's enough of a
> latency to do _16_ "simple" ALU's, since they can be double-pumped in the
> two regular ALU's).

Looking at the generated code, there is a lot of ro[rl] movement, so it's
likely that this contributes to the problem. I also see 44 extra lea
instructions, 44 fewer adds, and changes like:

[...]
  mov XX(%eRX),%eRX
  xor XX(%eRX),%eRX
- and %eRX,%eRX
+ and XX(%eRX),%eRX
  xor XX(%eRX),%eRX
- add %eRX,%eRX
- ror $0x2,%eRX
- mov %eRX,XX(%eRX)
+ lea (%eRX,%eRX,1),%eRX
  mov XX(%eRX),%eRX
  bswap %eRX
  mov %eRX,XX(%eRX)
  mov %eRX,%eRX
+ ror $0x2,%eRX
+ mov %eRX,XX(%eRX)
+ mov %eRX,%eRX
  rol $0x5,%eRX
  mov XX(%eRX),%eRX
- mov XX(%eRX),%eRX
[...]

which could mean that gcc did a better job of register allocation (where
"better job" might be just luck).

artur
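
For readers following the diff: the rol $0x5 / ror $0x2 / bswap pattern
looks like a SHA-1 round, so here is a rough C sketch of where those
rotates would come from. This is an assumption about the code being
compiled, not a copy of it; the names rol32, ror32, sha1_state and
sha1_round_ch are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left by n bits, 0 < n < 32. Compilers usually recognise this
 * shift/or idiom and emit a single rol (or the equivalent ror). */
static inline uint32_t rol32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);   /* a shift by 32 would be undefined */
    return (x << n) | (x >> (32 - n));
}

/* Rotate right by n bits, 0 < n < 32. */
static inline uint32_t ror32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x >> n) | (x << (32 - n));
}

/* Hypothetical container for the five SHA-1 working variables. */
struct sha1_state { uint32_t a, b, c, d, e; };

/* One round from the first group (rounds 0..19 in the standard
 * algorithm), given the already byte-swapped message word w. */
static inline void sha1_round_ch(struct sha1_state *s, uint32_t w)
{
    /* Ch(b,c,d) written as xor/and/xor, matching the xor, and, xor
     * sequence visible in the diff. */
    uint32_t f = ((s->c ^ s->d) & s->b) ^ s->d;

    /* The rotate by 5 sits on the critical path of every round. */
    uint32_t t = rol32(s->a, 5) + f + s->e + w + 0x5a827999u;

    s->e = s->d;
    s->d = s->c;
    s->c = ror32(s->b, 2);     /* same value as rol32(b, 30) */
    s->b = s->a;
    s->a = t;
}
```

Each round has two rotates and a chain of additions. lea
(%eRX,%eRX,1),%eRX adds two registers without touching the flags, which
would explain gcc trading adds for leas when it schedules those sums
differently.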