Re: [PATCH 0/7] block-sha1: improved SHA1 hashing

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Thu, 6 Aug 2009 17:13:04 -0700 (PDT)

On Thu, 6 Aug 2009, Linus Torvalds wrote:
> 
> In particular, I'm thinking about the warning in the Intel optimization 
> manual:
> 
> 	The rotate by immediate and rotate by register instructions are 
> 	more expensive than a shift. The rotate by 1 instruction has the 
> 	same latency as a shift.
> 
> so it's very possible that "rotate by 1" is much better than other 
> rotates.

Hmm. Probably not. Googling more seems to indicate that rotates and shifts 
have a fixed 4-cycle latency on Northwood. I'm not seeing anything that 
indicates that a single-bit rotate/shift would be any faster.
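
For reference, a minimal sketch of the rotates in question. The names
rol32 and sha1_mix are hypothetical and not taken from the patch series.
SHA-1 uses a rotate by 1 in the message schedule and rotates by 5 and 30
in the rounds, which is why the relative cost of the different rotate
counts matters here:

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch series): rotate a 32-bit
 * word left by n bits. Only valid for 0 < n < 32. Compilers commonly
 * recognize this shift/or pattern and emit a single rotate instruction,
 * but that is up to the compiler.
 */
static inline uint32_t rol32(uint32_t x, unsigned n)
{
	return (x << n) | (x >> (32 - n));
}

/*
 * Hypothetical helper: one step of the SHA-1 message schedule,
 * W[t] = ROL1(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16]).
 * The caller passes the four earlier schedule words.
 */
static inline uint32_t sha1_mix(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
	return rol32(a ^ b ^ c ^ d, 1);
}
```

With a fixed latency for all rotates, rol32(x, 1) would cost the same as
rol32(x, 5) or rol32(x, 30), so there would be nothing to gain from
special-casing the single-bit case.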

(And remember, if 4 cycles doesn't sound so bad: that's enough latency to 
do _16_ "simple" ALU ops, since the two regular ALUs are double-pumped: 
4 cycles x 2 ALUs x 2 ops per cycle).

I think long-running ALU ops that feed into a store (spill) also happen to 
be the thing that makes the dreaded store-buffer replay trap nasties 
happen more (load vs store scheduled badly, and then you end up spending 
tens of cycles just replaying).

			Linus