Re: [PATCH 0/7] block-sha1: improved SHA1 hashing

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Thu, 6 Aug 2009 16:25:10 -0700 (PDT)

On Thu, 6 Aug 2009, Linus Torvalds wrote:
> 
> It was prescott that changed a lot (mostly for the worse - the shifter was 
> one of the few upsides of prescott, although increased frequency often 
> made up for the downsides).

Anyway, since you have a Northwood, I bet that the #1 issue for you is to 
spread out the shift instructions in a way that simply doesn't matter 
anywhere else.

In netburst, if I remember the details correctly, a "complex instruction" 
will basically get the trace cache from the microcode roms. I'm not sure 
how it interacts with the TC entries around it, but it's entirely possible 
that it basically disables any instruction scheduling (the microcode 
traces are presumably "pre-scheduled"), so you'd basically see stalls 
where there's little out-of-order execution.

That then explains why you see huge differences from what is basically 
trivial scheduling decisions, and why some random placement of a shift 
makes a big difference.

Just out of curiosity, does anything change if you change the

	B = SHA_ROR(B,2)

into a

	B = SHA_ROR(SHA_ROR(B,1),1)

instead? It's very possible that it becomes _much_ worse, but I guess it's 
also possible in theory that a single-bit rotate ends up being a simple 
instruction and that doing two single-bit ROR's is actually faster than 
one 2-bit ROR (assuming the 2-bit one is microcoded and the single-bit 
one isn't).

In particular, I'm thinking about the warning in the Intel optimization 
manual:

	The rotate by immediate and rotate by register instructions are 
	more expensive than a shift. The rotate by 1 instruction has the 
	same latency as a shift.

so it's very possible that "rotate by 1" is much better than other 
rotates.
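
For anyone wanting to try this outside the tree, here is a minimal sketch of the two variants. It assumes a portable C definition of SHA_ROR (the ror32() helper and the function names are made up for illustration; the real block-sha1 code may well use inline asm instead), and it only checks that the two forms compute the same value, not which one is faster.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portable rotate-right; stands in for the real SHA_ROR. */
static inline uint32_t ror32(uint32_t x, unsigned n)
{
	n &= 31;
	/* Mask the left-shift count so n == 0 never shifts by 32. */
	return (x >> n) | (x << ((32 - n) & 31));
}

#define SHA_ROR(x, n) ror32((x), (n))

/* Variant 1: one rotate by 2. */
uint32_t ror2_single(uint32_t b)
{
	return SHA_ROR(b, 2);
}

/* Variant 2: two rotates by 1. */
uint32_t ror2_double(uint32_t b)
{
	return SHA_ROR(SHA_ROR(b, 1), 1);
}

/* Sanity check: both variants must agree on a few sample inputs. */
void check_ror_variants(void)
{
	static const uint32_t samples[] = {
		0, 1, 0x80000000u, 0xdeadbeefu, 0xffffffffu
	};
	unsigned i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		assert(ror2_single(samples[i]) == ror2_double(samples[i]));
}
```

Note that an optimizing compiler may merge the two single-bit rotates back into a single rotate by 2, so it is worth looking at the generated assembly before drawing any timing conclusions from this.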

			Linus