Re: Linus' sha1 is much faster!

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Sun, 16 Aug 2009 15:47:08 -0700 (PDT)

On Mon, 17 Aug 2009, Giuseppe Scrivano wrote:
> 
> Thanks for the hint.  I tried gcc-4.4 and it produces slower code than
> 4.3 on the gnulib SHA1 implementation and my patch makes it even more!

Check out the asm, see if you can see why. One of the most common problems 
with P4's is literally that you end up loading from the same stack slot 
that you just stored to (gcc can do some really crazy spills), and that 
causes a store buffer hazard replay.

My personal opinion is that Netburst is useless for trying to optimize C 
code for. It's just too random.

> I noticed that on my machine your implementation is ~30-40% faster using
> SHA_ROT for rol/ror instructions than inline assembly, at least with the
> test-case Pádraig wrote.  Am I the only one reporting it?

I bet it's the same thing. Small perturbations of the source causing small 
changes to register allocation and thus spilling, and then Netburst goes 
crazy one way or another. It's interesting trying to fix it, and very 
frustrating.

My workstation is a Nehalem (but Core 2 will have pretty much the same 
behavior), and it doesn't have the crazy netburst behavior. Shorter and 
simpler code generally performs better (which is _not_ true on Netburst). 

On my machine, for example, forcing gcc to do those rotates on registers 
is the difference between ~381MB/s and 415MB/s. And that's mainly because 
it makes gcc keep A-E in registers, rather than trying to cache the 
array[] references.
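
For anyone following along, here is a rough sketch of what "forcing the
rotates onto registers" can look like. It is an illustration, not the
exact code being benchmarked: only the SHA_ROT name comes from this
thread, while SHA_ASM, SHA_ROL, SHA_ROR, rol_c and sha_rot_selfcheck are
names assumed for the example. It relies on GNU C statement expressions
and x86 inline asm, with a plain C fallback elsewhere.

```c
#include <assert.h>
#include <stdint.h>

/* Plain C rotate-left. The compiler may apply this to a value it has
 * decided to keep in memory, so it puts no pressure on allocation. */
static inline uint32_t rol_c(uint32_t x, unsigned n)
{
	n &= 31;
	return (x << n) | (x >> ((32 - n) & 31));
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* The "=r" output tied to the "0" input forces the operand into a
 * register. "i" requires a compile-time constant rotate count, which
 * is why this is a macro rather than a function. */
#define SHA_ASM(op, x, n) ({ \
	uint32_t res_; \
	__asm__(op " %1,%0" : "=r" (res_) : "i" (n), "0" (x)); \
	res_; })
#define SHA_ROL(x, n)	SHA_ASM("rol", x, n)
#define SHA_ROR(x, n)	SHA_ASM("ror", x, n)
#else
/* Assumed fallback for other compilers and architectures. */
#define SHA_ROL(x, n)	rol_c((x), (n))
#define SHA_ROR(x, n)	rol_c((x), 32 - (n))
#endif

/* Hypothetical self-check: both forms must compute the same rotation. */
static void sha_rot_selfcheck(void)
{
	uint32_t x = 0x80000001u;

	assert(SHA_ROL(x, 5) == rol_c(x, 5));
	assert(SHA_ROR(x, 2) == rol_c(x, 30));
}
```

Both variants compute the same value. Any difference in throughput would
come from the register allocation the asm constraints push gcc towards,
so it has to be measured per compiler version and per CPU.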

			Linus