On Mon, 17 Aug 2009, Giuseppe Scrivano wrote:
>
> Thanks for the hint. I tried gcc-4.4 and it produces slower code than
> 4.3 on the gnulib SHA1 implementation and my patch makes it even more!

Check out the asm, see if you can see why. One of the most common problems
with P4's is literally that you end up loading from the same stack slot that
you just stored to (gcc can do some really crazy spills), and that causes a
store buffer hazard replay.

My personal opinion is that Netburst is useless as a target for optimizing C
code. It's just too random.

> I noticed that on my machine your implementation is ~30-40% faster using
> SHA_ROT for rol/ror instructions than inline assembly, at least with the
> test-case Pádraig wrote. Am I the only one reporting it?

I bet it's the same thing. Small perturbations of the source cause small
changes to register allocation and thus spilling, and then Netburst goes
crazy one way or another.

It's interesting trying to fix it, and very frustrating.

My workstation is a Nehalem (but Core 2 will have pretty much the same
behavior), and it doesn't have the crazy Netburst behavior. Shorter and
simpler code generally performs better (which is _not_ true on Netburst).

On my machine, for example, forcing gcc to do those rotates on registers is
the difference between ~381MB/s and 415MB/s. And that's mainly because it
makes gcc keep A-E in registers, rather than trying to cache the array[]
references.

		Linus
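For readers following along, the two rotate styles being compared above (a plain C shift expression versus rol/ror inline assembly) can be sketched roughly as below. This is only an illustration under assumptions: the names rol32_c, ror32_c, ROL32_ASM and ROR32_ASM are hypothetical and are not the macros from the patch under discussion. The "0" constraint ties the input to the output register, which is the kind of trick that forces the compiler to do the rotate on a register rather than on a memory operand.

```c
#include <stdint.h>

/* Portable rotates written as shifts; n must be in the range 1..31. */
static inline uint32_t rol32_c(uint32_t x, unsigned n)
{
	return (x << n) | (x >> (32 - n));
}

static inline uint32_t ror32_c(uint32_t x, unsigned n)
{
	return (x >> n) | (x << (32 - n));
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/*
 * Inline-asm rotates (assumed sketch, GCC-style x86 only).
 * n must be a compile-time constant because of the "i" constraint.
 * "=r" with the matching "0" input keeps the value in one register.
 */
#define ROT32_ASM(op, x, n) __extension__ ({			\
	uint32_t res_;						\
	__asm__(op " %1,%0"					\
		: "=r" (res_)					\
		: "i" (n), "0" ((uint32_t)(x))			\
		: "cc");					\
	res_; })
#define ROL32_ASM(x, n) ROT32_ASM("roll", x, n)
#define ROR32_ASM(x, n) ROT32_ASM("rorl", x, n)
#else
/* Fallback for other compilers or targets: use the C versions. */
#define ROL32_ASM(x, n) rol32_c((x), (n))
#define ROR32_ASM(x, n) ror32_c((x), (n))
#endif
```

Both forms compute the same value; which one is faster depends on how the compiler allocates registers around it, so comparing the generated asm (for example with gcc -O2 -S) is the only reliable way to tell on a given CPU.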