Linus Torvalds wrote:
>
> On Thu, 6 Aug 2009, Artur Skawina wrote:
>> For those curious just how close the C version is to the various
>> asm and C implementations, the q&d microbenchmark is at
>> http://www.src.multimo.pl/YDpqIo7Li27O0L0h/sha1bench.tar.gz
>
> Hmm. That thing doesn't work at all on x86-64. Even apart from the asm
> sources, your timing thing does some really odd things (why do you do that
> odd "iret" in GETCYCLES and GETTIME?). You're better off using
> lfence/mfence/cpuid, and I think you could make it work on 64-bit that
> way too.

yes, it's 32-bit only, i should have mentioned that.
The timing code was written more than a decade ago, it really works on p2,
haven't updated it, it's all just c&p'ed ever since. All of it can be safely
disabled; on p2 you could account for every cycle, nowadays gettimeofday is
more than enough.

> I just hacked it away for testing.
>
>> In short: 88% of openssl speed on P3, 42% on P4, 66% on Atom.
>
> I'll use this to see if I can improve the 32-bit case.
>
> On Nehalem, with your benchmark, I get:
>
> #            TIME[s] SPEED[MB/s]
> rfc3174        5.122       119.2
> # New hash result: d829b9e028e64840094ab6702f9acdf11bec3937
> rfc3174        5.153       118.5
> linus          2.092       291.8
> linusas        2.056       296.8
> linusas2       1.909       319.8
> mozilla        5.139       118.8
> mozillaas      5.775       105.7
> openssl        1.627       375.1
> spelvin        1.678       363.7
> spelvina       1.603       380.8
> nettle         1.592       383.4
>
> And with the hacked version to get some 64-bit numbers:
>
> #            TIME[s] SPEED[MB/s]
> rfc3174        3.992       152.9
> # New hash result: b78fd74c0033a4dfe0ededccb85ab00cb56880ab
> rfc3174        3.991       152.9
> linus          1.54        396.3
> linusas        1.533       398.1
> linusas2       1.603       380.9
> mozilla        4.352       140.3
> mozillaas      4.227       144.4
>
> so as you can see, your improvements in 32-bit mode are actually
> de-provements in 64-bit mode (ok, your first one seems to be a tiny
> improvement, but I think it's in the noise).

Actually i didn't keep anything that wasn't a win. One reason why linusas2
stayed was that it really surprised me: i'd have expected gcc to do a lot
worse w/ the many temporaries, and the compiler came up w/ a 70% gain; gcc
really must have improved when i wasn't looking.

> But you're right, I need to try to improve the 32-bit case.

I never said anything like that. :) There probably isn't all that much
that can be done. I tried a few things, but never saw any improvement
above measurement noise (a few percent). Would have thought that overlapping
the iterations a bit would be a gain, but that didn't do much (-20%..0),
maybe on 64 bit, with more registers...

Oh, i noticed that '-mtune' makes quite a difference; it can change the
relative performance of the functions significantly, in unobvious ways.
Depending on which cpu gcc tunes for (build config or -mtune), some
implementations slow down, others become a bit faster.

artur
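
For reference, the gettimeofday-based measurement mentioned above could look
roughly like the sketch below. This is not the code from sha1bench.tar.gz.
The names now_seconds, throughput_mb_per_s, time_hash and the hash_fn
signature are all made up for illustration, and "MB" is assumed here to mean
10^6 bytes, which may not match what the benchmark prints.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in seconds, taken from gettimeofday(). */
double now_seconds(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Throughput in MB/s; "MB" is assumed to be 10^6 bytes. */
double throughput_mb_per_s(size_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes / 1e6 / seconds;
}

/* Hypothetical signature for a SHA-1 implementation under test. */
typedef void (*hash_fn)(const unsigned char *data, size_t len,
                        unsigned char digest[20]);

/* Hash the same buffer `rounds` times and return the elapsed seconds. */
double time_hash(hash_fn fn, const unsigned char *buf, size_t len, int rounds)
{
    unsigned char digest[20];
    double start = now_seconds();
    int i;

    for (i = 0; i < rounds; i++)
        fn(buf, len, digest);
    return now_seconds() - start;
}
```

A caller would pass each implementation to time_hash() with the same buffer
and round count, then feed len * rounds and the returned time to
throughput_mb_per_s(). The run has to be long enough (seconds, as in the
tables above) for the microsecond resolution of gettimeofday not to matter.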